Common use of Interpreting Interrater Agreement Clause in Contracts

Interpreting Interrater Agreement. After calculating the value of Kappa, the next question is “how do we interpret it?” There are two general approaches for interpreting such measures. The first is by comparison with previously established baselines. However, given that there are no such baselines in software engineering, this approach is not feasible. The second approach is to establish some general benchmarks based on factors such as: what has been learned and accepted in other disciplines, experience within our own discipline, and our intuition. As a body of empirical knowledge is accumulated on software process assessments, we would evolve these benchmarks to take account of what has been learned. We therefore resorted to guidelines developed and accepted within other disciplines. To this end, ▇▇▇▇▇▇ and ▇▇▇▇ [13] have presented a table that is useful and commonly applied for benchmarking obtained values of Kappa. This is shown in Figure 7. ▇▇▇▇▇▇▇ [8] notes that while this table is arbitrary, it is still potentially useful for interpreting values of Kappa. In addition, we can test the hypothesis that the obtained value of Kappa meets a minimal requirement (following the procedure in [9]). The logic for a minimal benchmark requirement is that it should act as a good discriminator between assessments conducted with a reasonable amount of rigor and precision, and those where there was much misunderstanding and confusion about how to rate practices. It was thus deemed reasonable to require that agreement be at least moderate (i.e., Kappa > 0.4). Based on the results reported here and other studies already completed [5], this minimal value was perceived as a good discriminator. It should be cautioned, however, that the benchmark suggested above should be considered only an initial one. If, after further empirical study, it is found that this benchmark fails all SPICE processes, passes all of them, or passes processes that intuitively should fail (and vice versa), then the benchmark should be modified to strengthen or weaken the requirement.
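As an illustration (not taken from the source), the short Python sketch below computes Cohen's Kappa for two assessors' ratings and maps the result onto commonly cited Landis-and-Koch-style bands. Since Figure 7 itself is not reproduced in this text, the exact cut-points and the N/P/L/F example ratings are assumptions; only the Kappa > 0.4 “at least moderate” threshold comes from the clause above.

```python
# Minimal sketch (not the paper's code): Cohen's Kappa for two assessors'
# ratings, mapped onto commonly cited Landis-and-Koch-style bands. The exact
# cut-points of Figure 7 are assumed here, since the figure is not reproduced.
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's Kappa for two raters rating the same items."""
    n = len(ratings_a)
    # Observed agreement: proportion of items rated identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(freq_a) | set(freq_b))
    return (p_o - p_e) / (1 - p_e)

def interpret_kappa(kappa):
    """Benchmark bands in the spirit of Landis and Koch (assumed cut-points)."""
    if kappa < 0.0:
        return "Poor"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
                         (0.80, "Substantial"), (1.00, "Almost perfect")]:
        if kappa <= upper:
            return label
    return "Almost perfect"

# Hypothetical example: two assessors rating eight practices on the
# SPICE N/P/L/F adequacy scale.
a = ["F", "L", "L", "P", "F", "N", "L", "F"]
b = ["F", "L", "P", "P", "F", "N", "L", "L"]
k = cohen_kappa(a, b)
print(f"Kappa = {k:.2f} ({interpret_kappa(k)}); "
      f"meets minimal benchmark (Kappa > 0.4): {k > 0.4}")
```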

Appears in 2 contracts

Sources: Assessor Agreement, Assessor Agreement

Interpreting Interrater Agreement. After calculating the value of Kappa, the next question is “how do we interpret it?” There are two general approaches for interpreting such measures. The first is by comparison with previously established baselines. However, given that there are no precedents of interrater agreement studies in software engineering, this approach is not feasible. The second approach is to establish some general benchmarks based on factors such as: what has been learned and accepted in other disciplines, experience within our own discipline, and our intuition. As a body of empirical knowledge is accumulated on software process assessments, we would evolve these benchmarks to take account of what has been learned. We resort to the guidelines developed and accepted within other disciplines. To this end, ▇▇▇▇▇▇ and ▇▇▇▇ [11] have presented a table that is useful and commonly applied for benchmarking obtained values of Kappa. This is shown in Figure 6. In addition, we can test the hypothesis that the obtained value of Kappa meets a minimal requirement (following the procedure in [7]). The logic for a minimal benchmark requirement is that it should act as a good discriminator between assessments conducted with a reasonable amount of rigor and precision, and those where there was much misunderstanding and confusion about how to rate practices. It was thus deemed reasonable to require that agreement be at least moderate (i.e., Kappa > 0.4). This minimal value was perceived as a good discriminator. It should be cautioned, however, that the benchmarks suggested above should be considered only initial. If, after further empirical study, it is found that these benchmarks fail all SPICE processes, pass all of them, or pass processes that intuitively should fail (and vice versa), then the benchmarks should be modified to strengthen or weaken the requirement.
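The hypothesis test mentioned above (the cited procedure) is not reproduced in this text. The sketch below therefore uses a standard large-sample approximation for the standard error of Kappa to test, one-sided, whether an observed Kappa exceeds the 0.4 minimal benchmark; the example figures (50 rated practices, 78% observed and 30% chance agreement) are purely illustrative and may differ from the cited procedure's exact method.

```python
# Minimal sketch, not the cited procedure (which is not reproduced here):
# one-sided large-sample z-test of H0: Kappa <= 0.4 vs H1: Kappa > 0.4,
# using the simple approximation SE(K) = sqrt(Po*(1-Po)) / ((1-Pe)*sqrt(n)).
import math

def kappa_exceeds_benchmark(p_o, p_e, n, benchmark=0.4, alpha=0.05):
    """Test whether the observed Kappa significantly exceeds a minimal benchmark."""
    kappa = (p_o - p_e) / (1 - p_e)
    se = math.sqrt(p_o * (1 - p_o)) / ((1 - p_e) * math.sqrt(n))
    z = (kappa - benchmark) / se
    p_value = 0.5 * (1.0 - math.erf(z / math.sqrt(2)))  # one-sided normal tail
    return kappa, z, p_value, p_value < alpha

# Illustrative numbers only: 50 rated practices, 78% observed agreement,
# 30% agreement expected by chance.
kappa, z, p, ok = kappa_exceeds_benchmark(p_o=0.78, p_e=0.30, n=50)
print(f"Kappa = {kappa:.2f}, z = {z:.2f}, one-sided p = {p:.4f}, "
      f"exceeds 0.4 benchmark at alpha = 0.05: {ok}")
```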

Appears in 1 contract

Sources: Technical Report