Threats to Validity. A number of potential limitations, in the form of threats to the validity of our study, were considered: threats to internal validity and threats to external validity. In this study we focused on evaluating the reliability of assessments based on the SPICE framework. We implicitly assumed that reliability is only a function of the SPICE documents and architecture (e.g., the clarity of practice definitions, the soundness of the rating scheme, and the applicability of the two-dimensional architecture). Threats to internal validity would question this assumption.

One potential threat to internal validity is a maturation effect. In this study, a maturation effect would be indicated by a change in interrater agreement (as measured by Kappa) over the course of the assessment. For example, as the assessment progresses, assessors may become more fatigued and pay less attention to observing evidence and to making their ratings. This would tend to decrease the extent of interrater agreement as the assessment progresses. Conversely, assessors may gain knowledge of the organization and the way it implements its practices as time progresses. As assessors gather more evidence, they may start to converge in their perceptions of the capability of the organization's processes. This could lead to an increase in interrater agreement as the assessment progresses. If we find a maturation effect, then the values of Kappa that we obtained are also a function of when ratings are made during an assessment.

To determine whether there was a maturation effect, we conducted a number of post-hoc tests. The assessment ratings were made over a 2.5-day period (the whole assessment was longer, since it included an initial meeting with management and a closing session where findings were presented). Evidence on nine processes was inspected and ratings were made in the first 1.5 days of the assessment; these were classified as early processes. The remaining six processes were rated on the final day; these were classified as late processes. We tested for differences in the values of Kappa between these two groups using the Mann-Whitney U test [22], a two-tailed test at an alpha level of 0.1. No differences were found, and hence there is no evidence that the median Kappa values of the two groups differed.

Three different external assessors and five different internal assessors took part in the assessment. The distribution of assessors over time was not uniform, and therefore a maturation effect may be occurring at a different rate for different assessors. For example, one internal assessor took part in assessing only one late process, and another took part only in assessing early processes; for these two assessors there is no maturation effect. An alternative way of measuring progress through the assessment is the number of processes assessed thus far by the assessors, instead of elapsed time. We calculated the robust Spearman rho coefficient [22] between the number of processes assessed thus far and Kappa. This was done for the internal assessor only, for the external assessor only, and for the sum of the number of processes assessed thus far by both the internal and external assessor. The rho coefficient was not statistically significant using a two-tailed test at an alpha level of 0.1. Therefore, we could not find evidence of a maturation effect.⁵
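As an illustration of these two post-hoc checks, the following sketch applies a two-sided Mann-Whitney U test and a Spearman rank correlation with scipy. The Kappa values and progress counts are hypothetical placeholders, not the study's data.

```python
# Minimal sketch of the two post-hoc maturation checks; the Kappa
# values below are hypothetical, not the study's per-process values.
from scipy.stats import mannwhitneyu, spearmanr

# Hypothetical per-process Kappa values (9 early, 6 late processes).
kappa_early = [0.45, 0.52, 0.38, 0.61, 0.49, 0.55, 0.42, 0.58, 0.47]
kappa_late = [0.50, 0.44, 0.57, 0.41, 0.53, 0.48]

# Two-sided Mann-Whitney U test for a difference in median Kappa
# between early and late processes (alpha = 0.1).
u_stat, p_value = mannwhitneyu(kappa_early, kappa_late, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")
if p_value >= 0.1:
    print("No evidence that median Kappa differs between the groups.")

# Spearman rho between assessment progress (number of processes
# assessed so far, a hypothetical count) and the Kappa obtained.
processes_so_far = list(range(1, 16))
kappas = kappa_early + kappa_late  # in assessment order
rho, p_rho = spearmanr(processes_so_far, kappas)
print(f"Spearman rho = {rho:.2f}, p = {p_rho:.3f}")
```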
Another potential threat to validity is a selection effect. Where high disagreement was found, differences in capability between the internal and the external assessor may explain the disagreement. External assessors tend to have experience with a variety of different organizations, and hence more knowledge of different ways of implementing SPICE processes; they also tend to have more experience with assessments. We attempted to counteract this by giving the internal assessors a five-day course on SPICE and on process assessments. Internal assessors tend to have more knowledge of the organization's business, needs, and constraints. However, knowledge of the organization is not considered a prerequisite in the qualification guidance for SPICE assessors [15]. In terms of general and software education, software training, and software experience, no discernible differences between the internal and external assessors were found.

⁵ Note that we performed a post-hoc power analysis of these results. Statistical power is the probability that a statistical test will correctly reject the null hypothesis (in this case, that the correlation coefficient is zero). We found that the power of the statistical test for these correlations was less than 30%, using the tables in [19]. This is a low power level; the statistical test used was therefore not powerful enough to detect a maturation effect of the size found in our study. The small sample size is a major contributor to the low power level witnessed here. Similar evaluations using the Pearson correlation, after removal of an outlier observation, do not change the general conclusions.
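For readers who want to reproduce this kind of power calculation in code rather than from published tables (the footnote relies on the tables in [19]), the following sketch approximates the power of a two-sided test of a zero correlation via the Fisher z transformation. The assumed true correlation r = 0.3 is an illustrative effect size, not a value reported in the study.

```python
# Approximate post-hoc power of a two-sided correlation test using
# the Fisher z transformation (an approximation; the study itself
# used published power tables [19]).
import math
from scipy.stats import norm

def correlation_test_power(r: float, n: int, alpha: float = 0.1) -> float:
    """Approximate power of a two-sided test of H0: rho = 0,
    for an assumed true correlation r and sample size n."""
    z_r = math.atanh(r)               # Fisher z of the effect size
    se = 1.0 / math.sqrt(n - 3)       # standard error of z
    z_crit = norm.ppf(1 - alpha / 2)  # two-sided critical value
    # Probability of rejecting H0 when the true correlation is r.
    return norm.sf(z_crit - z_r / se) + norm.cdf(-z_crit - z_r / se)

# With 15 processes, even a moderate assumed correlation yields
# low power (roughly 0.29 here), consistent with the footnote's
# observation that power was below 30%.
print(f"power ≈ {correlation_test_power(r=0.3, n=15):.2f}")
```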
