Dependent Variables. Study 2, Individual Level, Classification Consistency

For individuals, we addressed the classical statistical question whether a particular sample result is repeatable in a new sample. For measurement, repeatability refers to re-administering the same test to the same individuals a great number of times and then determining to what degree the same conclusion about an individual was drawn. Figure 1 shows two propensity distributions (the small distributions) and the corresponding proportions p of correct classification: rejection for person A and selection for person B. Obviously, proportion p is larger as someone’s true score is located further away from the cut score, and the question is whether short tests produce too many small p values; that is, too many applicants on either side of the cut score who would be misclassified too often based on repeated observation of their test performance. ▇▇▇▇▇ and colleagues (2007) suggested that organizations set a lower bound to p, so that they make explicit what they consider the minimally acceptable certainty required for decisions about an individual. For example, an organization could require p to be at least .9, meaning that upon repetition an individual may not be misclassified more often than in 10% of the test repetitions. Thus, for the propensity distribution on the right in Figure 1 the unshaded area must cover no more than 10% of the total area. The quality of a particular personnel-selection scenario and a particular test or test battery may be measured by the proportion of unsuited individuals (for whom T < XC) that have p values of .9 or higher; this is called classification consistency and denoted CC− (the minus sign refers to the rejection area). Likewise, the classification consistency CC+ for the selection area can be determined. Large CC values suggest that the personnel-selection scenario and the test or test battery used produce certainty about individual decisions for many applicants. The question is whether CC− and CC+ are large enough when short tests are used. We studied CC− and CC+ for lower-bound values p = .7 and p = .9.

▇▇▇▇▇▇▇▇▇ and ▇▇▇▇▇▇ (1997) and ▇▇▇▇▇▇▇ and ▇▇▇▇▇▇ (2002) discussed similar approaches to rejection/selection problems but used different outcome measures; see Emons and colleagues (2007) for a critical discussion of their approaches. The latter authors concluded that the classification-consistency measures CC− and CC+ are more comprehensive than alternative measures because they consider the whole propensity distribution.

Data Generation and Computational Details

The 2-parameter logistic model (▇▇▇▇▇▇▇▇▇ & ▇▇▇▇▇, 2000, pp. 51–53) was used to simulate dichotomous item scores, and the graded response model (▇▇▇▇▇▇▇▇▇ & ▇▇▇▇▇, 2000, pp. 97–102; ▇▇▇▇▇▇▇▇, 1969) was used to simulate rating-scale item scores under controlled and realistic conditions. For this purpose, for each single-test scenario we drew 1,000 latent-variable (denoted θ) values from a standard normal distribution. Test batteries consisted of five tests, each measuring a different θ. For each applicant we drew 5 θ values from a multivariate standard-normal distribution. To obtain test scores that correlated less than .2, the θs correlated .2.
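To make the data-generation step concrete, the following minimal sketch (not the authors' code; the function name simulate_2pl and the random seed are ours) simulates dichotomous item scores under the 2-parameter logistic model, drawing 1,000 θ values from a standard normal distribution for a single test and five correlated θ values per applicant for a test battery. The item-parameter choices anticipate the next paragraph.

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed

def simulate_2pl(theta, a, b, rng):
    """Dichotomous item scores under the 2PL: P(X_ij = 1 | theta_i) = 1 / (1 + exp(-a_j (theta_i - b_j)))."""
    prob = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
    return rng.binomial(1, prob)

n_persons, n_items = 1000, 40
b = np.linspace(-1.5, 1.5, n_items)   # item difficulties equidistant on [-1.5; 1.5]
a = rng.uniform(0.5, 2.0, n_items)    # item discriminations drawn from a uniform distribution on [.5; 2]

# Single-test scenario: 1,000 theta values from a standard normal distribution.
theta = rng.standard_normal(n_persons)
X = simulate_2pl(theta, a, b, rng)    # 1,000 x 40 item-score matrix
x_plus = X.sum(axis=1)                # test score X+

# Test-battery scenarios: five theta values per applicant, correlating .2.
corr = np.full((5, 5), 0.2)
np.fill_diagonal(corr, 1.0)
theta_battery = rng.multivariate_normal(np.zeros(5), corr, size=n_persons)
```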
For dichotomous 40-item tests, item difficulties were equidistant on [−1.5; 1.5] (Figure 2), and for rating-scale 40-item tests, mean item difficulties were equidistant on [−1.5; 1.5]. Rating-scale items had step difficulties located at −1.75, −.75, .75, and 1.75 units distance relative to the mean item difficulty. Item discrimination parameters were drawn randomly from a uniform distribution on [.5; 2], which is typical of applied research (e.g., ▇▇▇▇▇▇▇ & ▇▇▇▇▇▇▇, 1988; ▇▇▇▇▇, ▇▇▇, & ▇▇▇▇▇, 1993; ▇▇▇▇▇▇▇▇▇ & ▇▇▇▇▇, 2000, pp. 69, 101; ▇▇▇▇▇▇, ▇▇▇▇▇▇, & ▇▇▇▇▇▇▇, 2000; ▇▇▇▇▇▇▇ & Janosky, 1991) and was recommended by the reviewers. Combined with a standard normal θ, these discrimination values produced corrected mean item-total correlations equal to .41 and coefficient alpha equal to .90 for 40 items, approximately .70 for 10 items, and ranging from .50 to .55 for 5 items, depending on the selection scenario. As alphas less than .70 are deemed too low for practical use (Charter, 2003), we did additional simulations with less realistic (that is, too high) item-discrimination parameters and found that for a 5-item test these parameters must be at least 1.6 to 2.9 to obtain alpha values of at least .70. Thus, realistic item discrimination yielded low alphas and unrealistically high item discrimination produced alphas acceptable in test practice. We decided to use realistic discrimination values, as the reviewers recommended, but also studied parameter choices producing acceptable alphas.

Figure 2. Latent variable distribution (θ) and location of item difficulties (δj) for a test consisting of 40 dichotomous items.

For test batteries (Scenarios 3, 4, and 5), we assumed that the five tests were replicates of one another. For each of the 150 design cells, 1,000 data sets were simulated, each time using a fresh sample of 1,000 θs. The complete design was replicated 20 times, each time using a different set of randomly drawn item discrimination parameters.

We transformed latent variable θ to true score T. True scores were used for error-free selection, representing the ideal outcome, and test score X+ was used to reject and select applicants, representing realistic selection involving measurement error. Cut scores were determined as follows. For single-test Scenario 2, we first determined a continuous true-score cutoff value, denoted TC, equal to the percentile rank that corresponded to the base rate. In practice, the integer-valued test score XC is used rather than the continuous TC; hence, TC was rounded to the integer value XC that best left the base rate intact. For cut-score selection using multiple tests and a disjunctive scoring rule (Scenario 4), we used the multivariate distribution of θ to derive the multivariate T distribution. The cut scores were five equal T values, chosen such that the probability that a randomly selected applicant had T values in excess of the corresponding cut score equaled the base rate.

For top-down selection using score profiles (Scenario 5), the ideal score profile equaled the five mean θ scores, which were transformed to true scores and test scores in order to have ideal scores for T and X+. Similarity index D2 was used for selection (Cronbach & Gleser, 1953; ▇▇▇▇▇▇▇, 1993). Index D2 is the sum of the squared differences between an applicant’s test scores and the ideal test scores. Applicants with the smallest D2 values were selected until the required proportion of applicants was obtained.
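As a concrete illustration of the Scenario 5 selection rule (a sketch under the assumptions above, not the authors' implementation; the function names and the example base rate are ours), D2 and the top-down selection can be computed as follows:

```python
import numpy as np

def d_squared(scores, ideal):
    """Sum of squared differences between each applicant's five test scores and the ideal profile."""
    return ((scores - ideal) ** 2).sum(axis=1)

def top_down_select(scores, ideal, base_rate):
    """Select the applicants with the smallest D2 until the required proportion is reached."""
    d2 = d_squared(scores, ideal)
    n_select = int(round(base_rate * scores.shape[0]))
    selected = np.zeros(scores.shape[0], dtype=bool)
    selected[np.argsort(d2)[:n_select]] = True
    return selected

# Example usage: scores is a (1000, 5) array of test scores X+ on the five tests,
# ideal is the ideal test-score profile, and (for illustration) 20% are selected.
# selected = top_down_select(scores, ideal, base_rate=0.20)
```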
For each applicant, given their θ value, the error-laden test scores have a distribution denoted P(X+ | θ); this conditional distribution is needed to determine the person-specific correct-classification probability p. The conditional test-score distributions follow a compound binomial (dichotomous items) or compound multinomial (rating-scale items) distribution that can be generated using a recursion algorithm (Lord & ▇▇▇▇▇▇▇▇▇, 1984; ▇▇▇▇▇▇▇ et al., 1995). The algorithm could not be used to compute the correct-classification probability for the continuously distributed D2 (Scenario 5). Here, we simulated 1,000 test scores for each applicant, keeping θ fixed. The correct-classification probability was the proportion of correct classifications given the applicant’s θ value. CC+ is the proportion of respondents for whom T > TC and p ≥ .9. Other classification indices were computed likewise.
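The sketch below illustrates the compound-binomial case for dichotomous items under the 2PL: the recursion builds P(X+ | θ) one item at a time, p is the appropriate tail probability at the cut score, and CC+ is read as the proportion of suited applicants (T > TC) whose p reaches the lower bound. This is our own illustration under stated assumptions (selection when X+ ≥ XC; function names are ours), not the authors' code.

```python
import numpy as np

def conditional_score_dist(theta_i, a, b):
    """P(X+ = s | theta_i) for s = 0..n_items, built up one item at a time (compound binomial)."""
    p_item = 1.0 / (1.0 + np.exp(-a * (theta_i - b)))   # per-item success probabilities under the 2PL
    dist = np.array([1.0])                               # distribution of the sum over 0 items
    for p in p_item:
        new = np.zeros(dist.size + 1)
        new[:-1] += dist * (1.0 - p)                     # item answered incorrectly: sum unchanged
        new[1:]  += dist * p                             # item answered correctly: sum increases by 1
        dist = new
    return dist                                          # probabilities sum to 1

def correct_classification_prob(theta_i, true_score_i, a, b, x_cut, t_cut):
    """p = probability that the decision based on X+ agrees with the error-free decision based on T."""
    dist = conditional_score_dist(theta_i, a, b)
    p_select = dist[x_cut:].sum()                        # P(X+ >= X_C | theta_i), assuming selection at X+ >= X_C
    return p_select if true_score_i > t_cut else 1.0 - p_select

def cc_plus(theta, T, a, b, x_cut, t_cut, lower_bound=0.9):
    """Proportion of suited applicants (T > T_C) whose correct-classification probability is >= the lower bound."""
    suited = T > t_cut
    p = np.array([correct_classification_prob(th, t, a, b, x_cut, t_cut)
                  for th, t in zip(theta, T)])
    return np.mean(p[suited] >= lower_bound)
```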