Kenneth G. Brown, Huy Le and Frank L. Schmidt
University of Iowa
International Journal of Selection and Assessment, Volume 14, Number 2, June 2006
There has been controversy over the years about whether specific mental abilities increment validity for predicting performance above and beyond the validity of general mental ability (GMA). Despite its appeal, specific aptitude theory has received only sporadic empirical support. Using more exact statistical and measurement methods and a larger data set than previous studies, this study provides further evidence that specific aptitude theory is not tenable with regard to training performance. Across 10 jobs, differential weighting of specific aptitudes and of specific aptitude tests was found not to improve the prediction of training performance over the validity of GMA. Implications of this finding for training research and practice are discussed.
Training is essential in today’s work organizations to help employees keep pace with rapid changes in the social, legal, and technical environments (Callanan & Greenhaus, 1999; Salas & Cannon-Bowers, 2001). From the organization’s perspective, training is an investment in employees, so understanding which employees benefit most from training is critically important. Research on this question has focused on many different trainee characteristics (Colquitt, LePine, & Noe, 2000; Noe, 1986), but the largest effects have been for general mental ability (GMA). GMA is often called intelligence and it is the common factor underlying performance on all mental ability tests (Jensen, 1998). Over the past 10 years there has been substantial theoretical and empirical progress in the study of GMA, and it is considered by many to be the best validated individual difference construct in psychology (Lubinski, 2000; Schmidt, 2002).
There has been some controversy over the years about whether specific mental abilities, measured by the tests that are used as indicators of GMA, are useful for predicting performance above and beyond the general factor (Ree, Earles, & Teachout, 1994). Many authors have proposed that differential weighting (such as via regression) of specific ability tests should yield better prediction of job and training performance than measures of GMA. This hypothesis is referred to as specific aptitude theory or differential aptitude theory, and it has been around for quite some time (Hull, 1928; Thurstone, 1938). Examples of it in practice and research are easy to provide: a trainer who believes that results from a spatial ability test would predict performance in a computer-aided design course better than GMA, or that results from a vocabulary test would predict performance in a communication course better than GMA, subscribes to specific aptitude theory (see Schmidt, 2002, for selection-related examples). Researchers who subscribe to this theory use specific aptitude measures, such as quantitative, verbal, or spatial ability tests, to predict performance criteria. They may also use differentially weighted combinations of specific aptitude tests that are weighted to match the expected ability demands of the job being studied (e.g., Hedge, Carter, Borman, Monzon, & Foley, 1992). As one published example of this practice, Mumford, Weeks, Harding, and Fleishman (1988) used tailored mental ability test composites (specific combinations of ability subtests weighted to match job requirements), rather than a GMA score, to predict training grades.
Although specific aptitude theory continues to be used in research and practice, there has been little empirical support for it. Prior large sample research suggests that, for both training and job performance, weighted combinations of specific cognitive aptitudes explain little if any variance beyond GMA (Hunter, 1986; McHenry, Hough, Toquam, Hanson, & Ashworth, 1990; Ree & Earles, 1991; Ree et al., 1994; Schmidt, 2002). Moreover, a meta-analytic comparison between GMA and specific aptitude tests revealed that validities for GMA are always higher than for specific aptitudes (Salgado, Anderson, Moscoso, Bertua, & de Fruyt, 2003a).
However, the prior research testing specific aptitude theory has limitations that the present study circumvents. Some studies using training performance as a dependent variable are limited in that they average across job families when estimating the relative validities of GMA and specific aptitudes (e.g., Hunter, 1986). This procedure may make it less likely to find effects for specific aptitudes, as the ability demands of training programs may differ across jobs that are grouped together in large job families. Other studies on training performance are limited because they do not correct fully for measurement error (e.g., Ree & Earles, 1991). Failing to correct for measurement error leads to biased multiple correlations and regression weights, and, potentially, erroneous conclusions about construct-level relationships (Hunter & Schmidt, 2004; Schmidt, Hunter, & Caplan, 1981).
There is research on specific aptitude theory using job performance as the criterion, but these studies also have limitations. First, these studies also do not fully correct for measurement error (e.g., McHenry et al., 1990; Ree et al., 1994). In addition, they use small samples within job family (Ree et al., 1994) or present results only for large job families (McHenry et al., 1990). Reliance on small samples increases the likelihood that results are distorted by sampling error. Finally, the generalizability of findings from job performance to training performance should not be assumed.
The purpose of this study is to examine specific aptitude theory with regard to training performance. This study builds on prior studies by using a larger data set and more exact statistical methods. The data set includes 10 large sample training schools in the Navy, with an average sample size of 2608 [in contrast to average sample sizes of 148 (Ree et al., 1994) and 952 (Ree & Earles, 1991)]. The data analyses correct for range restriction and measurement error, and both regression and structural equation modeling (SEM) are used to ensure that results are not limited to one analytic approach. Correcting for measurement error is an important advance in this study, as prior research has not consistently performed such corrections. In this study, the "true score" or construct-level relationships between mental abilities and training performance are estimated along with the relationships at the observed score level. As described later, these two types of analyses answer different questions.
The data used in this study have an additional property that enhances their information value. The jobs under study differ in complexity level, allowing us to test specific aptitude theory across jobs of different complexity levels. Prior research has demonstrated that job complexity moderates the relationship between GMA and job performance (Hunter & Hunter, 1984; Salgado et al., 2003b; Schmidt & Hunter, 1998), with higher complexity jobs exhibiting higher validities. However, moderation of validity by complexity level is often weak for performance in training programs (e.g., Hunter & Hunter, 1984), possibly because of the pooling of data into large job families. Therefore, we explore the moderating effect of complexity with particular emphasis on whether specific aptitude theory holds in training certain jobs, but not others. It could be argued that specific aptitude theory is more likely to hold in training programs for low complexity jobs, where the effect for GMA is lower.
Specific Aptitude Theory
There are three levels of ability that can be estimated from mental ability tests: specific aptitudes, general aptitudes, and GMA. Specific aptitudes are assessed by individual tests, such as paragraph comprehension, mathematics knowledge, or mechanical comprehension. Such tests are often correlated and can be combined to measure general aptitudes, such as verbal or quantitative ability. At the broadest level, GMA represents the shared variance among all of these tests. Lubinski (2000) has noted that conceptual definitions of GMA vary, but generally converge on abilities to engage in abstract reasoning, solve complex problems, and acquire new knowledge. There is considerable agreement that mental abilities are organized hierarchically with GMA serving as a latent factor causing the positive correlations among various mental ability tests. This approach to conceptualizing and operationalizing GMA has resulted in a wealth of validity evidence supporting the conclusion that GMA predicts many life and work-related outcomes (Jensen, 1998; Lubinski, 2000; Ones, Viswesvaran, & Dilchert, 2004; Schmidt & Hunter, 2004).
Specific aptitude theory suggests that regression weighted combinations of specific and/or general aptitudes will be better predictors of work-related outcomes than GMA. For example, in occupations that include numerous math-related tasks such as accounting or financial planning, it would be hypothesized that the regression weight on quantitative aptitude would be larger than the weight for other aptitudes. Moreover, it would be hypothesized that the multiple R produced by the specific aptitudes tests would be larger than the zero-order validity of a GMA measure, which would include only the shared variance among all the specific aptitude tests used as its indicators.
Most recently published evidence disconfirms specific aptitude theory. Four studies are noteworthy because they employ large samples and suggest that specific aptitudes provide little incremental prediction over GMA. Two of these studies examine training performance (Hunter, 1986; Ree & Earles, 1991), and the other two examine job performance (McHenry et al., 1990; Ree et al., 1994). Each is discussed below followed by an explanation of its limitations.
Training Performance. Hunter (1986) summarized data from 82,437 military trainees to show that the average predictive validity for GMA (.63) is equal to or higher than the average predictive validity (average adjusted multiple R) of specific ability test composites (.58–.63). The primary limitation of the Hunter (1986) study is that validities are not reported for individual jobs but for large groups of jobs. It could be that different specific aptitudes are important in different jobs, and that averaging across jobs masks differences in specific aptitude validities. As a result, the importance of specific aptitudes may have been underestimated.
Ree and Earles (1991) examined 78,041 Air Force enlistees who completed both basic and specific job training programs. Across 82 job training programs, the authors demonstrated that the factors in the Armed Services Vocational Aptitude Battery (ASVAB) remaining after controlling for the first principal component, which represents GMA, produced little incremental validity over the GMA factor. A limitation of this work is that analyses did not correct for measurement error in either the independent or dependent variables. The limitation of this approach will be discussed in more detail later in this section.
Job Performance. Two studies examined specific aptitude theory but with job performance as the dependent variable. As part of Project A, McHenry et al. (1990) analyzed nine jobs (average N = 449) and found that across five job performance factors the validity of GMA was always greater than the validity of spatial ability or perceptual-psychomotor ability. Using Air Force data across seven jobs (average N = 148), Ree et al. (1994) found that specific abilities incremented the prediction of job performance over GMA by only a small amount (.02 on average). Neither of these studies performed corrections for measurement error.
In summary, the evidence presented to date casts doubt on specific aptitude theory. However, limitations in these studies indicate the need for further research. A stronger test of the theory with regard to training performance would examine validities for training success using large samples for individual jobs, and would fully correct for measurement error. In addition, this research would examine separately the role of specific aptitudes, general aptitudes, and GMA in predicting training success.
Role of Measurement Error. As noted by Schmidt et al. (1981), theory-driven research should examine validities at the true score or construct level. The true-score level refers to the relationship among the constructs free from measurement error and other statistical biases. Examining validities calculated on imperfect measures often produces an inaccurate picture of the relative importance of the abilities themselves. This occurs because partialling out imperfect measures does not fully partial out the effects of underlying constructs (Schmidt et al., 1981). To illustrate, suppose that Ability A is a cause of training performance but Ability B is not. Suppose further that the tests assessing these abilities have reliabilities of .80 and are positively correlated (as occurs with all mental ability tests). Because Ability B is correlated with Ability A, Ability B will show a substantial validity for training performance. Moreover, because Ability A is not measured with perfect reliability, partialling it from Ability B in a regression analysis would not partial out all of the variance attributable to Ability A. Thus, the measure of Ability B will receive a substantial regression weight when in fact the construct-level regression weight is zero. That is, Ability B will predict training performance even though it is not a true underlying cause of training performance. In this case, Ability B will appear to increment validity over Ability A only because of the presence of measurement error (see also Schmidt & Hunter, 1996).
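The logic of this illustration can be checked with a simple simulation. The sketch below (illustrative values only: a true-score correlation of .75 between the two abilities, a causal effect of .50 for Ability A, and reliability .80 for both observed measures) shows that the non-causal Ability B nonetheless receives a nonzero regression weight once measurement error is introduced:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# True scores: A and B correlate .75 (typical for mental ability tests),
# but only A has a causal effect (.50) on training performance.
r_ab, beta_a = 0.75, 0.50
a = rng.standard_normal(n)
b = r_ab * a + np.sqrt(1 - r_ab**2) * rng.standard_normal(n)
perf = beta_a * a + np.sqrt(1 - beta_a**2) * rng.standard_normal(n)

# Observed scores with reliability .80: for a unit-variance true score,
# the error variance is (1 - rxx) / rxx.
err_sd = np.sqrt((1 - 0.80) / 0.80)
a_obs = a + err_sd * rng.standard_normal(n)
b_obs = b + err_sd * rng.standard_normal(n)

# OLS regression of performance on the standardized observed measures.
X = np.column_stack([
    np.ones(n),
    (a_obs - a_obs.mean()) / a_obs.std(),
    (b_obs - b_obs.mean()) / b_obs.std(),
])
beta = np.linalg.lstsq(X, perf, rcond=None)[0]
print(round(beta[1], 2), round(beta[2], 2))
```

Despite a construct-level weight of zero, the observed measure of Ability B carries a weight of roughly .10 here, precisely the artifact described above.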
To obtain accurate population estimates, the relationships among predictors and criterion must be corrected for measurement error before computing validities. None of the prior research on specific aptitude theory has examined prediction with true scores. That is, research to date has not examined the relationships among specific aptitudes, GMA, and performance after correcting for measurement error in both the criterion and predictors.
In contrast to specific aptitude theory, GMA theory predicts that the primary cognitive variance that predicts learning outcomes, such as training performance, will be contained in the general factor underlying mental ability tests scores (Jensen, 1998; Schmidt & Hunter, 2004). General intelligence has been shown to predict learning in countless studies, and it is viewed by many to be the primary individual difference determinant of learning outcomes (Gottfredson, 2002; Lubinski, 2000).
As presented by Schmidt and Hunter (2004), the GMA model of training performance implies a model in which there are no effects for specific and/or general aptitudes on training performance above that accounted for by GMA. This model as captured by subtests of the ASVAB is depicted in Figure 1. Verbal (VERBAL), quantitative (QUANT), and technical (TECHN) are general aptitudes captured by various tests in the ASVAB. In comparison, Figure 2 presents a model that does not contain the GMA factor, and the three general aptitudes directly influence training performance. Specific aptitude theory predicts that the model in Figure 2 will result in better prediction of training performance and better model fit than that produced by the model in Figure 1. On the other hand, GMA theory predicts that the use of specific mental ability tests or general aptitudes will produce no gain in prediction over and above that produced by the GMA factor.
Another issue that is examined with these data is whether the magnitude of the effect of GMA varies across training programs for different jobs. Evidence for such differences is mixed. Hunter and Hunter (1984) found relatively small differences in predictive validities for training performance across job families, but the jobs in this study had limited variability in complexity. The jobs studied were of medium or higher complexity. In contrast, Salgado et al. (2003b) found that, after correcting for multiple statistical artifacts, training validities increased from low to high levels of job complexity (r’s of .36, .53, .72 with increasing complexity). This latter finding is consistent with the general finding that GMA predicts performance better for more complex jobs (Gottfredson, 2002; Schmidt & Hunter, 2004). Consistent with these findings, we predict that GMA validities will be higher for training programs of more complex jobs. Moreover, we expect that, if specific aptitude theory is supported at all, it will receive more support in jobs of lower complexity where the effects of GMA are smaller.
Data for this study were drawn primarily from three sources. First, predictive validities for the ASVAB test battery were obtained from 26,097 trainees enrolled in 10 of the largest Navy technical ("A" Class) schools in 1988. Schools and their associated jobs are described in Table 1. Specific demographic information could not be obtained on these particular trainees, but it is known that they were nearly all males between the ages of 18 and 30, with the majority being Caucasian. Second, correlations among subtests of the ASVAB were obtained from the 1987 applicant population (N = 143,856). This eliminated the need to correct the subtest inter-correlations for range restriction because the correlation matrix is the population matrix of interest (as explained later, the validity coefficients did require correction for range restriction). Third, we calculated test reliabilities from the alternate form reliabilities of ASVAB subtests from the 1983 norming study with 5,517 service applicants (Technical Supplement to the Counselor's Manual for the ASVAB Form-14, 1985). We used the reliabilities for males in grades 11 and 12, as the majority of trainees in this study were male. The reliabilities were adjusted to correspond to the test score standard deviations in the 1987 applicant population (see Magnusson, 1966, pp. 75–76; Nunnally & Bernstein, 1994, p. 261, Eq. 7–6). The test reliabilities ranged from .91 (mathematics knowledge) to .78 (electronics information). Because the alternate form reliabilities were obtained by correlating tests taken on the same day, these reliabilities are slightly inflated. They do not fully control for transient measurement error (Schmidt, Le, & Ilies, 2003). Consequently, these reliability estimates result in a slight undercorrection for measurement error.
Training success was assessed with the final school grade (FSG) that trainees received in their school. FSG is typically created as the average of several multiple choice test scores administered throughout training (e.g., Ree & Earles, 1991). We could find no established estimate of the reliability of FSG, but as it is based on multiple tests within each course, it is likely to be highly reliable. In the reported analyses we assumed a reliability of .90. Analyses were also conducted presuming no measurement error (reliability = 1.0) and lower reliability (reliability = .80). The results (available upon request) did not vary substantially from those reported here.
Specific aptitudes were measured as scores on individual ASVAB tests. Subjects took one of the parallel forms of the ASVAB administered in 1988 – Form 11, 12, 13, or 14. Schmidt and Hunter (2004) presented a measurement model for GMA based on six subtests of the ASVAB: Word Knowledge (WK), General Science (GS), Arithmetic Reasoning (AR), Mathematics Knowledge (MK), Mechanical Comprehension (MC), and Electronics Information (EI). In this study, the Paragraph Comprehension (PC) test was substituted for the GS test as an indicator of Verbal aptitude because it has a lower cross loading with the Technical general aptitude factor described below. The other ASVAB subtests were not included either because they are speeded tests that have low loadings on general aptitude and GMA factors (Coding Speed and Numerical Operations; Hunter, 1986; McHenry et al., 1990) or because of cross-loadings across general aptitude factors (General Science and Auto/Shop Knowledge; Kass, Mitchell, Grafton, & Wing, 1982). Notably, in 2002, the Coding Speed and Numerical Operations subtests were dropped from the ASVAB. More complete descriptions of these tests are available elsewhere (e.g., Kass et al., 1982; Murphy, 1984).
The differential weighting asserted by specific aptitude theory was operationalized via regression and path analysis. In the regression analyses, general aptitudes were assessed as composites of their associated specific aptitudes. General aptitude factors were estimated based on the following equally weighted indicators: Quantitative (Q: AR and MK), Technical (T: MC and EI), and Verbal (V: WK and PC). For use in the true score regression analysis, reliabilities of these composites were estimated using the composite reliability formula in Hunter and Schmidt (2004, p. 438, Eq. 10.14). The reliabilities were .85 (V), .86 (Q), and .80 (T). In the SEM (or path) analyses, general aptitudes were operationalized as the latent factor causing their two associated indicator tests.
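The composite reliability calculation can be sketched as follows. This is the standard composite (Mosier-type) formula for an equally weighted sum of standardized tests; whether it matches Hunter and Schmidt's Eq. 10.14 term-for-term is an assumption, and the component values in the example are illustrative, not the exact inputs behind the .85/.86/.80 reported above.

```python
def composite_reliability(rels, intercorrs):
    """Reliability of an equally weighted sum of k unit-variance tests.

    rels:       reliabilities of the k component tests
    intercorrs: the k*(k-1)/2 pairwise correlations among the components
    """
    k = len(rels)
    total_var = k + 2 * sum(intercorrs)   # variance of the composite
    error_var = sum(1 - r for r in rels)  # summed error variances
    return 1 - error_var / total_var

# Illustrative only: two tests with reliabilities .87 and .91,
# correlated .65 (hypothetical values, not the paper's exact inputs).
print(round(composite_reliability([0.87, 0.91], [0.65]), 3))
```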
The ASVAB does not have an overall score, nor is one created by the military in the use of this particular test. For the bivariate analysis at the observed score level, we created an overall GMA composite that is the equally weighted sum of the three general aptitude scores defined earlier. For example, the Quantitative aptitude score was defined as AR+MK. For each job, the observed correlation between this GMA composite and the criterion of training success was computed using the formula for the correlation of composites given in Nunnally and Bernstein (1994). Reliability of this composite was estimated to be .85 using the composite reliability formula (Hunter & Schmidt, 2004, p. 438), and this reliability was used to make the correction for measurement error required to estimate the construct level GMA correlation with the training success criterion. In SEM analyses, GMA was operationalized as a second order factor causing the three general aptitudes.
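The correlation-of-composites logic used here can be sketched in a few lines. For an equally weighted sum of standardized components, the composite–criterion correlation is the sum of the component validities divided by the standard deviation of the composite (cf. Nunnally & Bernstein, 1994); the numbers in the example are hypothetical.

```python
import math

def composite_criterion_corr(r_iy, intercorrs):
    """Correlation between an equally weighted sum of k unit-variance
    components and an outside criterion y.

    r_iy:       correlations of each component with y
    intercorrs: the k*(k-1)/2 correlations among the components
    """
    k = len(r_iy)
    # cov(composite, y) = sum of r_iy; var(composite) = k + 2 * sum(r_ij)
    return sum(r_iy) / math.sqrt(k + 2 * sum(intercorrs))

# Hypothetical values: three general aptitude scores, each correlating
# .50 with FSG and .60 with one another.
print(round(composite_criterion_corr([0.5, 0.5, 0.5], [0.6, 0.6, 0.6]), 3))
```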
Training complexity was assessed with two measures obtained from different sources. Length of training (in days) was obtained via archival descriptions of the training programs on a Navy recruiting website. Length of training varied from 30 to 89 days. Unfortunately, data could not be obtained on two jobs that had been phased out by the Navy since 1988. Length of training should capture the relative complexity of the training program, as longer training programs would be necessary to cover the knowledge requirements of more complex jobs. The second source of complexity data was Hedge, Carter, Borman, Monzon, and Foley (1992). The authors had 23 experts rate the ability requirements of Navy technical schools, including those in this study. Experts rated the quantitative, verbal, and technical ability requirements on a 3-point scale (0 = ability not required, 1 = ability somewhat important for success, and 2 = ability very important for success), with considerable agreement (intraclass correlation of .95). The sum of these ratings was used as the measure of complexity for each school, as greater mental ability requirements would be estimated for more complicated training programs. The ability requirement measure of complexity varied from two to five, and despite its limited range, it correlated highly with length of training (r = .60).
Prior research on specific aptitude theory has tended to use a single analytical technique – either regression based on observed scores, or regression based on partially corrected scores (corrected only for range restriction and measurement error in the dependent variable). In this study we present regressions for partially and fully corrected scores, and we present SEM results. SEM results also fully correct for measurement error, although the statistical method used differs from the method we use to perform the fully corrected regression. As a result, the inclusion of SEM results reveals whether the construct-level results vary by statistical technique.
Before all analyses, the predictive validities were corrected for range restriction, using the Lawley (1943) formula, and for measurement error in the FSG measure of training success using the classic disattenuation formula. As noted earlier, subtest inter-correlations were not corrected for range restriction because the applicant population matrix was used in all analyses.
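The logic of these two corrections can be illustrated briefly. The Lawley (1943) correction is multivariate and beyond a short sketch; the univariate Thorndike Case II formula below only illustrates the underlying logic, and the numbers are hypothetical rather than drawn from the study's data.

```python
import math

def correct_direct_range_restriction(r, u):
    """Thorndike Case II correction for direct range restriction, where
    u = restricted SD / unrestricted SD of the predictor. (The paper uses
    Lawley's multivariate generalization; this univariate form only
    illustrates the logic of the correction.)"""
    U = 1.0 / u
    return U * r / math.sqrt(1 + (U**2 - 1) * r**2)

def disattenuate_criterion(r, ryy):
    """Classic disattenuation for measurement error in the criterion only."""
    return r / math.sqrt(ryy)

# Hypothetical: an observed validity of .30 in a group restricted on the
# test (u = .70), then corrected for FSG unreliability (ryy = .90).
r = correct_direct_range_restriction(0.30, 0.70)
r = disattenuate_criterion(r, 0.90)
print(round(r, 3))
```

Note that both corrections raise the estimate: restricting range and measuring the criterion with error each attenuate the observed validity.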
Figure 3 summarizes the analysis plan. In Figure 3, the first row indicates that we present three regression analyses with partially corrected scores. These analyses are similar to the analyses presented by Ree and Earles (1991), and make no adjustments for measurement error in the predictors.
The second row indicates that we present three regression analyses with true scores, correcting the observed data for measurement error in the predictors.
These results provide a picture of construct-level relationships, rather than relationships between imperfect measures. Regressions in both rows were conducted with Hunter’s program REGRESS, which provides accurate standard error estimates for regression coefficients based on corrected correlations (Hunter & Cohen, 1995).
The third row indicates that two SEM tests are also conducted using LISREL 8.51, which uses a different estimation algorithm (ML instead of OLS) and a somewhat different method of correcting for measurement error. More specifically, corrections for measurement error in SEM are based on the congeneric model of measurement equivalence, in contrast to the parallel forms model of measurement equivalence that is the basis for corrections made using reliability coefficients (Nunnally & Bernstein, 1994).
In the analyses in which GMA is the only predictor (C, F, and H in Figure 3), the statistic of interest is the zero-order correlation between GMA and the criterion. In all of the analyses with multiple predictors, the primary statistic of interest is the adjusted R. The adjustment for capitalization on chance was conducted using the Wherry formula (Cattin, 1980), which provides an estimate of the R that would be produced by the population regression weights. The sample size used in this adjustment was derived using a formula from Schmidt, Hunter, and Larson (1988); this formula is described by Ree et al. (1994). The formula adjusts the actual sample size to account for the increase in sampling error caused by range restriction corrections. One minor adjustment was made to the formula reported by Schmidt et al. (1988) and Ree et al. (1994). The standard error of the corrected correlation that was used to calculate the "Effective N" was calculated using a more accurate formula discussed by Raju and Brand (2003) and Hunter and Schmidt (2004, p. 109, Eq. 3.21). Table 1 provides the resulting "Effective N," which is the N used in calculating the adjusted R (and all other statistics that include sample size).
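The Wherry shrinkage adjustment takes a simple form. The sketch below assumes the common statement of the formula (adjusted R² = 1 − (1 − R²)(N − 1)/(N − k − 1)); the input values are illustrative, not results from the study's tables.

```python
import math

def wherry_adjusted_R(R, n, k):
    """Wherry shrinkage formula (as described by Cattin, 1980): estimates
    the multiple R that the population regression weights would produce.
    In this study's usage, n would be the 'Effective N' from Table 1."""
    adj_r2 = 1 - (1 - R**2) * (n - 1) / (n - k - 1)
    return math.sqrt(max(adj_r2, 0.0))

# Illustrative: an observed R of .55 from six predictors with n = 2608.
print(round(wherry_adjusted_R(0.55, 2608, 6), 3))
```

With samples this large, shrinkage is minimal; with small samples and many predictors, the adjustment matters far more.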
Differences in predictive validity were examined across analyses A, B, and C, and across D, E, and F. Specific aptitude theory would suggest that validities for A and B should be larger than for C, and those for D and E should be larger than for F. Moreover, to the extent that specific aptitudes are more important in jobs that are lower in complexity, these differences should be more pronounced in jobs that have shorter training times and lower overall ability requirements. If, on the other hand, GMA provides equal or better prediction of training success in equations C and F, then specific aptitude theory is disconfirmed.
In the SEM analyses, the fit of the general aptitude and GMA models within each training school was examined, as well as the predictive validities. In addition to R values (adjusted for capitalization on chance in the general aptitude model), model fit for the general aptitude model and GMA model were calculated for comparison. Model fit statistics for the specific aptitude model (six tests predicting training performance) are not reported because the model is fully saturated (i.e., model fit is perfect).
The general aptitude (Analysis G) and GMA models (Analysis H) are not nested models because they contain different numbers of latent factors (three vs. four, respectively). Most methodologists suggest that informational or descriptive fit statistics (rather than comparative fit statistics) should be used under these conditions (Browne & Cudeck, 1993); such statistics do not use baseline models as the standard by which fit is judged. In the case of non-nested models, the baseline models differ, so observed differences in comparative model fit are difficult to interpret. Consequently, the following descriptive fit statistics are presented: (1) χ² to degrees of freedom ratio, (2) root mean square error of approximation (RMSEA), (3) expected cross-validation index (ECVI), and (4) Akaike information criterion (AIC). There are no standard interpretations for the χ² to degrees of freedom ratio (Bollen, 1989), but lower values indicate better fit. RMSEA values are typically interpreted as follows: .05 or lower indicates good model fit; .05 to .08 fair fit; .08 to .10 mediocre fit; and over .10 poor fit (MacCallum, Browne, & Sugawara, 1996). Both ECVI and AIC are less frequently used than comparative fit indices, and they do not have a standard interpretation; instead, they are used to compare alternative models directly. Both present descriptive values about the degree of fit of the predicted to observed correlation matrix. Thus, as with the χ² to degrees of freedom ratio and RMSEA, lower values indicate better model fit. Although not typically suggested for comparing non-nested models, one comparative fit index is presented for purposes of illustration, the Tucker–Lewis index or non-normed fit index (NNFI). In combination, these fit statistics allow for a determination of whether the general aptitude or GMA model provides a relatively better fit to the observed data.
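Two of these descriptive statistics are easy to compute from a model chi-square. The sketch assumes the common Steiger-Lind point estimate of RMSEA, √(max(χ² − df, 0)/(df(N − 1))); the fit values in the example are hypothetical, not the study's results.

```python
import math

def rmsea(chi2, df, n):
    """Point estimate of RMSEA from a model chi-square
    (common Steiger-Lind implementation)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def chi2_df_ratio(chi2, df):
    """Chi-square to degrees of freedom ratio; lower values indicate better fit."""
    return chi2 / df

# Hypothetical fit results for one model on a sample of 1001:
# the non-nested competitor with the lower values would be preferred.
print(round(rmsea(100.0, 20, 1001), 3), round(chi2_df_ratio(100.0, 20), 1))
```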
Tables 1–3 summarize the data used for the analyses. Table 1 describes the training programs and presents the data on sample sizes, training length, and expert-rated ability requirements. Table 2 reports the reliabilities of and uncorrected inter-correlations among the six ASVAB subtests used in this study. As would be expected in an unrestricted sample, the tests are highly correlated (r’s range from .51 to .75), and reliable (alternate form reliabilities range from .78 to .91).
Table 3 reports the validities of the subtests for predicting FSG by school, corrected for range restriction and measurement error in FSG (but not for measurement error in the tests). The quantitative subtests display higher predictive validities than the other subtests, but the sample-weighted mean validities across tests (collapsed across schools) do not appear to vary substantially (r = .40–.49).
That is, the tests perform similarly in predicting FSG. In contrast, the mean validities across schools (collapsed across tests) vary substantially (r = .34–.58), suggesting that the validity of mental ability tests varies across training programs for different jobs.
Tables 4 and 5 summarize the regression analyses. Table 4 summarizes the regression analyses based on observed predictor scores. Analysis A presents the prediction of FSG by the six subtests. Across the 10 schools, adjusted R values range from .43 (BT) to .73 (ET), with a sample-weighted mean value of .55 across schools. Analysis B presents the prediction of FSG by the three general aptitudes. Adjusted R values range from .43 (BT) to .73 (ET), with a sample-weighted mean value of .55. Analysis C presents the prediction of FSG by GMA; zero-order validities range from .42 (AM) to .71 (ET), with a sample-weighted mean of .55.
The last two columns in Table 4 present the differences between these values, which are remarkably small and do not vary much between schools. Because the pattern of results is similar across the 10 schools, averages are informative. The average difference in adjusted R between A and B is .00, between A and C is .01, and between B and C is .01. These gains in predictive validity for using specific aptitude or general aptitude over prediction from GMA are very small. These results shed light on the predictive gains from regression-weighted measures of specific and general aptitudes in an applied selection context; the maximum improvement in validity from using specific or general aptitudes is less than 2%, which was the figure reported by Ree and Earles (1991) and Ree et al. (1994).
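The adjusted R values and sample-weighted means used throughout these comparisons follow standard formulas, sketched below with made-up numbers rather than the study's data.

```python
import math

def adjusted_R(R, n, k):
    # Wherry shrinkage formula: corrects the multiple correlation for
    # capitalization on chance with k predictors and n cases.
    adj_R2 = 1.0 - (1.0 - R * R) * (n - 1) / (n - k - 1)
    return math.sqrt(max(adj_R2, 0.0))

def sample_weighted_mean(values, ns):
    # Mean across schools, weighting each school's value by its sample size.
    return sum(v * n for v, n in zip(values, ns)) / sum(ns)
```

With six predictors, shrinkage is substantial in small samples and nearly vanishes in large ones, which is why the observed-score analyses report adjusted rather than raw multiple correlations.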
As noted earlier, the results of observed score analyses can be misleading when one’s concern is theoretical and the research questions of interest involve the underlying constructs. This occurs because measurement error in the predictors can distort both the multiple correlations and the relative size of the regression weights and cause observed measures to show incremental validity that does not exist at the level of the constructs.
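Construct-level regression addresses this by disattenuating the predictor intercorrelations and validities for predictor unreliability before computing regression weights. A minimal sketch with a hypothetical two-predictor case (the correlations and reliabilities below are illustrative, not the study's values):

```python
import numpy as np

def disattenuate_predictors(R_xx, r_xy, rel):
    # Divide each correlation by the square roots of the relevant predictor
    # reliabilities, correcting for measurement error in the predictors only.
    q = np.sqrt(np.asarray(rel, dtype=float))
    R_true = np.asarray(R_xx, dtype=float) / np.outer(q, q)
    np.fill_diagonal(R_true, 1.0)
    r_true = np.asarray(r_xy, dtype=float) / q
    return R_true, r_true

def multiple_R(R_xx, r_xy):
    # Standardized regression weights b = R_xx^{-1} r_xy; R = sqrt(r_xy' b).
    r_xy = np.asarray(r_xy, dtype=float)
    b = np.linalg.solve(np.asarray(R_xx, dtype=float), r_xy)
    return float(np.sqrt(r_xy @ b))

# Hypothetical: two subtests correlated .50, validities .40, reliabilities .80
R_true, r_true = disattenuate_predictors([[1.0, 0.5], [0.5, 1.0]],
                                         [0.4, 0.4], [0.8, 0.8])
```

Comparing multiple_R on the observed versus disattenuated matrices shows how measurement error alters both the multiple correlation and the relative weights, the distortion described above.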
Table 5 summarizes the regression analyses for scores corrected for measurement error in the predictors. Analysis D presents the prediction of FSG by the six subtests. Across the 10 schools, adjusted R values range from .44 (BT) to .75 (ET), with a sample-weighted mean value of .56 across schools. Analysis E presents the prediction of FSG by the three general aptitude constructs. Adjusted R values range from .45 (BT) to .77 (ET), with a sample-weighted mean value of .58. Analysis F presents the prediction of FSG by GMA; zero-order prediction ranged from .46 (AM) to .77 (ET) with a sample-weighted mean of .58.
The last two columns in Table 5 indicate that the differences between these values are small and do not vary much between schools. Again, because the pattern of results is similar across the 10 schools, averages can be used to illustrate. The average difference between D and E is -.01, between D and F is -.02, and between E and F is -.01.
Thus, on average, prediction by the GMA construct is better than prediction by weighted combinations of either the general or specific aptitude constructs, although by very small margins. This finding is very close to the GMA theory prediction of equal predictive power, but very different from the prediction of specific aptitude theory. Moreover, the small predictive advantage gained by including specific aptitude tests, shown in Table 4 and in prior research (Ree & Earles, 1991; Ree et al., 1994), completely disappears in these construct-level analyses. The hypothetical illustration presented earlier from Schmidt et al. (1981) presents the conceptual explanation for this reversal. The presence of measurement error causes measures of specific and general aptitudes to make contributions (however small) to prediction that do not exist at the construct level.
Table 6 summarizes the SEM analyses. Analysis G presents the three-factor general aptitude model (see sample model in Figure 2); Analysis H is the GMA model, in which the three general aptitudes load onto GMA and GMA predicts FSG (see sample model in Figure 1). Fit indices presented in this table demonstrate that the models fit the data. Sample-weighted mean fit indices for the three-factor and GMA models are, respectively: χ² to degrees-of-freedom ratios of 4.21 and 4.35; RMSEAs of .06 and .06; ECVIs of .10 and .10; AICs of 75.86 and 81.80; and NNFIs of .98 and .98. While the NNFI values are very high and suggest excellent fit, the RMSEAs include both good (< .05: MM, ST) and fair fit (between .05 and .08: AE, AM, BT, ET, EM, OS, RM, and SM). In only one case (RM with the GMA model, RMSEA = .087) does an RMSEA value exceed the .08 threshold for fair fit. Thus, both models fit the data reasonably well, and these minor differences aside, the models fit all 10 schools.
The general trend in all of these indices is for the three-factor model to fit the data better, but the differences are small enough to be considered negligible. Thus, despite the addition of a latent factor and constraints imposed by forcing the general aptitudes to load on that factor, the GMA model fits the data as well as the three-factor model.
As would be expected, predictive validities using SEM are similar to the corrected analyses presented in Table 5. The small difference obtained (-.01) across schools slightly favors the GMA model over the general aptitude model. Again, the difference is too small to be of importance. Moreover, as with the construct-level analysis reported in Table 5, the results in Table 6 do not show any incremental prediction from including specific or general aptitudes beyond GMA.
Finally, analyses were conducted to examine possible differences in validities based on the complexity of training. Results from Tables 4–6 all reveal what appear to be two clusters of predictive validities. Meta-analysis of the schools with lower validities (AE, AM, BT, MM, RM, and SM) and higher validities (ET, EM, OS, and ST) reveals sample-size-weighted mean validities of .45 (90% confidence interval [CI] .427, .477) and .67 (90% CI .646, .702) from Analysis C reported in Table 4. These confidence intervals do not overlap, indicating that the population values differ substantially across the two clusters of schools. Moreover, the magnitude of the validity increases as the length and expert-rated ability requirements of the training program increase. The zero-order correlation between length of school and GMA validity was .77; the mean length of school is 75 days in the high-validity cluster and 47 days in the low-validity cluster. The zero-order correlation between ability requirements and GMA validity was .33; the mean ability requirement is 4.00 in the high-validity cluster and 3.50 in the low-validity cluster. Thus, based on these correlations, it appears that the greater the complexity of the training program, the higher the validity of GMA.
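A back-of-the-envelope version of this clustering check, a sample-size-weighted mean validity with an approximate Fisher-z confidence interval, can be sketched as follows. This is a simplification; the authors' meta-analytic procedure may differ in detail, and the inputs below are illustrative, not the study's data.

```python
import math

def weighted_mean_r_ci(rs, ns, z_crit=1.645):
    # Sample-size-weighted mean correlation across schools, with an
    # approximate 90% CI (z_crit = 1.645) via the Fisher z transform,
    # using a pooled-N standard error as a rough approximation.
    n_total = sum(ns)
    r_bar = sum(r * n for r, n in zip(rs, ns)) / n_total
    z = math.atanh(r_bar)
    se = 1.0 / math.sqrt(n_total - 3 * len(rs))
    lo = math.tanh(z - z_crit * se)
    hi = math.tanh(z + z_crit * se)
    return r_bar, lo, hi
```

Non-overlapping intervals for the two clusters would, as reported in the text, indicate genuinely different population validities rather than sampling error.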
Notably, the pattern of results for specific aptitude vs. GMA theory was not affected by either apparent training complexity or the magnitude of the GMA validity. Differences in predictive validity across the general aptitude and GMA models reported in Table 4 (observed predictor score regressions) varied from only .00 to .04, and these differences were identical in the high-complexity/high-GMA jobs (average difference = .01) and the low-complexity/low-GMA jobs (average difference = .01). Differences in general aptitude and GMA predictive validity reported in Table 5 (corrected predictor score regressions) varied from -.04 to .04, and again these differences were similar in the high-complexity/high-GMA jobs (average difference = -.02) and the low-complexity/low-GMA jobs (average difference = .00). Finally, in the SEM analyses reported in Table 6, differences in general aptitude and GMA predictive validities varied only from -.02 to .01 and were similar in the high-complexity/high-GMA (average difference = -.02) and low-complexity/low-GMA (average difference = -.01) clusters. The predictions from GMA theory were supported across jobs of varying complexity and GMA demands.
This study avoided the methodological deficiencies of previous studies on the question of the incremental prediction of specific aptitudes over GMA. More specifically, large-sample individual jobs (rather than job families) that varied in complexity were examined, and measurement error corrections were made using multiple approaches. Given the importance and plausibility of specific aptitude theory, testing the theory under optimal conditions with the most accurate available statistical techniques is necessary to advance our understanding of the link between mental abilities and training performance.
With the improved methods used in this study, specific ability tests provided little if any incremental validity in the prediction of training success over GMA. This finding held through three different approaches to the data analysis – regression based on observed predictor scores, regression based on construct scores, and SEM (which is another method of examining relationships among construct scores). Notably, the 2% incremental prediction found in prior research (e.g., Ree & Earles, 1991) effectively disappeared when corrections for measurement error were performed in the latter two analyses. In combination with prior research, these results provide strong evidence against specific aptitude theory.
These results suggest that specific aptitude theory should not be retained in the prediction of global measures of training success. These results do not go so far as to indicate that specific aptitudes have no psychological significance or meaning, but they constitute compelling evidence that learning for a variety of jobs is predominantly determined by GMA, not by specific aptitudes. That is, they show that the specific factors in the aptitude measures (the factors measured by the specific aptitude tests beyond GMA) do not contribute to prediction. Likewise, the components of the general aptitudes (V, Q, and T) that go beyond merely reflecting GMA do not contribute to prediction. This is the major theoretical implication.
These results may help explain recent meta-analytic findings. Based on data from European countries, Salgado et al. (2003a) showed that GMA has higher predictive validity for training performance than specific ability tests. The mean estimated operational validity of GMA was .54 (K = 97, N = 16,065), whereas the validities for more specific aptitudes varied from .25 to .48. Viewed from the perspective that specific aptitudes are imperfect indicators of GMA, specific aptitude tests predict some but not all of the variance in training performance that can be predicted by GMA. Because each specific aptitude test is a relatively poor indicator of GMA, predictive validities for specific aptitudes should always be lower than when a more complete measure of GMA is used.
Specific aptitude theory can be viewed as a special case of the theory that matching predictor and criterion constructs will lead to higher validity. For example, specific aptitude theory says that if a job involves reading and writing, a verbal ability test will have higher validity than a GMA test, because the verbal construct is predominant in both the predictor and the criterion, producing a match. Conversely, the reason specific aptitude theory predicts lower validity for GMA is that the construct of GMA is quite different from, and does not ‘‘match,’’ the construct of verbal ability that appears to be required for the job. So it is clear that the results of our study contradict the predictor-criterion construct matching theory in the area of mental abilities. However, for other predictors, such as job knowledge tests, that theory may be valid. For example, of several job knowledge tests, the most valid one is likely to be the one whose content most closely matches the content of the job.
The above example raises the question of the precise nature of the difference between specific aptitude and job knowledge tests. The key difference is their relationship with GMA: specific aptitude measures have higher GMA loadings than job knowledge tests in most populations. Job knowledge tests are expected to have high GMA loadings in groups in which all members have had equal opportunity to learn the knowledge content. This would be true, for example, if the subjects were incumbents who had all been on the job the same length of time. Such groups are rare, however; in most applicant samples, individuals differ widely in previous opportunity to learn the specific content of the knowledge tests, so score differences are due less to GMA and more to differences in previous opportunity, resulting in lower GMA loadings. By contrast, measures of specific aptitudes have high GMA loadings in all groups. This, rather than the content of the test, is the critical difference between specific abilities and job knowledge tests. For example, in many previous studies using the ASVAB, the subtest General Science has been found to be an excellent measure of verbal aptitude and to have a high GMA loading. Although it is ostensibly a measure of knowledge, the knowledge domain measured is quite broad, every individual has had substantial opportunity to learn this general knowledge, and the knowledge is conceptual in nature. Hence it serves as an excellent measure of a specific aptitude and has a high GMA loading.
These findings also have implications for the question of whether the predictive validity of GMA varies across training for different types of jobs. In contrast to some prior research (e.g., Hunter & Hunter, 1984; Jones & Ree, 1998), validities in this set of jobs varied considerably. Specifically, the largest validity from the SEM analysis (.78) was 70% greater than the lowest validity (.46). Moreover, there was a clear pattern to these differences; the predictive validities increased substantially as the complexity of the training increased. Of course, the measures of training complexity were indirect because a more direct measure could not be obtained for these data. However, both measures of complexity indicated the same results, thus raising confidence in our conclusion. We can safely conclude that the validity of GMA is high across all programs but not identical in magnitude.
The primary practical implication of this finding is that weighted combinations of specific aptitude tests, including those that give greater weight to certain tests because they seem more relevant to the training at hand, are unnecessary at best. At worst, the use of such tailored composites may lead to a reduction in validity. For prediction of training success, a good measure of GMA is likely to yield prediction at least as good as that produced by multiple aptitude measures in a regression equation. This point is particularly useful for researchers who seek to control for abilities relevant to learning when studying other constructs, such as motivation to learn (Colquitt et al., 2000). In such situations, a GMA measure can be considered sufficient for controlling for mental abilities, at least when examining overall training success.
It is worth revisiting the distinction between training and job performance and its relevance for this study. While these findings specifically address training performance, they have implications for understanding and predicting job performance as well. Prior evidence strongly suggests that training performance and job performance are correlated, with training performance and associated job knowledge serving as a meaningful determinant of job performance (Hunter, 1986). Moreover, many authors argue that, with the increasing complexity and dynamism of work today, workers are required to continually update their skills through training and other, less formal means of learning (e.g., Kraut & Korman, 1999). From this vantage point, the ability to learn is not only a predictor of job performance but arguably an increasingly important component of it as well. Consequently, we believe these findings would be replicated if the study were repeated with job performance measures as the dependent variables.
Limitations and Future Research
Despite the large sample sizes, this study does have limitations. First, analyses were not conducted on a representative sample of Navy or, for that matter, civilian jobs. Data from large-sample jobs were specifically requested from the Navy in order to reduce sampling error. Future research on mental ability and training performance could seek a broad set of representative jobs, including some jobs that are less technical and less heavily dependent on GMA (e.g., basic customer service jobs). However, prior research using a broader sample of jobs has found similar results with regard to specific aptitude theory (Hunter, 1986; Ree & Earles, 1991), so the conclusions are unlikely to differ. Second, some information about the jobs and schools was missing, and as a result it was necessary to use indirect and sometimes incomplete measures. Detailed information about each school would have been useful to determine whether some feature of a school other than complexity affected validities. Snow (1989), for example, indicates that the largest aptitude-by-treatment interaction found in educational research is for intelligence and structure, with less intelligent students benefiting much more from structured material than more intelligent students. Military training is highly structured and developed using a standardized instructional design process, so it is unlikely that schools had vastly different instructional characteristics. Nevertheless, it is possible that differences in instructional process across schools played a role in the observed effects. Third, this study used only a global indicator of training success – final course grade. Future research might benefit from decomposing final grade into different learning outcomes, such as the acquisition of knowledge, the acquisition of skill, and socialization to desired attitudes and values (Kraiger, Ford, & Salas, 1993).
Although prior research does not suggest that specific aptitudes will provide better prediction of narrower training criteria than GMA (Duke & Ree, 1996; Olea & Ree, 1994), future research could examine even more fine-grained measures of training success, particularly desired attitudes and values which have received relatively little research attention.
Specific aptitude theory has intuitive appeal because it suggests that each individual may have personal strengths with regard to mental abilities that allow him/her to succeed at different learning tasks. Despite its appeal, the data presented here do not support the theory. Optimally weighted combinations of specific aptitudes that serve as indicators of GMA do not provide incremental validity over GMA for the prediction of training success. Moreover, the GMA causal model fits observed data across jobs as well as the specific aptitude model. Thus, we conclude there is no reason to expect that tailored test composites will be more useful than an overall measure of GMA in predicting overall training success.