**Score gains on g-loaded tests: No g**. Jan te Nijenhuis, Annelies E.M. van Vianen, Henk van der Flier, 2007.

IQ scores provide the best general predictor of success in education, job training, and work. However, there are many ways in which IQ scores can be increased, for instance by means of retesting or participation in learning potential training programs. What is the nature of these score gains? Jensen … argued that the effects of cognitive interventions on abilities can be explained in terms of Carroll’s three-stratum hierarchical factor model. We tested his hypothesis using test–retest data from various Dutch, British, and American IQ test batteries combined into a meta-analysis and learning potential data from South Africa using Raven’s Progressive Matrices. The meta-analysis of 64 test–retest studies using IQ batteries (total N=26,990) yielded a correlation between g loadings and score gains of −1.00, meaning there is no g saturation in score gains. The learning potential study showed that: (1) the correlation between score gains and the g loadedness of item scores is −.39, (2) the g loadedness of item scores decreases after a mediated intervention training, and (3) low-g participants increased their scores more than high-g participants. So, our results support Jensen’s hypothesis. The generalizability of test scores resides predominantly in the g component, while the test-specific ability component and the narrow ability component are virtually non-generalizable. As the score gains are not related to g, the generalizable g component decreases and, as it is not unlikely that the training itself is not g-loaded, it is easy to understand why the score gains did not generalize to scores on other cognitive tests and to g-loaded external criteria.

2. Jensen’s hypothesis: score gains can be summarized in the hierarchical intelligence model

It is hypothesized that a training effect is most clearly manifested at the lowest level of the hierarchy of intelligence, namely on specific tests that most resemble the trained skills. One hierarchical level higher, the training effect is still evident for certain narrow abilities, depending on the nature of the training. However, the gain virtually disappears at the level of broad abilities and is altogether undetectable at the highest level, g. This implies that the transfer of training effects is strongly limited to tests or tasks that are all dominated by one particular narrow skill or ability. There is virtually no transfer across tasks dominated by different narrow abilities, and it disappears completely before reaching the level of g. Thus, there is an increase in narrow abilities or test-specific ability that is independent of g. Test-specific ability is defined as that part of a given test’s true-score variance that is not common to any other test; i.e., it lacks the power to predict performance on any other tasks except those that are highly similar.

Gains on test specificities are therefore not generalizable, but ‘empty’ or ‘hollow’. Only the g component is highly generalizable. Jensen (1998a, ch. 10) gives various examples of empty score gains, including a detailed analysis of the Milwaukee project, in which he argues that IQ scores rose but g scores did not. Another example of empty score gains is given by Christian, Bachman, and Morrison (2001), who state that increases due to schooling show very little transfer across domains.

It is hypothesized that the g loadings of the few tests that are most similar to the trained skills and therefore most likely to reflect the specific training diminish after training. That is, after training, these particular tests reflect the effect of the specific training rather than the general ability factor. […]

However, Ackerman (1987) cites several classical studies on the acquisition of simple skills through often repeated exercise where low-g persons made the most progress. These findings could be interpreted as an indication that this specific skill acquisition process is not g-loaded.

3. First test of Jensen’s hypothesis: studies on repeated testing and g loadedness

In a classic study by Fleishman and Hempel (1955), as subjects were repeatedly given the same psychomotor tests, the g loading of the tests gradually decreased and each task’s specificity increased. Neubauer and Freudenthaler (1994) showed that after 9 h of practice the g loading of a modestly complex intelligence test dropped from .46 to .39. Te Nijenhuis, Voskuijl, and Schijve (2001) showed that after various forms of test preparation the g loadedness of their test battery decreased from .53 to .49. Based on the work of Ackerman (1986, 1987), it can be concluded that through practice on cognitive tasks part of the performance becomes overlearned and automatic; the performance requires less controlled processing of information, which is reflected in lowered g loadings.

4. Second test of Jensen’s hypothesis: studies on practice and coaching

Three studies on practice and coaching have shown increases in test scores that are not related to the g factor. This suggests that the gains are ‘empty’ or ‘hollow’. In the first study, Jensen (1998a, ch. 10) analyzed the effect of practice on the General Aptitude Test Battery (GATB). He found negative correlations ranging from −.11 to −.86 between effect sizes on practice and the tests’ g loadings. Therefore, the gains were largest on the least cognitively complex tests. In the second study, te Nijenhuis et al. (2001) found a small correlation of −.08 for test practice, and large negative correlations of −.87 for both of their two test coaching conditions. Jensen carried out a factor analysis of the various GATB score gains and found two large factors that did not correlate with the g factor extracted from the GATB. Most likely, the score gains are not on the g factor or the broad abilities, but on the test specificities, since te Nijenhuis et al. showed that practice and coaching reduce the g-loadedness of their tests. In a third study (Coyle, 2006), factor analysis demonstrated that the change in aptitude test scores had a zero loading on the g factor.

8. Method

Psychometric meta-analysis (Hunter & Schmidt, 1990) aims to estimate what the results of studies would have been if all studies had been conducted without methodological limitations or flaws. The results of perfectly conducted studies would allow a less obstructed view of the underlying construct-level relationships (Schmidt & Hunter, 1999). One of the goals of the present meta-analysis is to have a reliable estimate of the true correlation between standardized test–retest score gains (d) and g. Although the construct of g has been thoroughly studied, the construct underlying score gains is less well understood. One of the aims of the present study is to have a clearer understanding of the construct underlying score gains by linking it to the g nexus. Carrying out a complete meta-analysis on the relationship between d and g would require the collection of a very large number of datasets. However, applying meta-analytical techniques to a sufficiently large number of studies will also lead to a reliable estimate of the true correlation between d and g. We therefore collected a large number of studies heterogeneous across various possible moderators.

To get a reliable correlation between g and d, we focused on batteries with a minimum of seven subtests. Libraries and test libraries of universities were searched and several members of the Dutch Testing Commission and test publishers were contacted. We limited ourselves to non-clinical samples, without health problems. Only a minority of test manuals report test–retest studies; especially before 1970 they are rare. The search yielded virtually all test–retest studies available in the Netherlands. The GATB manual (1970, ch. 20) reports very large datasets on secondary school children who took the GATB with respectively 1-, 2-, and 3-year intervals. At the time of the first test, large samples of children that had the same age as the test–retest children at the time of the second test also took the test. Through a comparison of the scores, the maturation effects could be separated from the test–retest effects, so we included the data in the present study.

Standardized score gains were computed by dividing the raw score gain by the S.D. of the pretest. In general, g loadings were computed by submitting a correlation matrix to a principal axis factor analysis and using the loadings of the subtests on the first unrotated factor. In some cases, g loadings were taken from studies where other procedures were followed; these procedures have been shown empirically to lead to highly comparable results. Pearson correlations between the standardized score gains and the g loadings were computed.
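The computation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the iterated principal axis routine with squared multiple correlations as starting communalities, and the seven-subtest loadings and gains are all assumptions made for the example.

```python
import numpy as np

def standardized_gain(pre_mean, post_mean, pre_sd):
    """Raw score gain divided by the pretest S.D."""
    return (post_mean - pre_mean) / pre_sd

def first_unrotated_factor(corr, n_iter=100):
    """Loadings on the first unrotated factor (iterated principal axis factoring)."""
    corr = np.asarray(corr, dtype=float)
    # Starting communalities: squared multiple correlations (an assumed choice).
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(corr))
    for _ in range(n_iter):
        reduced = corr.copy()
        np.fill_diagonal(reduced, h2)
        eigvals, eigvecs = np.linalg.eigh(reduced)  # eigenvalues in ascending order
        loadings = eigvecs[:, -1] * np.sqrt(max(eigvals[-1], 0.0))
        h2 = loadings ** 2
    # The sign of an eigenvector is arbitrary; orient the factor positively.
    return loadings if loadings.sum() >= 0 else -loadings

def r_gd(g_loadings, gains):
    """Pearson correlation between the vector of g loadings and the vector of gains."""
    return float(np.corrcoef(g_loadings, gains)[0, 1])

# Hypothetical battery of seven subtests generated from a one-factor model.
true_g = np.array([0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
corr = np.outer(true_g, true_g)
np.fill_diagonal(corr, 1.0)
g = first_unrotated_factor(corr)

# Hypothetical standardized gains, larger on the less g-loaded subtests.
d = np.array([0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.50])
print(r_gd(g, d))
```

With gains that shrink as g loadings rise, the vector correlation comes out strongly negative, which is the pattern the meta-analysis looks for across batteries.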

8.1. Correction for sampling error

In many cases, sampling error explains the majority of the variation between studies, so the first step in a psychometric meta-analysis is to correct the collection of effect sizes for differences in sample size between the studies.
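As a rough sketch of this first step, the code below applies the standard "bare-bones" Hunter and Schmidt formulas: an N-weighted mean correlation, the N-weighted observed variance, and the expected sampling error variance. The text does not spell out which formulas the authors' program uses, so treat this as an assumption; the function name and the input values are hypothetical.

```python
import numpy as np

def bare_bones(rs, ns):
    """Bare-bones meta-analysis of correlations (sampling error only)."""
    rs = np.asarray(rs, dtype=float)
    ns = np.asarray(ns, dtype=float)
    r_bar = np.sum(ns * rs) / np.sum(ns)                   # N-weighted mean r
    var_obs = np.sum(ns * (rs - r_bar) ** 2) / np.sum(ns)  # N-weighted observed variance
    var_err = (1.0 - r_bar ** 2) ** 2 / (ns.mean() - 1.0)  # expected sampling error variance
    # Percentage of observed variance attributable to sampling error.
    pct_explained = 100.0 * var_err / var_obs if var_obs > 0 else float("inf")
    return r_bar, var_obs, var_err, pct_explained

# Hypothetical correlations and sample sizes from four studies.
print(bare_bones([-0.20, -0.60, -0.35, -0.50], [100, 100, 250, 80]))
```

The percentage can exceed 100 when the observed variance happens to be smaller than the variance expected from sampling error alone.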

8.2. Correction for reliability of the vector of g loadings

The value of r_{gd} is attenuated by the reliability of the vector of g loadings for a given battery. When two samples have a comparable N, the average correlation between vectors is an estimate of the reliability of each vector. The collection of datasets in the present study included no g vectors for the same battery from different samples and therefore artifact distributions were based upon other studies reporting g vectors for two or more samples. So, the effect sizes and the distribution of reliabilities of the g vector were based upon different samples. When two g vectors were compared the correlation between them was used, and when more than two g vectors were compared the average correlation for the various combinations of two vectors was used. The combined N from the samples on which the g vector was based was taken as the weight of one data point.
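The reliability estimate described here, the correlation between two vectors or the average over all pairs when there are more than two, can be sketched as follows. The function names and the three g vectors are hypothetical, and the N-weighted averaging helper only illustrates how the combined N could act as a weight; it is not the artifact-distribution procedure of the authors' program.

```python
from itertools import combinations
import numpy as np

def vector_reliability(vectors):
    """Average Pearson correlation over all pairs of vectors for one battery."""
    rs = [np.corrcoef(a, b)[0, 1] for a, b in combinations(vectors, 2)]
    return float(np.mean(rs))

def n_weighted_mean(values, weights):
    """Mean of reliability estimates, each weighted by its combined N."""
    return float(np.average(values, weights=weights))

# Hypothetical g vectors for the same eight-subtest battery in three samples.
sample_a = np.array([0.75, 0.70, 0.62, 0.58, 0.55, 0.49, 0.40, 0.35])
sample_b = np.array([0.72, 0.71, 0.60, 0.61, 0.50, 0.47, 0.44, 0.30])
sample_c = np.array([0.78, 0.66, 0.65, 0.55, 0.57, 0.45, 0.38, 0.37])

print(vector_reliability([sample_a, sample_b, sample_c]))
```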

Several samples were compared that differed little on background variables. For the comparisons using children, we chose samples that were highly comparable with regard to age and, for the comparisons of adults, we chose samples that were roughly comparable with regard to age. In a study on young children, Schroots and van Alphen de Veer (1979) report correlation matrices for the Leidse Diagnostische Test for eight age groups between 4 and 8 years of age. The average correlation between the adjacent age groups is .75 (combined N=1169). Several studies report data on both younger and older children. The Dutch/Flemish WISC-R (van Haasen et al., 1986) has samples with comparable N of Dutch and Flemish children, so the 11 age groups between 6 and 16 could be compared. This resulted in an average correlation of .78 (combined N=3018). Jensen (1985) reports g loadings of the 12 subtests of the WISC-R obtained in three large independent representative samples of Black and White children. The average correlation between the g vectors obtained for each sample is .86 for the Black children (combined N=1238) and .93 for the White children (combined N=2868). In a study on older children, Evers and Lucassen (1991) report the correlation matrices of the Dutch DAT. The average correlation between the g vectors of three educational groups is .88 (combined N=3300). The US GATB manual (1970, chapter 20) gives correlation matrices for large groups of boys and girls in secondary school. The average correlation between the g vectors of the same-age boys and girls is .97 (combined N=26,708).

Several studies report data on adults. The g loadings of the eight subtests of the GATB are reported by te Nijenhuis and van der Flier (1997) for applicants at Dutch Railways and by de Wolff and Buiten (1963) for seamen at the Royal Dutch Navy, resulting in a correlation of .90 (combined N=1306). The US GATB manual (1970) gives correlation matrices for two large groups of adults, which yields a correlation between g vectors of .94 (combined N=4519). Johnson, Bouchard, Krueger, McGue, and Gottesman (2004) report g loadings for a sample that took the WAIS, and Wechsler (1955) reports the correlation matrices of the WAIS for adults of comparable age, so g loadings could be computed. The correlation between the g vectors for the two studies is .72 (combined N=736). So, it appears that g vectors are quite reliable, especially when the samples are very large.

The number of tests in the batteries in the present study varied from 7 to 14. The number of tests does not necessarily influence the size of r_{gd}, but clearly has an effect upon its variability. Because variability in the values of the artifacts influences the amount of variance artifacts explain in observed effect sizes, we estimated this variability using data from the samples described in the previous paragraph.

8.3. Correction for reliability of the vector of score gains

The value of r_{gd} is attenuated by the reliability of the vector of score gains for a given battery. When two samples have a comparable N, the average correlation between vectors is an estimate of the reliability of each vector. The reliability of the vector of score gains was estimated using the present datasets, comparing samples that took the same test and that differed little on background variables. For the comparisons using children, we chose samples that were highly comparable with regard to age and, for the comparisons of adults, we chose samples that were roughly comparable with regard to age. In the GATB manual (1970, ch. 15), 13 combinations of two studies are described where large samples of men and women that are comparable with respect to age and background took the same GATB subtests. The average unweighted correlation between the d vectors of men and women is .83 (total N=3760). In the GATB manual (1970, ch. 20), three combinations of three studies are described where very large samples of boys and girls that are in the same grade in secondary school took the same GATB subtests. This yielded correlations between the d vectors of, respectively, .99, .98, and .94 (total N=20,541). Together, van Geffen (1972) and Bosch (1973) report three Dutch GATB test–retest studies on children in secondary school, resulting in three comparisons between d vectors. The average N-weighted correlation between the d vectors is .47 (total N=127). Vectors of score gains from two different datasets on the WISC-R were compared. Tuma and Appelbaum (1980) tested children with an average age of 10, and Wechsler (1974) tested 10- and 11-year-olds. The correlation between the two d vectors is .71 (total N=147). Comparison of vectors of score gains from datasets on the DAT (Bennett, Seashore, & Wesman, 1974) resulted in correlations of, respectively, .78 and .73, so an average r of .76 (total N=254). So, it appears that d vectors are quite reliable, especially when the samples are very large. We estimated the reliabilities of the d vectors in the database using data from the samples described in this paragraph.

8.4. Correction for restriction of range of g loadings

The value of r_{gd} is attenuated by the restriction of range of g loadings in many of the standard test batteries. The most highly g-loaded batteries tend to have the smallest range of variation in the subtests’ g loadings. Jensen (1998a, pp. 381–382) shows that restriction in g loadedness strongly attenuates the correlation between g loadings and standardized group differences. Hunter and Schmidt (1990, pp. 47–49) state that the solution to range variation is to define a reference population and express all correlations in terms of that reference population. The Hunter and Schmidt meta-analytical program computes what the correlation in a given population would be if the standard deviation were the same as in the reference population. The standard deviations can be compared by dividing the study population standard deviation by the reference group population standard deviation, that is u=S.D.study/S.D.ref. As the reference we took the tests that are broadly regarded as exemplary for the measurement of the intelligence domain, namely the various versions of the Wechsler tests for children. The average standard deviation of g loadings of the various Dutch and US versions of the WISC-R and the WISC-III was 0.128. So, the S.D. of g loadings of all test batteries was compared to the average S.D. in g loadings in the Wechsler tests for children. This resulted in some batteries – such as the GATB – having a value of u larger than 1.00.
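To make the role of u concrete, the sketch below computes u against the reference S.D. of 0.128 and applies the standard correction for direct range restriction (Thorndike's Case II). The text does not say which variant the Hunter and Schmidt program applies, so that formula is an assumption here, and the function names and input values are hypothetical.

```python
import math

REFERENCE_SD = 0.128  # average S.D. of g loadings in the Wechsler tests for children (from the text)

def u_ratio(sd_study, sd_ref=REFERENCE_SD):
    """u = S.D. of g loadings in the study battery / S.D. in the reference batteries."""
    return sd_study / sd_ref

def correct_for_range(r, u):
    """Correlation expected if the S.D. matched the reference (assumed Case II formula)."""
    big_u = 1.0 / u
    return big_u * r / math.sqrt(1.0 + (big_u ** 2 - 1.0) * r ** 2)

# Hypothetical battery whose g loadings vary half as much as the reference.
u = u_ratio(0.064)
print(u, correct_for_range(-0.30, u))

# Hypothetical battery with more variation than the reference (u > 1).
print(correct_for_range(-0.30, 1.5))
```

When u is below 1 the correction increases the absolute value of the correlation; when u is above 1, as the text notes for the GATB, the same formula shrinks it.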

8.5. Correction for deviation from perfect construct validity

The deviation from perfect construct validity in g attenuates the value of r_{gd}. In making up any collection of cognitive tests, we do not have a perfectly representative sample of the entire universe of all possible cognitive tests. So any one limited sample of tests will not yield exactly the same g as any other limited sample. The sample values of g are affected by psychometric sampling error, but the fact that g is very substantially correlated across different test batteries implies that the differing obtained values of g can all be interpreted as estimates of a “true” g. The value of r_{gd} is attenuated by psychometric sampling error in each of the batteries from which a g factor has been extracted.

The more tests and the higher their g loadings, the higher the g saturation of the composite score. The Wechsler tests have a large number of subtests with quite high g loadings, resulting in a highly g-saturated composite score. Jensen (1998a, pp. 90–91) states that the g score of the Wechsler tests correlates more than .95 with the tests’ IQ score. However, shorter batteries with a substantial number of tests with lower g loadings will lead to a composite with a somewhat lower g saturation. Jensen (1998a, ch. 10) states that the average g loading of an IQ score as measured by various standard IQ tests is in the +.80s. When we take this value as an indication of the degree to which an IQ score is a reflection of “true” g, we can estimate that a test’s g score correlates about .85 with “true” g. As g loadings are the correlations of tests with the g score, it is most likely that most empirical g loadings will underestimate “true” g loadings; so, empirical g loadings correlate about .85 with “true” g loadings. As the Schmidt and Le computer program only includes corrections for the first four artifacts, the correction for deviation from perfect construct validity was carried out on the value of r_{gd} after correction for the first four artifacts. To limit the risk of overcorrection, we conservatively chose the value of .90 for the correction.

9. Results

The results of the studies on the correlation between g loadings and gain scores are shown in Table 1. The table gives data derived from 64 studies, with participants numbering a total of 26,990. … It is clear that virtually all correlations are negative and that the size of the few positive correlations is very small.

Table 2 shows the results of the psychometric meta-analysis of the 64 data points. It shows (from left to right): the number of correlation coefficients (K), total sample size (N), the mean observed correlations (r) and their standard deviation (S.D.r), the true correlations one can expect once artifactual error from unreliability in the g vector and the d vector and range restriction in the g vector has been removed (ρ) and their standard deviation (S.D.ρ). The next two columns present the percentage of variance explained by artifactual errors (%VE) and the 95% credibility interval (95% CI). This interval denotes the values one can expect for ρ in 19 out of 20 cases.

The large number of data points and the very large sample size indicate that we can have confidence in the outcomes of this meta-analysis. The estimated true correlation has a value of −.95 and 81% of the variance in the observed correlations is explained by artifactual errors. However, Hunter and Schmidt (1990) state that extreme outliers should be left out of the analyses, because they are most likely the result of errors in the data. They also argue that strong outliers artificially inflate the S.D. of effect sizes and thereby reduce the amount of variance that artifacts can explain. We chose to leave out three outliers – more than 4 S.D. below the average r and more than 8 S.D. below ρ – comprising 1% of the research participants.

This resulted in no change in the value of the true correlation, a large decrease of 74% in the S.D. of ρ, and a large increase of 22% in the amount of variance in the observed correlations explained by artifacts. So, when the three outliers are excluded, artifacts explain virtually all of the variance in the observed correlations. Finally, a correction for deviation from perfect construct validity in g took place, using a conservative value of .90. This resulted in a value of −1.06 for the final estimated true correlation between g loadings and score gains. Applying several corrections in a meta-analysis may lead to correlations that are larger than 1.00 or smaller than −1.00, as is the case here. Percentages of variance accounted for by artifacts larger than 100% are also not uncommon in psychometric meta-analysis. They also occur in other methods of statistical estimation (see Hunter & Schmidt, 1990, pp. 411–414 for a discussion).
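The final step can be checked with simple arithmetic. The sketch assumes the correction is a plain division of the estimated true correlation by the construct validity value, which is consistent with the reported figures (−.95 and .90 giving −1.06); the function name is hypothetical.

```python
def correct_for_construct_validity(rho, validity):
    """Divide the estimated true correlation by the assumed construct validity of g."""
    return rho / validity

final_rho = correct_for_construct_validity(-0.95, 0.90)
print(round(final_rho, 2))  # -1.06
```

Because −.95 is itself a rounded value, the result is only accurate to about two decimals.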

10. Discussion

A large-scale meta-analysis of 64 test–retest studies shows that after corrections for several artifacts there is an estimated true correlation of −1.06 between g loading of tests and score gains and virtually all of the variance in observed correlations is attributable to these artifacts. As several artifacts explain virtually all the variance in the effect sizes, other dimensions on which the studies differ, such as age of the test takers, test–retest interval, test used, average-IQ samples, or samples with learning problems, play no role at all.

The estimated true correlation of −1.06 is the result of various corrections for artifacts that attenuate the correlations. The estimated values of the artifacts may underestimate or overestimate the population values of the artifacts. Therefore, estimates of true effect sizes may overestimate or underestimate the population values of the effect size. As a solution to this problem, Hunter and Schmidt (2004) suggest carrying out several meta-analyses on the same construct and taking the average estimated effect size of all meta-analyses. The general idea is that meta-analysis is a powerful research tool, but does not give perfect outcomes.

A correlation of −1.06 falls outside the range of acceptable values of a correlation, but one has to make a distinction between the meta-analytical estimate of the true correlation between g and d, and the true correlation between g and d. We interpret the value of −1.06 for the meta-analytical estimate as meaning that the true correlation between g and d is −1.00. A correlation of −1.00 means that there is an inverse relationship between g and score gains. So, the tests with the highest g loadings show the smallest gains. The most straightforward interpretation of this very large negative correlation is that there is no g saturation in test–retest gain scores.

11. The South African learning potential study

In a carefully carried-out study, Skuy et al. (2002) used a dynamic testing procedure to see whether it would improve the scores of Black South African students on Raven’s Standard Progressive Matrices (RSPM). […] the correlation of the RSPM scores with performance in the end-of-year psychology examination did not significantly improve after mediation. Once again, the score gains were empty; they did not generalize.

14. Measures and cognitive intervention

It [RSPM] has been established as one of the purest measures of g (Jensen, 1998a). Skuy et al. (2002) found no evidence for test bias against Blacks in South African education. Rushton, Skuy, and Bons (2004) showed that the Raven’s gave comparable predictive validities for students from various groups.

15.4. Correlation between sum scores and score gains

We tested whether individuals with low-g improved their scores more than those with high-g by correlating gain scores with pretest RSPM scores for each of the four research groups. As gain scores tend to be negatively correlated with pretest scores as a function of unreliability (see Cronbach, 1990; Nunnally & Bernstein, 1994), we corrected the correlations using Tucker, Damarin, and Messick’s (1966) formula 63. Using the formula, one adds to each correlation the term (S.D. pretest/S.D. gain score) * (1−reliability pretest).
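The correction as described in the text amounts to adding a positive term to each gain-pretest correlation. The sketch below implements exactly that description; the function name and all input values are hypothetical, since the S.D.s and reliabilities of the four groups are not given here.

```python
def tucker_corrected(r_gain_pretest, sd_pretest, sd_gain, reliability_pretest):
    """Add (S.D. pretest / S.D. gain score) * (1 - reliability pretest) to the correlation."""
    return r_gain_pretest + (sd_pretest / sd_gain) * (1.0 - reliability_pretest)

# Hypothetical group: r = -.60, pretest S.D. 10, gain S.D. 8, pretest reliability .90.
print(tucker_corrected(-0.60, 10.0, 8.0, 0.90))
```

The added term grows as the pretest becomes less reliable, so the correction moves negative correlations toward zero.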

16. Results

16.2. Correlation between score gains and g loadedness

We estimated effect sizes for each of the four groups (race by condition) by computing the difference between mean pretest scores and posttest scores, divided by the standard deviation of the pretest scores of Black and White/Indian/Colored students, respectively. Finally, we calculated the correlations between effect sizes and the g loadings taken from Lynn et al. Correlations were −.24 (p=.10) for the Black experimental group, −.21 (p=.20) for the White/Indian/Colored experimental group, −.08 (p=.59) for the Black control group, and −.41 (p=.01) for the White/Indian/Colored control group. Small sample sizes usually attenuate correlations (Hunter & Schmidt, 1990). Collapsing the groups indeed resulted in higher average correlations: −.39 for the complete experimental group and −.26 for the complete control group.

16.3. g loadings

Using the combined experimental and control group, a principal axis factor analysis on the pretest and posttest scores, respectively, resulted in a first unrotated factor explaining 22% of the variance in the pretest scores and 18% of the variance in the posttest scores. These findings suggest that the g loadedness of the RSPM decreased substantially after Mediated Learning Experience.

16.4. Correlation between score gains and sum score

Correlating score gains with RSPM total scores resulted in values of −.60 (p=.00) for the Black experimental group, −.18 (p=.38) for the Black control group, −.82 (p=.00) for the White/Indian/Colored experimental group, and −.48 (p=.08) for the White/Indian/Colored control group. After the use of the correction formula of Tucker et al. (1966), these correlations became −.39, −.08, −.61, and −.35, respectively. Overall, these correlations show that low-g persons improved their scores more strongly than high-g persons.

17. Discussion

Skuy et al. (2002) hypothesized that the low-quality education of Blacks in South Africa would lead to an underestimate of their cognitive abilities by IQ tests. Groups of Black and White/Indian/Colored students took the Raven’s Progressive Matrices twice, and in between received Feuerstein’s Mediated Learning Experience. The test scores went up substantially in all groups. Evidence for an authentic change in the g factor requires broad transfer or generalizability across a wide variety of cognitive performance. However, Skuy et al. show that the gains did not generalize to scores on another, highly similar test or to external criteria, and were therefore hollow. As the score gains were in some cases quite large – 14 IQ points for the Black experimental group – the question becomes what it is that improved.

The findings show that the correlations between score gains and g loadedness of the items were −.39 for the complete experimental group and −.26 for the complete control group. However, because the g loadings and gain scores are measured at the item level their reliabilities are not high, resulting in substantial attenuation of the correlation between g and d. Moreover, RSPM does not measure g perfectly: Jensen (1998a, p. 91) estimates its g loading at .83. When we estimate the reliability of the g vector at .70 and the reliability of the gain score vector at .50, corrections for unreliability and deviation from perfect construct validity of g only would result in estimated true correlations of, respectively, −.80 and −.53. These values should be taken as underestimates; controlling for additional artifacts will bring them closer to the very strong negative correlation found in the meta-analysis.
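The corrected values can be approximately reproduced with the classical correction for attenuation followed by a division by the RSPM g loading. This is a sketch under the assumption that the authors applied these two steps in this form; the function name is hypothetical, and the inputs are the rounded figures from the text.

```python
import math

def estimated_true_r(r_obs, rel_g, rel_d, construct_validity):
    """Disattenuate for unreliability in both vectors, then for imperfect construct validity."""
    return r_obs / math.sqrt(rel_g * rel_d) / construct_validity

experimental = estimated_true_r(-0.39, 0.70, 0.50, 0.83)
control = estimated_true_r(-0.26, 0.70, 0.50, 0.83)
print(experimental, control)
```

From the rounded inputs this gives roughly −.79 and −.53, which agrees with the reported −.80 and −.53 to within rounding.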

The findings suggest that after training the g loadedness of the test decreased substantially. We found negative, substantial correlations between gain scores and RSPM total scores. Table 4 shows that the total score variance decreased after training, which is in line with low-g subjects increasing more than high-g subjects. Since, as a rule, high-g individuals profit the most from training – as is reflected in the ubiquitous positive correlation between IQ scores and training performance (Jensen, 1980; Schmidt & Hunter, 1998) – these findings could be interpreted as an indication that Feuerstein’s Mediated Learning Experience is not g-loaded, in contrast with regular trainings that are clearly g-loaded. Substantial, negative correlations between gain scores and RSPM total scores are no definite proof of this hypothesis, but are in line with it. Additional substantiation of our hypothesis that the Feuerstein training has no or little g loadedness is that Coyle (2006) showed that gain scores loaded virtually zero on the g factor. Moreover, Skuy et al. reported that the predictive validity of their measure did not increase when the second Raven score was used. The fact that individuals with low-g gained more than those with high-g could be interpreted as an indication that the Mediated Learning Experience was not g-loaded. It should be noted, however, that Feuerstein most likely did not intend his intervention to be g-loaded. He was interested in increasing the performance of low scorers on both tests and external criteria.

18. General discussion

What conclusions can be drawn from such score gains? Jensen’s (1998a) hypothesis that the effects of training on abilities can be summarized in terms of Carroll’s three-stratum hierarchical factor model was tested in a meta-analysis on test–retest data using Dutch, British, and American test batteries, and with learning potential data from South Africa using Raven’s Progressive Matrices. The meta-analysis convincingly shows that test–retest score gains are not g-loaded. The findings from the learning potential study are clearly in line with this: when the attenuation caused by unreliability and other artifacts is taken into account, the correlation between g loadings of items and gains on items has a value that is somewhat comparable to the one found in the meta-analysis for test batteries. The data suggest that the g loadedness of item scores decreases after the intervention training. Te Nijenhuis et al.’s (2001) finding that practice and coaching reduced the g-loadedness of their test scores strengthens the present findings using item scores. The findings show that it is not the high-g participants who increase their scores the most, as is common in training situations, but the low-g participants. This suggests that the intervention training is not g-loaded.

19. Limitations of the studies

Our meta-analysis and our analysis of the South African study are strongly based on the method of correlated vectors (MCV), which has recently been shown to have limitations. Dolan and Lubke (2001) have shown that when comparing groups substantial positive vector correlations can still be obtained even when groups differ not only on g, but also on factors uncorrelated with g. Ashton and Lee (2005) show that associations of a variable with non-g sources of variance can produce a vector correlation of zero even when the variable is strongly associated with g. They suggest that the g loadings of a subtest are sensitive to the nature of the other subtests in a battery, so that a specific sample of subtests may cause a spurious correlation between the vectors. Notwithstanding these limitations, studies using MCV continue to appear (see, for instance, Colom, Haier, & Jung, in press; Hartmann, Kruuse, & Nyborg, in press; Lee et al., 2006). The outcomes of our meta-analysis of a large number of studies using the method of correlated vectors may make an interesting contribution to the discussion on the limitations of the method of correlated vectors.

A principle of meta-analysis is that the amount of information contained in one individual study is quite modest. Therefore, one should carry out an analysis of all studies on one topic and correct for artifacts, leading to a strong increase in the amount of information. The fact that our meta-analytical value of r=−1.06 is virtually identical to the theoretically expected correlation between g and d of −1.00 holds some promise that a psychometric meta-analysis of studies using MCV is a powerful way of reducing some of the limitations of MCV. An alternative methodological approach is to limit oneself to the rare datasets enabling the use of structural equation modeling. However, from a meta-analytical point of view, these studies yield only a quite modest amount of information.

Additional meta-analyses of studies employing MCV are necessary to establish the validity of the combination of MCV and psychometric meta-analysis. Most likely, many would agree that a high positive meta-analytical correlation between measures of g and measures of another construct implies that g plays a major role, and that a meta-analytical correlation of −1.00 implies that g plays no role. However, it is not clear what value of the meta-analytical correlation to expect from MCV when g plays only a modest role. After the present meta-analysis on a construct that clearly has an inverse relationship with g, it would be informative to carry out meta-analyses of studies on variables that are strongly linked to g and variables that are modestly linked to g.