Joep Dragt (master thesis, 2010).
Study 1: Effect of Language Bias in Subtests
When comparing the test scores of people who lack a desirable level of proficiency in the target language and bilinguals (i.e., most immigrants) against the test scores of native speakers, a distinction is usually made between verbal and nonverbal tests. Subtests with a substantial verbal component measure, to an undesirable extent, proficiency in the language of the test and thereby underestimate the level of g of the tested nonnative speakers (see te Nijenhuis & van der Flier for a review of Dutch studies). The more limited the language skills, the larger the underestimate. Language bias plays a clear role in the testing of immigrants in Europe, but also in the testing of Blacks in South Africa, where the English used in the test is sometimes the second or even third language of the Black test taker.
In a study of Dutch immigrants using a mixture of culture-loaded and culture-reduced tests, te Nijenhuis and van der Flier (2003) found that the highly verbal Vocabulary subtest of the GATB is so strongly biased that it depresses the score on Vocabulary by 0.92 SD. This single biased subtest accounts for as much as 1.8 IQ points of the underestimate of g based on GATB IQ, whereas the other seven subtests combined show only very little bias. However, one should not forget that subtests with a strong verbal component usually constitute only a small part of a test battery; due to the use of sum scores the strong bias in tests with a verbal component becomes diluted.
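The dilution effect can be illustrated with a quick back-of-the-envelope computation (a sketch assuming equal subtest weighting, which is why it gives roughly 1.7 rather than the reported 1.8 IQ points; the actual GATB scoring weights presumably differ slightly):

```python
# Dilution of single-subtest language bias in a battery sum score,
# using the GATB figures above (equal subtest weights assumed).
bias_sd = 0.92           # bias on the Vocabulary subtest, in SD units
n_subtests = 8           # total number of GATB subtests
iq_points_per_sd = 15    # conventional IQ metric: 1 SD = 15 points

battery_bias_iq = bias_sd / n_subtests * iq_points_per_sd
print(round(battery_bias_iq, 2))  # ~1.7 IQ points
```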
Looking at the effect of length of residence in the Netherlands on the scores on various intelligence tests also shows the influence of language. Tests without a verbal component show small to negligible correlations with length of residence, tests with a verbal component show moderate correlations, while language proficiency tests show large correlations (see te Nijenhuis & van der Flier, 1999; see van den Berg, 2001, p. 37). All these findings regarding the clear but modest role of language bias are in line with the findings of language bias when testing Hispanics who do not have a desirable level of proficiency in the target language or who are bilingual (Lopez, 1997; Pennock-Román, 1992).
In the US studies on Spearman’s hypothesis it is usually native-born Blacks and Whites who are compared. Therefore language bias is not the problem it is in the study of immigrants and of Blacks and Whites in South Africa. However, studies of Hispanic immigrants may show language bias. In order to combine the diverse studies for a meta-analysis, the effects of language bias had to be taken into account. We did this by leaving out subtests with a substantial language component for immigrants in Europe, for Blacks in South Africa, and, in some studies, for Mexican immigrants in the US. When at least seven subtests remained, we recomputed the correlation between d and g and included that data point in the meta-analyses. Therefore Table 5 of Study 3 in some cases shows two correlations between d and g: one for all subtests and another after excluding one or more subtests with a substantial language component.
What follows is a detailed description, by IQ-test battery, of the subtests we excluded from the analyses due to language bias or potential language bias. First are the Dutch studies of majority-group members and immigrants. In the Dutch RAKIT (Helms-Lorenz et al., 2003; te Nijenhuis et al., 2004; Tolboom, 2000) four subtests were identified as having a substantial language component: Verbal meaning, Learning names, Idea production, and Storytelling. Therefore these subtests were left out of the analysis. Figures 3-10 show scatter plots of subtest patterns before and after language-biased subtests were removed for several immigrant groups on the RAKIT. The Dutch IQ-test battery GATB (te Nijenhuis & van der Flier, 1997, 2005) includes a single subtest with a substantial language bias, namely Vocabulary, and therefore this subtest was left out of the analyses. Furthermore, in the Dutch DAT (te Nijenhuis et al., 2000) two subtests were identified as having a substantial language bias: Vocabulary and Language usage. These subtests were left out of the analyses.
Lynn and Owen (1994) used the JAT (Junior Aptitude Tests) to compare Indians and Blacks with Whites in South Africa. Three subtests were identified as having a substantial language bias: Reasoning, Synonyms, and Memory (paragraphs), and therefore were left out of the analyses.
Valencia and Rankin (1986) compared Mexican-Americans and Anglo-Americans on the K-ABC. The K-ABC (Kaufman Assessment Battery for Children) consists of ten mental processing subtests divided into a sequential (three subtests) and a simultaneous (seven subtests) processing scale. An achievement scale is also present on the K-ABC and consists of seven subtests that cover vocabulary, language development, general factual knowledge, mental arithmetic, and reading. Many of the K-ABC achievement scale subtests are commonly viewed on other tests as measures of verbal intelligence (Reynolds, 1994). Therefore these subtests can negatively influence performance for groups that have taken the test in a nonnative language. Since the Faces and Places, Arithmetic, Riddles, Reading Decoding, and Reading Comprehension subtests may show language bias, we omitted these five subtests from the analyses.
Underestimation of IQ due to language bias
After having identified the subtests with a substantial language bias, the next step was to compute the degree to which these subtests disadvantage people who do not have a desirable level of proficiency in the target language and bilinguals (i.e., most immigrants). Wicherts (2007) argued that a commonly used Dutch IQ test, the RAKIT, underestimates the IQ of ethnic minority children by about 7 points. As we have shown above, several IQ batteries contain subtests with language bias. The calculation of the underestimation of IQ due to language bias required several steps. First, after removing the language-biased subtests, a new linear regression of d scores on g loadings was computed. This results in a regression line and a regression formula not distorted by language bias. Second, the g loadings of the language-biased subtests of the IQ battery were separately entered into the new regression formula, resulting in a value of d for each data point expected solely on the basis of its g-loadedness, that is, without the influence of language bias. By definition, these computed values of d lie on the regression line. Third, this expected value was subtracted from the observed d value still containing the language bias, resulting in a value for the effect of language bias for that specific subtest. Fourth, these outcomes were summed over all language-biased subtests in the battery, and the sum was divided by the total number of subtests administered in the study, thereby also including the language-biased subtests. The result is an estimate, expressed in SDs, of how much the language-biased subtests depress the total IQ score of the battery in question. Table 4 shows the underestimation in IQ points for all the test batteries in our study identified as comprising language-biased subtests.
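The four steps above can be sketched in code (a minimal illustration; the function and variable names are ours, and a conversion of 15 IQ points per SD is assumed):

```python
def language_bias_effect(g, d, biased_idx, iq_points_per_sd=15):
    """Estimate how much language-biased subtests depress a battery's IQ.

    g          : g loadings of all subtests
    d          : observed group differences of all subtests (SD units)
    biased_idx : indices of the language-biased subtests
    """
    biased = set(biased_idx)
    gu = [gi for i, gi in enumerate(g) if i not in biased]
    du = [di for i, di in enumerate(d) if i not in biased]

    # Step 1: least-squares regression of d on g over unbiased subtests only
    n = len(gu)
    mg, md = sum(gu) / n, sum(du) / n
    slope = (sum((a - mg) * (b - md) for a, b in zip(gu, du))
             / sum((a - mg) ** 2 for a in gu))
    intercept = md - slope * mg

    # Steps 2-3: observed d minus the d expected from g-loadedness alone
    total_bias = sum(d[i] - (slope * g[i] + intercept) for i in biased)

    # Step 4: dilute over all subtests, then convert SD units to IQ points
    return total_bias / len(g) * iq_points_per_sd
```

For example, with seven unbiased subtests lying exactly on the line d = g and one biased subtest whose observed d exceeds its expected d by 0.8 SD, the function returns 0.8 / 8 × 15 = 1.5 IQ points of depression.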
Conclusion and Discussion
When comparing different groups, language bias has to be taken into account, because the IQ of people who do not have a desirable level of proficiency in the target language is underestimated. However, this underestimation of IQ appears to be much smaller than the 7 points claimed by Wicherts (2007): the mean underestimation of IQ in the Dutch RAKIT samples is only 3.08 IQ points. The mean underestimation of IQ for all studies in Table 4 is even lower, namely 2.71 IQ points. However, a clear exception is the Kaufman-ABC, which underestimates the IQ of Hispanics by more than ten IQ points, a strong effect.
It is clear that when testing Spearman’s hypothesis, language-biased subtests within a battery obscure the outcomes. For instance, the study by Helms-Lorenz et al. (2004) shows no support for Spearman’s hypothesis; however, removing the subtests with language bias may alter the authors’ conclusions. Therefore, in the meta-analysis on Spearman’s hypothesis, subtests with language bias were taken out.
Study 3: Psychometric meta-analysis of Spearman’s hypothesis
Spearman’s hypothesis states that the different relative magnitudes of the Black/White differences on various tests are a function of each test’s g loading. This hypothesis has since been tested in numerous studies in the US, Europe, Asia, and Africa. However, a meta-analysis on this topic had not previously been conducted. Therefore, in this paper we report the results of a psychometric meta-analysis of Spearman’s hypothesis.
Results Study 3: Psychometric meta-analysis on Spearman’s hypothesis
The results of the studies on the correlation between g loadings and the score differences between groups (d) are shown in Table 5. The table gives data derived from twenty-six studies, comprising thirty-eight data points and a total of 67,715 participants. The table also lists the reference for the study, the cognitive ability test used, the groups that were compared, the correlation between g loadings and d, the harmonic mean, and the mean age (and age range). It is clear that the large majority of the correlations are strongly positive.
Wechsler test batteries as a standard for restriction of range
Table 6 presents the results of the psychometric meta-analysis of the thirty-eight data points where the Wechsler test batteries have been used as the standard for the correction for restriction of range. It shows (from left to right): the number of correlation coefficients (K), total sample size (N), the mean observed correlation (r) and its standard deviation (SDr), the correlation one can expect once artifactual error from unreliability in the g vector, unreliability in the d vector, and range restriction in the g vector has been removed (rho-4), and its standard deviation (SDrho-4), and the true correlation one can expect when corrections for all five artifacts have been carried out (rho-5). The next two columns present the percentage of variance explained by artifactual errors (%VE), and the 80% confidence interval (80% CI). This interval denotes the values one can expect for rho-4 in sixteen out of twenty cases.
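The correction chain can be sketched as follows (a minimal illustration using standard psychometric meta-analysis formulas: a double correction for attenuation, a Thorndike Case II correction for range restriction, and a final construct-validity correction; the function names and example inputs are ours, and applying the 7.5% deviation as a divisor is our assumption, though it reproduces the reported corrections of .84 to .91 and .66 to .71):

```python
import math

def correct_r(r_obs, rel_g, rel_d, u):
    """Correct an observed r(g, d) for unreliability in both vectors and
    for range restriction in the g vector.
    u is the ratio restricted SD / unrestricted SD of the g loadings."""
    # double correction for attenuation (unreliability in both vectors)
    r = r_obs / math.sqrt(rel_g * rel_d)
    # Thorndike Case II correction for range restriction
    return (r / u) / math.sqrt(1 + r ** 2 * (1 / u ** 2 - 1))

def construct_validity(rho4, deviation=0.075):
    """Final correction for deviation from perfect construct validity
    in g, applied here as a divisor (our assumption)."""
    return rho4 / (1 - deviation)
```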
The analysis of all 38 data points yields an estimated correlation (rho-4) of .57, with only 24% of the variance in the observed correlations explained by artifactual errors. However, Hunter and Schmidt (1990) state that extreme outliers should be left out of the analysis, because they are most likely the result of errors in the data. They also argue that extreme outliers artificially inflate the SD of effect sizes and thereby reduce the amount of variance that artifacts can explain.
There are statistical reasons and theoretical reasons to exclude outliers and extreme outliers. A first statistical reason for exclusion is when a data point falls several standard deviations below the mean of the sample of data points without the outliers. A second statistical ground to exclude a data point is when the distribution of data points in the scatter plot is highly uneven, meaning there are large gaps between adjacent data points. A theoretical reason to exclude a data point is when a dataset is dissimilar from all other datasets, for instance when the group in question is quite different from the groups in the other studies. The strongest case for excluding data points is when there are both good statistical and theoretical reasons.
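The first statistical criterion can be sketched as a simple leave-one-out check (an illustrative function of our own; the default k = 10 mirrors the 10 SD rule applied to the present data):

```python
def extreme_outliers(rs, k=10):
    """Flag data points whose correlation lies more than k SDs below
    the mean of the remaining data points (leave-one-out version of
    the first statistical criterion)."""
    flagged = []
    for i, r in enumerate(rs):
        rest = rs[:i] + rs[i + 1:]
        m = sum(rest) / len(rest)
        sd = (sum((x - m) ** 2 for x in rest) / (len(rest) - 1)) ** 0.5
        if sd > 0 and (m - r) / sd > k:
            flagged.append(i)
    return flagged
```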
Figure 12 shows the scatter plot of all correlations r (d x g) against the harmonic mean. We chose to first leave out three extreme outliers, each with a value of r more than 10 SD below the average r of the final sample of thirty-five data points. The study by the Department of Defense (1982) and Lynn and Owen’s (1994) data on Whites/Indians and Whites/Blacks were considered extreme outliers and therefore omitted from the analysis. These are studies with large sample sizes, so meta-analytical theory predicts a high correlation, not the small correlations reported by the authors of the studies. Removing these data points resulted in a substantial change in the value of the correlation (rho-4), a large decrease in the SD of rho-4, and a huge increase in the amount of variance in rho-4 explained by artifacts: 77% of the variance is now explained. We also checked what would happen when six outliers were removed instead of three. The studies by Jensen and Faulstich (1988) and Valencia and Rankin (1986), and Tolboom’s (2000) data on 5.8-year-old Dutch and Moroccan testees, were excluded because there was a huge gap between these studies and the adjacent data points (see Figure 12). The study by Jensen and Faulstich (1988) investigated White and Black prisoners and even found a negative correlation between d and g. The value of rho-4 did not change drastically after removing another three data points, but the SD of rho-4 decreased to 0.0, and the percentage of variance explained by artifacts increased to 130. Finally, a correction for deviation from perfect construct validity in g took place, using the value of 7.5%. This resulted in a value of .71 for the final estimated true correlation between g loadings and group differences.
Group Identity as Moderator
The final estimated true correlation of .71 could be moderated by the groups that have been compared in our analysis of all IQ batteries. Different groups within the US (Blacks, Whites, Native Americans, Hispanics) have been compared, and groups from Europe and Africa have been included in the meta-analysis. To test whether specific group identity moderates the relation between group differences and g, the dataset was first split up into four subsets. This resulted in 1) a subset of studies on Black/White differences, 2) a subset of studies on White/Hispanic differences, 3) a subset of studies on Dutch/immigrant differences, and 4) a subset of studies on group differences in Africa. This last subset contained only three studies, so it was not used for the moderator analyses (see Table 6). Additionally, the Black/White and the White/Hispanic subsets were combined to show the results for all North American studies. The results from these moderator analyses are compared with the outcomes of the psychometric meta-analysis of all IQ batteries minus six outliers. That is, the values of rho-4 and the percentage of variance explained from the moderator analyses are compared to the dataset of Spearman’s hypothesis on IQ batteries from which the outliers have been left out. For the Black/White cluster, the same outliers that had been excluded from the initial set of 38 data points were also excluded in these moderator analyses. This resulted in a Black/White cluster of 11 data points (excluding Department of Defense, 1982; and Jensen & Faulstich, 1988), and a Hispanic/White cluster consisting of four data points, which included the study by Valencia and Rankin (1986). The immigrant/Dutch cluster consisted of all the studies conducted in the Netherlands. Analyzing the subsets separately could yield a percentage of variance explained closer to the theoretically optimal value of 100.
It may also be the case that the values of rho-4 differ substantially by group. In the first dataset, on Black and White groups in the US, rho-4 decreases from .66 to .63. In the second cluster, Hispanics/Whites, rho-4 increases from .66 to .75. In the last subset, containing immigrant and Dutch groups, rho-4 decreases to .63, but with a substantially lower percentage of variance explained by the four artifacts: 38. These results show that the value of rho-4 is quite similar across the various groups and thereby disconfirm the hypothesis that group identity acts as a moderator in the overall meta-analysis.
Test battery as a moderator
Because of the substantially lower outcomes of the RAKIT within the Dutch cluster of studies on immigrants and Dutch, we hypothesized that another specific effect was operating for this group of studies, namely type of test battery. We hypothesized that dividing the Dutch studies into the Dutch RAKIT studies and the Dutch studies where another test battery had been used would result in a larger percentage of variance explained within each group. The collection called ‘Dutch other studies’ includes all studies using any IQ battery other than the RAKIT. The study by Helms-Lorenz et al. (2003) is not entered in either of the two separate analyses, because one of the two IQ batteries used was the RAKIT and so the results could have been contaminated. The results from this moderator analysis are compared with the outcome of the psychometric meta-analysis of all Dutch studies. In the group of RAKIT studies rho-4 remains the same, the SD of rho decreases from .19 to .13, and the percentage of variance explained increases from 38 to 64. When one outlier is removed, namely Tolboom’s (2000) data on 5.8-year-old Dutch and Moroccans, the amount of variance explained increases dramatically to 119. In the ‘Dutch other studies’ cluster rho-4 is much higher (.72) than for all Dutch studies (.63), and the SD of rho is much lower, namely .07. However, the percentage of variance explained is still low, namely 48. When one outlier is removed, namely the Surinamese group from te Nijenhuis and van der Flier (1997), rho-4 increases even more, to .79, the SD of rho decreases further to .04, and the percentage of variance explained increases drastically to 164.
Dutch GATB as a standard for restriction of range
Table 7 presents the results of the psychometric meta-analysis of the thirty-eight data points where the Dutch GATB has been used as the standard for the correction for restriction of range. It has the same format as Table 6. The analysis of all 38 data points yields an estimated correlation (rho-4) of .73, with only 18% of the variance in the observed correlations explained by artifactual errors. The same extreme outliers were left out of this analysis. After first leaving out three extreme outliers, the correlation (rho-4) increased to a value of .84, the SD of rho-4 decreased drastically, and the amount of variance explained in rho-4 by artifacts increased to 58%. We also checked what would happen when instead of three, six outliers were removed. The value of rho-4 did not change drastically, but the SD of rho-4 decreased drastically to a value of .01, and the percentage of variance explained by artifacts increased to 98. Finally, a correction for deviation from perfect construct validity in g took place, using the value of 7.5%. This resulted in a value of .91 for the final estimated true correlation between g loadings and group differences.
Group Identity as Moderator
Although artifacts explained 98% of the variance in the data points, the final estimated true correlation of .91 could in theory be moderated by the groups that have been compared in our analysis of all IQ batteries. Groups were formed in the same way, and the same outliers were excluded, as in the previous set of analyses. In the first dataset, on Black and White groups in the US, rho-4 remains at a value of .85. In the second cluster, Hispanics/Whites, rho-4 increases from .85 to .89. In the last group of studies, containing Dutch and immigrant groups, rho-4 decreases to .78, but with a substantially lower percentage of variance explained by the four artifacts: 30%. These results disconfirm the hypothesis that the different groups used in the overall meta-analysis act as a moderator: the values of rho-4 in the three groups are highly similar.
Test battery as a moderator
The composition of the groups in this analysis is the same as in the previous set of analyses. In the group of RAKIT studies rho-4 decreased to .74, the SD of rho decreased from .15 to .12, and the variance explained increased from 30% to 66%. When one outlier is removed, namely Tolboom’s (2000) data on 5.8-year-old Dutch and Moroccans, the amount of variance explained increased dramatically to 123%. In the ‘Dutch other studies’ cluster rho-4 is much higher (.87) than for all Dutch studies (.78), and the SD of rho is much lower, namely .03. Moreover, the variance explained increased as well, namely to 70%. When one outlier is removed, namely te Nijenhuis and van der Flier’s (1997) Surinamese group, rho-4 increased even more, namely to .91. The SD of rho decreased further to 0 and the variance explained increased drastically to 288%.
Tables 6 and 7 show percentages of variance explained as being larger than 100%. This phenomenon is called “second-order sampling error”, and results from the sampling of studies in a meta-analysis. Percentages of variance explained greater than 100% are not uncommon when only a limited number of studies are included in an analysis. The proper conclusion is that all the variance is explained by statistical artifacts (see Hunter & Schmidt, 2004, pp. 399-401, for an extensive discussion).
Bare-bones analysis of educational and training criteria
A bare-bones psychometric meta-analysis was carried out to estimate the size of the correlation between group differences in educational and training criteria and g loadings. A bare-bones psychometric meta-analysis estimates how much of the observed variance in findings across studies is due to sampling error (a function of sample size) alone. Criteria with a substantial language bias were removed from the analysis in the same manner as with the IQ batteries; an example is given in Figures 13 and 14. The results of the studies on the correlation between g loadings and score differences in educational and training criteria are shown in Table 8. Mostly high correlations are found between g loadings and differences in educational and training criteria.
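A bare-bones analysis of this kind can be sketched as follows (an illustrative implementation of the standard Hunter-Schmidt bare-bones formulas; the function name and inputs are ours, not the thesis's code):

```python
def bare_bones(rs, ns):
    """Bare-bones meta-analysis: sample-size-weighted mean correlation,
    observed variance of correlations, and the percentage of that
    variance expected from sampling error alone."""
    n_tot = sum(ns)
    r_bar = sum(r * n for r, n in zip(rs, ns)) / n_tot
    # observed variance of correlations, weighted by sample size
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / n_tot
    # expected sampling-error variance for the average sample size
    n_bar = n_tot / len(ns)
    var_err = (1 - r_bar ** 2) ** 2 / (n_bar - 1)
    pct_ve = 100 * var_err / var_obs if var_obs > 0 else 100.0
    return r_bar, var_obs, pct_ve
```

With small sets of studies, the returned percentage can exceed 100 through second-order sampling error, as discussed above for Tables 6 and 7.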
Table 9 shows that the bare-bones meta-analysis yields a correlation between group differences in educational and training criteria and g loadings of .67, using only the single correction for sampling error. The SD of r is large, namely .27. Moreover, the variance accounted for by artifactual errors is negligible at 4%. When two statistical outliers are removed, namely Tolboom’s (2000) data on 9.8-year-old Turks and Moroccans, the correlation increases to .79, but the variance explained is still low, namely 17%. Most likely this is a result of the fact that all groups have highly comparable sample sizes, meaning there is severe restriction of range in N.