Jan te Nijenhuis, Henk van der Flier (2003)
When a positive correlation occurs between g-loadedness of a cognitive test and variable X, the result is termed a ‘‘Jensen effect.’’ Virtually all Dutch studies comparing immigrants with Dutch show that the differences in mean scores on intelligence tests are dependent on the g loading of the tests and are therefore clear Jensen effects. However, Helms-Lorenz, van de Vijver, and Poortinga [Intelligence 31 (2003) 9] challenge the overwhelming finding of Jensen effects and suggest that group differences on measures of phenotypic intelligence are more strongly caused by cultural differences than by g. We attempted to replicate Helms-Lorenz et al.’s finding of absence of Jensen effects on two large samples that took both culture-loaded and culture-reduced tests. We found proof of Jensen effects, a very large language bias effect in one specific intelligence test, and quite small cultural bias effects in the remaining intelligence tests. It appears that Helms-Lorenz et al.’s findings are an outlier among all empirical studies, which may be caused by the use of an unrepresentative set of tests.
1.1. Spearman’s hypothesis
Spearman’s hypothesis (Spearman, 1927) holds that Black–White differences in mean scores on intelligence tests are dependent on the g loading of the tests. Differences in scores between groups are supposed to be larger as the g loadings of tests are higher. Jensen (1992a) states that Spearman’s hypothesis is to be regarded as an empirical fact and:
This variation in the size of W-B mean differences has not been found to be systematically related to any of the surface features of psychometric tests, such as verbal or nonverbal, spatial or numerical, individual or group, paper-and-pencil or performance tests, specific knowledge content, or objective indices of test bias. (Jensen, 1993)
Jensen (1985) gives a review of the literature that confirms earlier conclusions. Two recent studies in an industrial psychology journal are in line with Spearman’s hypothesis. Goldstein, Yusko, Braverman, Smith, and Chung (1998) showed that group differences in scores on various assessment center exercises are largely attributable to their cognitive complexity. Roth, Bevier, Bobko, Switzer, and Tyler’s (2001) meta-analysis of ethnic group differences in cognitive ability in employment and educational settings shows differences that are largest on the most g-loaded measures. These studies show variation in the size of the correlation coefficient, with values as low as .20 or .30, but the finding of positive correlations is so ubiquitous and the number of low correlations so small that Jensen’s (1992a) statement that Spearman’s hypothesis is an empirical fact seems warranted.
1.2. Jensen effects
Rushton (1998) proposed that when a positive correlation occurs between g-loadedness and variable X, the result be termed a ‘‘Jensen effect,’’ because otherwise there is no name for it, only a long explanation of how the effect was achieved. Thus, the use of the term ‘‘Spearman’s hypothesis’’ may be restricted to research in the United States with Black and White groups and can be seen as a special case of the general Jensen effect. Jensen effects have now been found in South Africa, comparing various groups (Lynn & Holmshaw, 1990; Lynn & Owen, 1994; Nagoshi, Johnson, DeFries, Wilson, & Vandenberg, 1984; Rushton, 2001, 2002; Rushton & Skuy, 2000; Rushton, Skuy, & Fridjohn, 2002); studies comparing East Asian and Western children show more equivocal results (Jensen & Whang, 1993; Ja-Song & Lynn, 1992; Lynn & Shigehisa, 1991; Lynn, Chan, & Eysenck, 1991); Dutch studies comparing immigrants with Dutch show clear Jensen effects (te Nijenhuis, 1997; te Nijenhuis, Evers, & Mur, 2000; te Nijenhuis, Tolboom, Resing, & Bleichrodt, in press; te Nijenhuis & van der Flier, 1997, submitted). Of the 16 comparisons in all Dutch studies, only te Nijenhuis et al. (in press) report one comparison that results in a small, negative correlation (r = -.18) between g loadings and standardized group differences. Mean IQ scores of immigrants in the Netherlands are roughly one S.D. lower than those of the majority group (te Nijenhuis & van der Flier, 2001). As many West European countries resemble each other in that most of their immigrants come from Third World countries, including former colonies, it may be that the Dutch findings are generalizable to these West European countries. These findings of Jensen effects render the many explanations for mean differences in measures of phenotypic intelligence between these groups, which are put in terms of strong cultural differences between majority group members and immigrants, less plausible. 
The only clear, empirically supported biasing factor is proficiency in the majority language (te Nijenhuis & van der Flier, 1999; see also Lopez, 1997; Pennock-Román, 1992; Sandoval & Durán, 1998). Jensen (1992b, 1998) distinguishes between the constructs of g and vehicles for measuring g. It appears that when assessing immigrants using as vehicle a test with a strong verbal component, this vehicle reflects the degree of second-language acquisition more than it reflects the construct g. Thus, there is proof for test bias, but, generally, the effects on the total score of a test battery are not strong.
1.3. Cultural effects
Cross-cultural test research in Third World countries is quite relevant for the testing of non-native-born, non-majority-language-speaking immigrants. Cultural loading is the generic term for implicit and explicit references to a specific cultural context, usually the culture of the test author, in the testing instrument (Helms-Lorenz, van de Vijver, & Poortinga, 2003). van de Vijver and Poortinga (1992) distinguish five potential sources of cultural loadings:
the tester (e.g., when tester and testee are of a different cultural background);
the testees (e.g., intergroup differences in educational background, scholastic knowledge, and testwiseness);
tester–testee interaction (e.g., communication problems);
response procedures (e.g., differential familiarity with time limits in test procedures);
cultural loadings in the stimuli (e.g., differential suitability of items for different groups due to stimulus familiarity).
Cross-cultural research has provided us with many examples of groups not possessing the specific testing skills presupposed by standard ability tests to a sufficient degree (Berry, Poortinga, Segall, & Dasen, 1992; Deregowski, 1979; Hudson, 1960, 1967). A meta-analysis (van de Vijver, 1997) shows small correlations between years of schooling and IQ scores for young children in developing countries. Of course, some tests are more widely applicable than others and less dependent on specific learning experiences. However, to what degree can these cross-cultural findings be generalized to populations of immigrants into Western countries? Important factors such as percentage of illiteracy and number of years of schooling clearly differ between developing countries and immigrant populations. It appears that the empirical findings on immigrants resemble more strongly those of absence of bias when comparing American Blacks and Whites than those of nonoptimal testing skills and underestimates of g from cross-cultural research.
In excellent, innovative papers Helms-Lorenz (2001) and Helms-Lorenz et al. (2003) challenge both the finding that in general tests show little bias and the overwhelming finding of Jensen effects in the Netherlands and suggest that group differences are more strongly caused by cultural differences than by g. It should be noted that with regard to the empirical tests for Jensen effects in the Netherlands they report only 4 of the 16 comparisons (of which 15 are Jensen effects) carried out to date. They attempt to show that Jensen effects are not empirical facts when testing immigrants, but that occurrence of Jensen effects depends on the absence of culture-loaded subtests in the battery used. The tests used are as follows: the RAKIT, a standardized intelligence test comparable to the WISC; the SON-R, a nonverbal test originally developed for testing deaf children; and two subtests of the TAART, a computer-assisted cognitive ability test. The TAART has been developed to assess simple cognitive processes, with little influence of cultural and linguistic knowledge, using comparisons of geometric forms on a computer screen. The two subtests measure Processing Speed (gp) and Perceptual Speed (gv), respectively, both of quite low cognitive complexity. The cultural loading of all tests was rated by 25 third-year psychology undergraduates, who had followed at least two courses in cross-cultural psychology. A rating of 0 meant no cultural loading and a rating of 5 meant very high cultural loading. The cultural loadings ranged from 1.21 (test of Induction using abstract figures, gf) to 4.03 (test of Lexical Knowledge, gcr). They report a substantial negative correlation (r = -.41) between g loading and effect sizes for a sample of 12 cognitive tests on large but nonrepresentative samples of 6- to 12-year-old Dutch children (n = 747) and second-generation immigrant children (n = 474).
1.4. Research questions
Eysenck (1988) describes two different traditions in intelligence measurement. Galton suggested physiological measures and simple tests of sensory capacity and motor reactivity, such as reaction times. Binet, on the other hand, preferred tests involving problem solving, learning, memory, and evidence of past learning such as vocabulary and other types of verbal achievement. On a continuum from culture loaded to culture reduced (Jensen, 1980), tests in the Galtonian tradition can be considered as culture-reduced tests, whereas Binet-type tests can be considered as more culture-loaded tests. We decided to attempt to replicate Helms-Lorenz et al.’s finding of the absence of a Jensen effect, using two samples that took both culture-loaded and culture-reduced tests with a range of cultural loadedness comparable to that in the Helms-Lorenz et al. study.
This paper tests whether group differences between immigrants and majority group members are best described as Jensen effects or as cultural effects. First, do two samples that took a combination of culture-reduced and culture-loaded tests show Jensen effects? Second, do Jensen effects show up when the collection of tests is divided into separate sets of culture-reduced and culture-loaded tests? Third, are culture-loaded tests more biased than culture-reduced tests, as reflected in the regression line of the culture-reduced tests being situated substantially below the regression line of the culture-loaded tests?
Two studies were carried out in which data of job applicants on both culture-reduced and culture-loaded tests were used.
2.1. Research participants
This article made use of test data on first-generation immigrants and majority group members who applied for blue-collar jobs at the Dutch Railways (Studies 1 and 2), jobs as bus driver for regional bus companies (Studies 1 and 2), and jobs as truck driver in road transport and haulage (Study 2).
The first study made use of test data on the complete population of first-generation immigrants who applied for blue-collar jobs at the Dutch Railways and regional bus companies in the Netherlands from 1988 until 1992. The application process included a psychological examination, which took place at the Work Conditions Service Unit of the Dutch Railways in 10 centers throughout the Netherlands. The immigrant sample consisted of Turks (n = 217), North Africans (n = 103), Surinamese (n = 370), and Antilleans (n = 96). The data from Turks and North Africans, and Surinamese and Antilleans, were clustered. Turks and North Africans come from Muslim cultures and do not speak Dutch as their mother tongue, whereas Surinamese and Antilleans come from (former) Dutch colonies in South America and the Caribbean, respectively, so they are familiar with the Dutch language through education. The theoretical considerations for clustering are supported by empirical outcomes: Intercorrelations between subtests, mean scores, and amount of bias in the tests are highly comparable between the groups in each cluster (te Nijenhuis & van der Flier, 1997, submitted).
Within all 10 centers we randomly drew a sample from the complete local population of Dutch applicants applying for the same jobs as the immigrant applicants, in such a way that the percentages matched those in the immigrant group. This resulted in a sample (n = 584) from the majority group with a distribution with respect to jobs and regions that was close to that in the complete immigrant group (see te Nijenhuis & van der Flier, 1997, submitted, for all details on jobs applied for and demographics).
In the second study, the immigrant sample comprised the complete population of first-generation immigrant job applicants from 1994 to 1996 who were tested with the computerized versions of so-called safety aptitude tests. They consisted mainly of Turks, North Africans, Surinamese, and Antilleans (n = 267). A sample (n = 283) was selected from the Dutch group following the same standard procedures as in Study 1 (see te Nijenhuis, 1997, for details).
2.2.1. Attention Diagnostic Method and Computerized ADM
The Attention Diagnostic Method (ADM; Rutten & Block, 1976) and the Computerized ADM (CADM; van Drie & Schoonman, 1993) measure attention disorders. The ADM consists of a series of numbers that are placed in random order on a screen. In Part 1, the candidate has to look up the numbers in the order from 11 to 59. When the number is found, the candidate has to give its value and color. In Part 2, there are smaller sized numbers under the main numbers. Again, the candidate has to look up the numbers in the order from 11 to 59. When the number is found, the candidate has to give the value of the smaller sized number and the color of the main number. The traditional version uses a fluorescent board with numbers and is administered in a darkened room. In the computerized version, the screen with numbers has been turned 90° and the order of the numbers is different. In the traditional version the responses are oral, whereas in the computerized version the candidate responds only by pressing a colored button and no longer has to give the value of the number. In both parts of the traditional ADM, both the total time needed in seconds and the number of mistakes are registered. Two large-scale studies found comparability of dimensions and reliabilities, no language bias, and no group differences in meaning; thus it was concluded that there was no bias in either version (te Nijenhuis, 1997; te Nijenhuis & van der Flier, 2002).
2.2.2. Determinations Gerät and Computerized DTG
The Determinations Gerät (DTG) [Determination apparatus] and the Computerized DTG (CDTG) are perceptual motor tests. They measure the ability to react, sensorimotor coordination ability, and precision of reactions. In nonsystematic order, visual and acoustic stimuli are presented to which specific reactions must be given. The visual stimuli are presented on a screen; the acoustic stimuli are presented over a headphone. The reactions consist of pressing buttons on the reaction screen with the fingers and using pedals with the feet. The visual stimuli consist of five differently colored lights that appear on different places on the screen. A correct reaction consists of pressing the button with the same color. Two other visual stimuli are fixed yellow lamps on the left and right sides of the screen for the DTG and two yellow stimuli for the CDTG. A correct reaction consists of pressing the left or the right pedal. The acoustic stimuli consist of low and high tones. A correct reaction consists of using the left or the right black buttons on the reaction screen.
The test consists of three time-driven parts, in which the intervals between the stimuli are 1.1, 1.0, and 0.8 s, and two reaction-driven parts, each 150 s. Four performance measures are registered: number of correct, number late, number of mistakes, and number of omissions (van Drie & Schoonman, 1993). The traditional version consists of a mechanical apparatus, whereas the computerized version consists of an operating panel and a computer screen. The assignment remains unchanged, although some alterations were made in the computerized version with regard to the layout of the screen and the colored buttons on the operating panel. Two large-scale studies found comparability of dimensions and reliabilities, no language bias, and no group differences in meaning; thus it was concluded that there was no bias in either version (te Nijenhuis, 1997; te Nijenhuis & van der Flier, submitted). For both safety-aptitude tests, the most relevant variables were taken (see te Nijenhuis & van der Flier, 2002).
2.2.3. General Aptitude Test Battery
The General Aptitude Test Battery (GATB) is a general intelligence test and comprises eight subtests: Three Dimensional Space, Vocabulary, Arithmetic Reason, Computation, Tool Matching, Form Matching, Name Comparison, and Mark Making. Apart from general intelligence, the subtests measure the broad abilities Fluid Intelligence, Crystallized Intelligence, Broad Visual Perception and General Psychomotor Speed. In the Vocabulary subtest, test takers are presented with four words and have to find two words that have the same or opposite meaning. The test measures both Induction (gf) and Lexical Knowledge (gcr); for immigrants with nonoptimal proficiency in Dutch the subtest acts more as a measure of gcr than of gf (te Nijenhuis & van der Flier, 1997).
Following Helms-Lorenz et al. (2003), GATB subtests were taken as measures of culture-loaded tests, whereas safety aptitude tests were taken as measures of culture-reduced tests. Our sample of tests probably has a broader range of cultural loadings than the Helms-Lorenz et al. sample of tests.
2.3. Statistical analyses
Jensen (1993) states that seven methodological requirements for the testing of Spearman’s hypothesis have to be met:
1. The samples should not be selected on any highly g loaded criteria.
2. The variables should have reliable variation in their g loadings.
3. The variables should measure the same latent traits in all groups. The congruence coefficient of the factor structure should have a value of > .85.
4. The variables should measure the same g in the different groups; the congruence coefficient of the g values should be >.95.
5. The g loadings of the variables should be determined separately in each group. If the congruence coefficient indicates a high degree of similarity, the g loadings of the different groups should be averaged.
6. To rule out the possibility that the correlation between the vector of g loadings (Vg) and the vector of mean differences between the groups, or effect sizes (VES) is strongly influenced by the variables’ differing reliability coefficients, Vg and VES should be corrected for attenuation by dividing each value by the square root of its reliability.
7. The test of Spearman’s hypothesis is the Pearson correlation (r) between Vg and VES. To test the statistical significance of r, Spearman’s rank order correlation (rs) should be computed and tested for significance.
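Requirements 4 to 7 can be illustrated with a minimal sketch. The function names and the input values below are hypothetical and serve only as illustration; they are not data from the studies reported here. The rank transformation assumes that there are no ties, and the congruence coefficient is computed in the usual way as the normalized sum of cross-products of two loading vectors.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n; assumes no tied values (a simplification of this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def congruence(a, b):
    """Congruence coefficient between two vectors of loadings."""
    cross = sum(x * y for x, y in zip(a, b))
    return cross / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def correlated_vectors(g_group1, g_group2, d, reliability):
    """Return rc, Pearson r, and Spearman rs for Vg and VES."""
    # Requirement 4: the g loadings of both groups should be congruent (> .95).
    rc = congruence(g_group1, g_group2)
    # Requirement 5: average the g loadings that were determined per group.
    g_mean = [(a + b) / 2 for a, b in zip(g_group1, g_group2)]
    # Requirement 6: divide each value by the square root of its reliability.
    vg = [g / math.sqrt(r) for g, r in zip(g_mean, reliability)]
    ves = [e / math.sqrt(r) for e, r in zip(d, reliability)]
    # Requirement 7: Pearson r is the test; rs is used for significance testing.
    return {"rc": rc, "r": pearson(vg, ves), "rs": pearson(ranks(vg), ranks(ves))}

# Hypothetical values for four subtests, for illustration only.
result = correlated_vectors(
    g_group1=[0.30, 0.50, 0.70, 0.60],
    g_group2=[0.35, 0.45, 0.65, 0.62],
    d=[0.40, 0.70, 1.20, 0.90],
    reliability=[0.80, 0.85, 0.90, 0.75],
)
print(result)
```

The significance test of rs itself is left out of the sketch; with the small numbers of data points that are typical here, an exact or tabled test would be needed.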
The g loadings for the GATB were computed, using the first unrotated factor of a principal axis factor analysis (Jensen & Weng, 1994). Because of the limited sampling of broad abilities of the GATB, it is not optimal for a precise and theoretically sound estimate of g loadings. The best estimate of the g loadings was found in a factor analytic study of the Dutch version of the GATB 1002 A with a large number of other tests, using the first unrotated factor of a principal axis factor analysis (Dutch GATB Manual; van der Flier & Boomsma-Suerink, 1994, p. 51). These estimated values of the g loadings were used for the correlation of Vg and VES. This procedure departs somewhat from Jensen’s fifth requirement, but here it seems preferable. The g loadings of the safety aptitude tests were computed by correlating the g score of the GATB with the scores on the safety aptitude tests. For more details concerning the testing for Jensen effects we refer to previous studies (te Nijenhuis, 1997; te Nijenhuis & van der Flier, 1997, submitted).
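The factor-extraction step can be sketched as follows. This is a simplified illustration and not the procedure used for the Dutch GATB Manual: it extracts a single factor only, starts from the largest absolute off-diagonal correlation as communality estimate, and finds the dominant eigenvector by power iteration, which presupposes a positive manifold. The function names and the correlation matrix in the example are hypothetical.

```python
import math

def dominant_eigenpair(matrix, n_iter=50):
    """Largest eigenvalue and its eigenvector by power iteration."""
    p = len(matrix)
    v = [1.0 / math.sqrt(p)] * p  # positive start keeps the loadings positive
    for _ in range(n_iter):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the normalized vector.
    eigval = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p))
    return v, eigval

def first_principal_axis_factor(corr, n_iter=200, tol=1e-10):
    """Loadings on the single factor of an iterated principal axis solution."""
    p = len(corr)
    # Initial communalities: largest absolute off-diagonal correlation per variable.
    h2 = [max(abs(corr[i][j]) for j in range(p) if j != i) for i in range(p)]
    loadings = [0.0] * p
    for _ in range(n_iter):
        # Reduced correlation matrix: communalities replace the unit diagonal.
        reduced = [[h2[i] if i == j else corr[i][j] for j in range(p)] for i in range(p)]
        v, eigval = dominant_eigenpair(reduced)
        new_loadings = [math.sqrt(eigval) * x for x in v]
        change = max(abs(a - b) for a, b in zip(new_loadings, loadings))
        loadings = new_loadings
        h2 = [x * x for x in loadings]  # update the communalities
        if change < tol:
            break
    return loadings

# Hypothetical correlation matrix generated by one factor with known loadings.
true_g = [0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
corr = [[1.0 if i == j else true_g[i] * true_g[j] for j in range(6)] for i in range(6)]
print([round(x, 3) for x in first_principal_axis_factor(corr)])
```

In this sketch the g loading of an additional test, such as a safety aptitude test, would then be the Pearson correlation between that test's scores and the g scores derived from the battery.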
In both studies a series of stepwise analyses was carried out, with the outcomes of one step determining whether to carry out a subsequent step. To check whether the correlations between Vg and VES were due to g, the regression of the standardized mean group differences (d) on the estimated g loadings was computed. This resulted in correlations between d and g and their regression equation for, consecutively, (1) the combination of both the GATB and the safety aptitude tests, (2) the subtests of the GATB, (3) the GATB without the language-bias-sensitive subtest Vocabulary, and (4) the various scores on the safety tests. Concerning the safety aptitude tests, both sets of experimentally independent measures (based on independent observations) and experimentally dependent measures were used. Finally, the regression lines of the GATB were compared with the extrapolated regression lines of the safety tests. When the regression line for the culture-loaded tests is higher than the regression line for the culture-reduced tests, this strongly suggests that there is bias in the culture-loaded tests; culture-loaded tests with the same g loadings as culture-reduced tests would then show a larger difference between two groups. Using the regression equations, the amount of bias was estimated by computing how many S.D. units the regression lines differed. Differences were computed for g = .50, .60, and .70, respectively, because these are the most common g loadings of the GATB subtests. Because of the small number of data points, it was decided to report only effect sizes and not to test for statistical significance. However, the data points are based on quite large samples.
In the first study, Jensen’s methodological requirements for the testing of Jensen effects were met for the GATB. The research participants in this study varied from train cleaner to rail maintenance expert and were not selected on any highly g loaded criteria, so there is no indication that the g variance in the samples is markedly restricted. The best estimate of the g loadings was found in a factor analytic study of the Dutch version of the GATB 1002 A with a large number of other tests, using the first unrotated factor of a principal axis factor analysis (Dutch GATB Manual; van der Flier & Boomsma-Suerink, 1994, p. 51). The subtests have reliable variation in their estimated g loadings, with a range from .14 to .68. The third requirement is that the tests should measure the same latent traits in the various groups; the hierarchical structural equations models in te Nijenhuis and van der Flier (1997) show that this was the case. A comparison of the empirical g loadings of the majority group with the empirical g loadings of the immigrant groups resulted in values of the congruence coefficient that varied between .978 and .995. The empirical g loadings are therefore highly comparable in the different groups. The estimated values of the g loadings were used for the correlation of Vg and VES. This procedure departs somewhat from Jensen’s fifth requirement, but here it seems preferable. Each value in Vg and VES was corrected for attenuation (see te Nijenhuis & van der Flier, 1997, for a detailed description of the procedure), and the correlation between the disattenuated vectors was computed.
In the first study, Jensen’s methodological requirements were also met for the traditional safety tests. From the experimentally independent variables a choice was made for ADM1TIME, ADM2TIME, DTG1.1TD number of correct, DTG1.0TD number of mistakes, DTG0.8TG number of omissions, DTG1RG number of correct, DTG2RG number of mistakes, because this set of variables seems a good representation of the ADM and the DTG. Correlational analyses showed that the variables measured the same latent traits in all groups. The variation in g loadings might be called good, that is, .14 to .56 for the majority group and .19 to .59 for the immigrants. The g loadings are highly comparable for both groups, so they may be averaged for the testing of the two groups combined. The standardized mean score differences were not divided by the root of the reliability for each variable: As several test–retest reliabilities were below .60, correcting for unreliability would lead to overcorrections. A detailed description of all analyses is given in te Nijenhuis and van der Flier (submitted).
In the second study, Jensen’s methodological requirements were met for the GATB and the computerized safety tests. The immigrant sample and the Dutch sample strongly resemble the samples in Study 1. As expected, the findings for the GATB strongly resemble the findings reported above, so they are not reported here. To test for Jensen effects on the computerized safety aptitude tests, first the g loadings of the safety aptitude tests have to be determined. They were computed by correlating the g score of the GATB with the scores on the safety aptitude tests. The groups included in the second study were not selected based on criteria that could limit their variance on g.
For the experimentally dependent variables, aggregated variables were computed for the CADM and the CDTG: The z scores of all three or five comparable test parts were added up. This resulted in the following variables: CADM time, CADM mistakes, CADM dips, CDTG correct, CDTG mistakes, CDTG omissions, and CDTG late. The variation in g loadings might be called good, that is, .28 to .57 for the majority group and .15 to .60 for the immigrants. The g loadings are generally somewhat higher for the immigrants. A comparison of the unrotated factor solutions for the two groups resulted in a value of the congruence coefficient (rc) of .99. The rc for the g loadings of the two groups is .99. The g loadings are highly comparable for both groups, so they may be averaged for the testing of the two groups combined. The g loadings and the standardized score differences were corrected for unreliability (see te Nijenhuis, 1997, for a detailed description of the procedure). The reliabilities for majority group members and the immigrants have an rc of .99. When testing the hypothesis for the complete group, the reliability coefficients were averaged.
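The aggregation of comparable test parts into one variable can be sketched as follows. The sketch assumes that each part is standardized over the sample that is passed in and that the sample standard deviation is used; neither detail is specified above, and the function names and scores are hypothetical.

```python
import statistics

def z_scores(values):
    """Standardize one test part over the given sample."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample S.D.; an assumption of this sketch
    return [(v - mean) / sd for v in values]

def aggregate_parts(parts):
    """Sum the z scores of comparable parts, giving one score per candidate.

    parts: one list of scores per test part, with candidates in the same order.
    """
    standardized = [z_scores(part) for part in parts]
    return [sum(per_candidate) for per_candidate in zip(*standardized)]

# Hypothetical number-of-mistakes scores for three candidates on two parts.
print(aggregate_parts([[1, 2, 3], [10, 20, 30]]))
```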
For the experimentally independent variables a choice was made for CADM1TIME, CADM2 number of mistakes, CDTG1.1TD number of correct, CDTG1.0TD number of mistakes, CDTG0.8TG number of omissions, CDTG1RG number of correct, CDTG2RG number of mistakes, because this set of variables seems a good representation of the CADM and the CDTG. The variation in g loadings might be called good, that is, .18 to .53 for the majority group and .09 to .58 for the immigrants. The g loadings are generally somewhat higher for the immigrants. A comparison of the unrotated factor solutions for the two groups resulted in a value of the congruence coefficient (rc) of .96. The rc for the g loadings of the two groups is .97. The g loadings are highly comparable for both groups, so they may be averaged for the testing of the two groups combined. The g loadings and the standardized score differences were corrected for unreliability (see te Nijenhuis, 1997, for a detailed description of the procedure). The reliabilities for majority group members and the immigrants have an rc of .99. When testing the hypothesis for the complete group, the reliability coefficients were averaged.
3.1. Study 1
Table 1 shows that for the Turks and North Africans on the combination of the traditional safety aptitude tests and the GATB, d and g correlated .78 (ES = 2.09g-.18); on the GATB, d and g correlated .75 (ES = 2.22g-.19); on the GATB with the subtest Vocabulary left out, d and g correlated .84 (ES = 1.67g-.02); on the traditional safety aptitude tests d and g correlated .70 (ES = 1.35g+.04). To estimate the effects of cultural bias, several outcomes were compared. To estimate the bias in the subtest Vocabulary we compared the empirical d with the extrapolated regression line of the culture-reduced tests at g = .68, as this is the g loading of this specific subtest. The difference was 1.08 S.D. for traditional safety tests and .83 S.D. and .86 S.D. for experimentally dependent and experimentally independent computerized safety tests, respectively. These biasing effects are so strong, they substantially influence the steepness of the regression lines. Therefore, when estimating the bias in the remaining culture-loaded subtests, we decided to compare the regression line of the safety tests with the regression line of the GATB without the Vocabulary subtest. In this way, the extreme amount of cultural bias in Vocabulary will not influence the assessment of bias in the other seven subtests. The regression line for the culture-loaded tests is somewhat above the extrapolated regression line of the culture-reduced tests; Table 2 shows that the difference between the regression lines is between .10 S.D. (g = .50) and .16 S.D. (g = .70).
Table 1 shows that for the group of Surinamese and Antilleans on the combination of the traditional safety aptitude tests and the GATB, d and g correlated .60 (ES = 1.04g+.15); on the GATB, d and g correlated .67 (ES = 1.33g-.02); on the traditional safety aptitude tests, d and g correlated .22 (ES = .38g+.39). The value of .60 for the combination of tests and .67 for the GATB show that clear Jensen effects are present. The value r = .60 is based on 15 data points and therefore appears to be quite robust. How then do we explain the low, positive correlation for the safety tests? As this mixed group consists predominantly of Surinamese, and as they speak Dutch fluently, language bias is implausible. Therefore, there has to be another reason for the low correlation. As there are only a small number of data points, the correlation is sensitive to outliers. These outliers may be comparable to those found in tests for Jensen effects in groups of East Asians: group differences on broad or even narrow abilities that appear to have nothing to do with cultural bias. We therefore decided not to compare the regression lines.
3.2. Study 2
On the combination of the computerized safety aptitude tests and the GATB, d and g correlated .83 (ES = 1.94g-.41); on the GATB, d and g correlated .81 (ES = 2.20g-.54); on the GATB with the subtest Vocabulary left out, d and g correlated .88 (ES = 1.81g-.41); on the computerized safety aptitude tests, d and g for the experimentally independent variables correlated .62 (ES = 1.08g-.09), and for the experimentally dependent variables d and g correlated .71 (ES = 1.21g-.15). The regression line for the culture-loaded tests is somewhat above the extrapolated regression line of the culture-reduced tests, both for experimentally dependent and experimentally independent variables; Table 2 shows that the difference between the regression lines is between .04/.05 S.D. (g = .50) and .16/.19 S.D. (g = .70). These values are highly comparable to the values from Study 1.
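The comparison of regression lines can be retraced from the regression equations reported in the text (Study 1: Turks and North Africans; Study 2: the combined immigrant sample). The sketch below evaluates the line for the GATB without Vocabulary and the extrapolated line for the safety tests at g = .50, .60, and .70 and takes the difference in S.D. units. The variable and function names are hypothetical; the slopes and intercepts are those given above.

```python
def line(slope, intercept):
    """Regression line ES = slope * g + intercept."""
    return lambda g: slope * g + intercept

def gap(culture_loaded, culture_reduced, g):
    """Distance in S.D. units between the two regression lines at loading g."""
    return culture_loaded(g) - culture_reduced(g)

# Study 1 (traditional safety tests).
study1_gatb_no_vocabulary = line(1.67, -0.02)
study1_safety = line(1.35, 0.04)

# Study 2 (computerized safety tests).
study2_gatb_no_vocabulary = line(1.81, -0.41)
study2_safety_independent = line(1.08, -0.09)
study2_safety_dependent = line(1.21, -0.15)

for g in (0.50, 0.60, 0.70):
    s1 = gap(study1_gatb_no_vocabulary, study1_safety, g)
    s2_dep = gap(study2_gatb_no_vocabulary, study2_safety_dependent, g)
    s2_ind = gap(study2_gatb_no_vocabulary, study2_safety_independent, g)
    print(f"g = {g:.2f}: Study 1 {s1:.3f}, Study 2 dependent {s2_dep:.3f}, "
          f"independent {s2_ind:.3f}")
```

Because the GATB lines are steeper than the safety-test lines in both studies, the estimated gap grows with the g loading.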
Thus, both Study 1 and Study 2 lead to the conclusion that in virtually every comparison clear Jensen effects are found, the effects of cultural bias in Vocabulary are very large, and the biasing effects in the other culture-loaded tests are quite small. Averaging all the bias estimates leads to a mean value of .92 S.D. for Vocabulary and a mean value of .12 S.D. for the other seven subtests. Based on Vocabulary, which is one of the eight subtests, this suggests that the GATB underestimates g by .92 S.D./8 = .12 S.D. (1.8 IQ points). Based on the other seven subtests, this suggests that the GATB underestimates g by 7 x .12 S.D./8 = .10 S.D. (1.5 IQ points). All in all the GATB appears to underestimate immigrants’ g by .22 S.D. (3.3 IQ points), a clear, but small effect.
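The arithmetic behind this estimate can be retraced in a few lines. The sketch assumes that the eight subtests contribute equally to the total score and that one S.D. corresponds to 15 IQ points; the latter is not stated explicitly but follows from the reported values (.12 S.D. = 1.8 IQ points). The variable names are hypothetical.

```python
IQ_POINTS_PER_SD = 15   # assumption, implied by .12 S.D. = 1.8 IQ points
N_SUBTESTS = 8          # subtests of the GATB

mean_bias_vocabulary = 0.92   # mean bias estimate for Vocabulary, in S.D.
mean_bias_other = 0.12        # mean bias estimate for the other seven subtests

# Each subtest is assumed to carry 1/8 of the total score.
share_vocabulary = mean_bias_vocabulary / N_SUBTESTS
share_other = (N_SUBTESTS - 1) * mean_bias_other / N_SUBTESTS
total_sd = share_vocabulary + share_other

print(f"Vocabulary: {share_vocabulary:.3f} S.D.")
print(f"Other seven subtests: {share_other:.3f} S.D.")
print(f"Total: {total_sd:.2f} S.D. = {total_sd * IQ_POINTS_PER_SD:.1f} IQ points")
```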
When testing immigrants, in general there is little test bias and in virtually all studies one finds Jensen effects. These findings, however, are challenged by Helms-Lorenz et al. (2003), who suggest that differences between immigrants and majority groups are more strongly caused by cultural differences than by g. We attempted to replicate Helms-Lorenz et al.’s finding of absence of Jensen effects on two samples that took both culture-loaded and culture-reduced tests, but we found very clear Jensen effects. Clear Jensen effects also showed up in virtually all comparisons when the collection of tests was divided into separate sets of culture-reduced and culture-loaded tests. Cultural bias in the subtest Vocabulary is very large, which is best explained by the low level of Dutch language proficiency of immigrants. The regression line of the culture-reduced tests was below the regression line of the culture-loaded tests, but the effects are on average quite small, showing that there is a clear but small amount of cultural bias in the culture-loaded tests. Thus, there is proof of Jensen effects, a very large bias effect in one specific test, and quite small bias effects in the remaining tests. These cultural bias effects are most likely related to language proficiency. The data suggest that the GATB underestimates g by only 3.3 IQ points.
4.1. Limitations of this study
The correlations used when testing for Jensen effects are sensitive to outliers, especially when the number of data points is small, as is usually the case. The resulting values are not robust and should be treated with caution. Language bias in subtests is not always easy to detect, because in some studies various groups of immigrants, both with good and nonoptimal language skills, for instance, first- and second-generation immigrants, were combined. This combination of various subgroups may also override other cultural variables in one specific subgroup. As the GATB subtests are used for the computation of the g score and the safety tests are not, this may lead to an underestimate of the difference between the regression lines.
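The sensitivity of such correlations to a single outlier when the number of data points is small can be illustrated with a short Python sketch. The g loadings and d values below are invented purely for illustration and are not data from this or any other study; the helper name is likewise hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for six subtests, invented for illustration only.
g_loadings = [0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
d_regular = [0.20, 0.35, 0.45, 0.60, 0.75, 0.85]
# Same values, except that the most g-loaded subtest shows a small difference.
d_with_outlier = d_regular[:-1] + [0.10]

r_regular = pearson_r(g_loadings, d_regular)
r_outlier = pearson_r(g_loadings, d_with_outlier)
print(f"r without outlier: {r_regular:.2f}")
print(f"r with one outlier: {r_outlier:.2f}")
```

With only six data points, changing a single value is enough to move the correlation from close to 1 to a low positive value, which is why correlations based on few subtests should be interpreted with caution.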
4.2. Cultural bias
Following van de Vijver and Poortinga (1992), we used cultural bias as a generic term for references to a specific cultural context in the testing instrument; the most important cultural bias is a low level of proficiency in the language of the test (Lopez, 1997; Pennock-Román, 1992; Sandoval & Durán, 1998; te Nijenhuis & van der Flier, 1999). This study is in line with findings from other studies on cultural bias: Standardized IQ tests can be used quite well, though not perfectly, for the assessment of immigrants; standardized tests give in general a clear but small underestimate of g; tests with a substantial language component give a strong underestimate of g for immigrants with nonoptimal proficiency in the language of the test; and tests without a substantial language component do not lead to a large underestimate of g. From a practical point of view, one might suggest correcting for this underestimate and using the corrected g for predictive purposes. However, as there are no strong indications of differential prediction using short-term criteria (te Nijenhuis & van der Flier, 1999), this would result in overpredicting short-term school results, training results, and job performance. There are no published data on prediction of long-term criteria. It appears that immigrants have to work harder for a living than their Dutch colleagues with the same level of g.
van de Vijver and Poortinga (1992) argue that cultural loadings need not be detrimental. The (un)desirability of cultural loadings in measurement procedures is determined by the intended use of the test in question, that is, by the generalizations envisaged from the scores. The elimination of biased items does not necessarily imply an increase in the predictive validity of the test, because future school and work achievement itself may have a high cultural loading. In this case, cultural loading is unavoidable and even desirable. Indeed, leaving out language-bias-sensitive subtests may result in substantially reduced criterion-related validity (see te Nijenhuis et al., 2000). The size of the underestimate of g depends most strongly on the number of language-bias-sensitive tests. This may explain the .30 S.D. difference between the RAKIT and the SON-R in the Helms-Lorenz et al. study, the SON-R consisting of only nonverbal tests.
4.3. Limitations of the Helms-Lorenz et al. study
Our findings are based on first-generation adult immigrants, arguably the group with the most difficulties on the labor market. It seems plausible that for future generations less bias will be found. Helms-Lorenz et al.’s sample consists of second-generation children (the large majority of parents were born outside the Netherlands). However, about 75% of their sample are Turkish and Moroccan, and usually young children in these groups communicate with their parents in a language other than Dutch. It is usually found that children from these groups have a 2-year lag with regard to Dutch language proficiency. Thus, their language proficiency may be comparable to that of first-generation immigrants.
The test for Jensen effects on the combined sample of culture-reduced and culture-loaded tests is a direct attempt to replicate Helms-Lorenz et al.’s finding of the absence of Jensen effects. Our test is even stronger, because the applicants in the two samples took all tests, whereas Helms-Lorenz et al.’s data on the RAKIT and the TAART on the one hand, and the SON-R and the TAART on the other hand, are from different persons.
Jensen (1998) states that several requirements for the testing of the hypothesis of the link between group differences and g have to be met. One of those is that the samples have to be representative of their respective populations. Helms-Lorenz et al. did not use representative sampling, but convenience sampling, which is clearly reflected in the extreme variations in effect sizes between the six age groups: Effect sizes for the RAKIT subtest Exclusion range from -.92 to .39 (mean of -.26) and the effect sizes for the SON-R subtest Situations range from -.82 to .02 (mean of -.36). Although Helms-Lorenz et al. used students from several schools, their means should be treated with caution. In contrast, our immigrant samples constitute all immigrant job applicants for jobs at Dutch Railways during a specific period, and the majority group samples are careful, representative matches of the complete immigrant group of job applicants. This does not mean that our sample is representative of the population of immigrants. There are, for instance, no highly educated persons in our sample; the range in IQs is larger in the Helms-Lorenz et al. study. However, the level of the jobs applied for is comparable to the level of the jobs most immigrants apply for.
4.4. Different estimates of bias
An important point is that for the majority of the subtests the estimates of bias in this study, using regression lines (about .12 S.D.), are roughly twice the size of those from item bias studies (about .05 S.D.) (te Nijenhuis & van der Flier, 1997). However, one should not forget that the bias effects are small in both cases; thus, one could state that both estimates are roughly comparable.
The differences between estimates of bias in Vocabulary in this study and in item bias studies (te Nijenhuis & van der Flier, 1997) can only be called extreme. It could be hypothesized that most item bias detection techniques fail to identify bias that is systematically present in every item of a subtest, although the amount of bias does not have to be the same for every item. The GATB is a highly speeded test and most item bias techniques pick up only group differences in power and not in speed (Holland & Wainer, 1993; Millsap & Everson, 1993).
4.5. Alternative explanation: the influence of broad and narrow abilities and cultural bias
The strong form of Spearman’s hypothesis states that group differences in test scores are solely attributable to g, whereas the weak form states that they are also influenced by group differences in broad and narrow abilities. When matched on IQ, Blacks score higher on several broad abilities and on specific narrow abilities, such as short-term memory, but they score lower on broad visual perception (Reynolds & Jensen, 1983). Jensen and Weng (1994) stated that the goodness of the g extracted from a set of tests depends on, among other things, (a) the number of tests, (b) the number of different mental abilities represented by the various tests, and (c) the degree to which the different types of tests are equally represented in the set. The g factor varies across different sets of tests to the extent that the sets depart from these criteria. Departing from these criteria may influence the correlations in tests of Spearman’s hypothesis: It may result in low or even negative correlations.
A visual inspection by the present authors of all 16 figures from empirical studies of Jensen effects in the Netherlands suggests that immigrants and Dutch, when matched on IQ, differ on the Broad Abilities Memory and Broad Visual Perception. In addition, the presence of tests with a strong cultural bias, such as the RAKIT’s subtest Word Meaning, may lower correlations. It is remarkable that the only Dutch study where previously a Jensen effect was absent used the RAKIT (te Nijenhuis et al., in press) and that this very RAKIT is also used in the Helms-Lorenz et al. study. It is striking that at least 7 of the 12 tests used by Helms-Lorenz et al. fall in the abovementioned categories: Word Meaning is the clearest language-bias-sensitive test; Learning Names measures Memory; Discs, Hidden Figures, Mosaics, Situations, and ECT2 all measure Broad Visual Perception. This calls into question the validity of Helms-Lorenz et al.’s regression line. Helms-Lorenz et al.’s attempt to show that the occurrence of Jensen effects depends on the absence of culture-loaded subtests in the battery used may be biased, because they use an unrepresentative set of tests.
Jensen effects are found in virtually all Dutch studies. It appears that Helms-Lorenz et al.’s finding of absence of a positive correlation between g loadings and standardized group differences on measures of phenotypic intelligence is an outlier among all empirical Dutch studies. It is suggested that this absence of a Jensen effect is caused by the unrepresentativeness of the sampling of broad abilities.
Clear Jensen effects were found in virtually all comparisons in this paper. Little cultural bias was found, with the exception of strong bias in a verbal test, leading to a clear but small underestimate of g of about 3.3 IQ points. So, Jensen effects and cultural effects are both present, but Jensen effects are much stronger in explaining group differences.