**Spearman’s hypothesis tested with Raven’s Progressive Matrices: A psychometric meta-analysis**

Jasper Repko, master’s thesis, 2011.

**Method**

Instruments

There are four versions of the Raven’s: the Standard Progressive Matrices (SPM) for the ages of 6 years to adulthood; the Colored Progressive Matrices, an easier version of the test designed for children aged 5 through 12; the Advanced Progressive Matrices (APM), a harder version of the test designed for older adolescents and adults with higher ability; and the Standard Progressive Matrices Plus (SPM plus), an extended version of the SPM offering more discrimination among more able young adults.

The SPM consists of 60 diagrammatic puzzles, each with a missing part that the test taker attempts to identify from several options. The 60 puzzles are divided into five sets (A, B, C, D, and E) of 12 items each. To ensure sustained interest and freedom from fatigue, each problem is boldly presented, accurately drawn, and, as far as possible, pleasing to look at. No time limit is set and all testees are allowed to complete the test. As an untimed capacity test, and even as a 20-min speed or efficiency test, the SPM is usually regarded as a good measure of the nonverbal component of general intelligence rather than of culturally specific information. The total score is also a very good measure of g, the general factor of intelligence, at least within Western countries (Jensen, 1980).

The Colored Progressive Matrices (CPM) is designed for young children, the elderly, and people with moderate or severe learning difficulties. The test consists of 36 items. The first 12 items are the same as the first 12 items from the SPM (set A); the following 12 items are specially designed for the CPM. The last 12 items are again items from the SPM, this time items 13-24 (set B). Most of the items are presented on a colored background in order to keep the attention of the participants. The last couple of items from the test are, however, presented in black and white. If the participants manage to complete the test without too much difficulty they can continue without any problems with sets C, D, and E of the SPM.

The APM is a more difficult version of the Raven Progressive Matrices. The advanced form of the Matrices contains 48 diagrammatic puzzles, presented as one set of 12 (set 1), and another of 36 (set 2). Set 1 serves as an introduction, and set 2 is used to test the participants and compare scores. Items are presented in black ink on a white background, and become increasingly difficult as progress is made through each set. These items are appropriate for adults and adolescents of above average intelligence. Whenever we refer to the APM in this paper, we refer to set 2 of the APM.

The SPM Plus is the extended version of the SPM and is meant to restore the discriminative ability of the test (Raven, 1998). The SPM Plus was introduced as a revised version of the SPM. It consists of new items, most of which have been equated to match the old items in difficulty, but some are more difficult than any that appear on the SPM.

Specific Criteria for Inclusion

Five specific decision rules for selecting relevant studies for this meta-analysis were used. First, only studies reporting a test of Spearman’s hypothesis using the Raven’s Progressive Matrices, computing a correlation between standardized group differences and g loadings, were included in the analysis. Second, because we knew at the outset that there were only a few studies explicitly testing Spearman’s hypothesis, we also included studies reporting the proportion of a sample selecting the correct answer on the items (d), studies reporting the correlation between the item score and the total score (g), and studies reporting both d and g. Aside from the published studies, either on Spearman’s hypothesis or reporting usable data, we also used data sets provided to us by various well-known authors. To make sure that the same methods were applied to all datasets in the meta-analysis, we made use of correlations that we computed ourselves instead of the correlations reported in the original articles. Our choice of decision rules implies that many studies using the Raven’s could not be used because the data were not reported at the item level: when participants are tested on the Standard Progressive Matrices, the Colored Progressive Matrices, or the Advanced Progressive Matrices, in the large majority of cases only the percentile scores and group averages are reported.

Third, following the manual of the APM (Raven, 1998) and Rushton, we decided to exclude studies from our meta-analysis where the APM was administered with a time limit of less than 30 minutes. The outcomes of the White sample of the Vigneau and Bors (2005) study do not correlate well with the outcomes of the studies using representative White groups. However, the time limit was set to 30 minutes and therefore we included the study in our analysis. We hypothesized that this dataset would show up as an extreme outlier in the meta-analysis, so it would be left out as the meta-analysis advanced. Unfortunately we were unable to include a large sample from the University of Amsterdam (UvA) in our analysis. First-year psychology students at the UvA are obliged to take a series of tests and questionnaires. One of the tests that is taken is Raven’s APM. We were allowed access to the last ten years of test data. Given the fact that every year approximately 450 students take these tests we had access to a sample of about 4500 students. However, the time limit was set to 22 minutes, so the students were unable to try all of the 36 items, which yielded incomplete data. None of our other datasets included a group where the time limit was also set to 22 minutes, so we were unable to match the group of first-year psychology students and this rendered the data unusable.

Fourth, we decided to exclude the White German sample from the Raven Advanced Progressive Matrices manual (Court & Raven, 1982), the White sample from Vejleskov (1968), the White sample from Moran (1986), and the White sample from Forbes (1964) because we did not have comparable samples of either Blacks or Indians to match these White samples with. Furthermore we decided to exclude the data of the g loadings from the German sample from all our analyses (Raven, 1982) because these g loadings show weak correlations with all the other available g loadings. In contrast, all the other available g loadings without exception correlate highly with each other. We speculate that the g loadings of the White German sample were calculated with a different technique from the one that we are using. The values of d of the German sample (Court & Raven, 1982) do seem usable, but since we did not have a group to match them with we did not use this part of the data either.

Fifth, according to John Raven (personal communication, 2010) there appears to be a ceiling effect taking place in some samples taking Raven’s Standard Progressive Matrices. Ceiling effects happen when the assignment is too easy, so most people get the answer right, resulting in many perfect scores and reduced variance. A ceiling effect makes it difficult to assess the ability of the people who took the assignment and because of that it is hard to discriminate between these people. The dataset of Lynn, Allik, and Irwing (2004) appears to show such a ceiling effect for all of their age categories ranging from 12 to 18, meaning that we could not compare different age categories of this sample with each other to calculate reliabilities for both d and g. We were, however, able to use the dataset to compute score differences for multiple age categories between Blacks and Whites. We found that as soon as we matched corresponding age categories with each other it provided usable results.

Correction for Deviation from Perfect Construct Validity

Te Nijenhuis and Dragt (2010) state that the deviation from perfect construct validity in g attenuates the values of r (g × d). In making up any collection of cognitive tests, we do not have a perfectly representative sample of the entire universe of all possible cognitive tests. Therefore any one limited sample of tests will not yield exactly the same g as another such sample. The sample values of g are affected by psychometric sampling error, but the fact that g is very substantially correlated across different test batteries implies that the differing obtained values of g can all be interpreted as estimates of a “true” g. The values of r (g × d) are attenuated by psychometric sampling error in each of the batteries from which a g factor has been extracted (te Nijenhuis & Dragt, 2010).

The more tests and the higher their g loadings, the higher the g saturation of the composite score is. The Wechsler tests have a large number of subtests with quite high g loadings, yielding a highly g-saturated composite score. Jensen (1998, pp. 90–91) states that the g score of the Wechsler tests correlates more than .95 with the tests’ IQ score. However, shorter batteries with a substantial number of tests with lower g loadings will lead to a composite with somewhat lower g saturation. Jensen (1998, ch. 10) states that the average g loading of an IQ score as measured by various standard IQ tests lies in the +.80s. When this value is taken as an indication of the degree to which an IQ score is a reflection of “true” g, it can be estimated that a test’s g score correlates about .85 with “true” g. As g loadings represent the correlations of tests with the g score, it is most likely that most empirical g loadings will underestimate “true” g loadings; therefore, empirical g loadings correlate about .85 with “true” g loadings. As the Schmidt and Le (2004) computer program only includes corrections for the first four artifacts, the correction for deviation from perfect construct validity has to be carried out on the values of r (g × d) after correction for the first four artifacts (te Nijenhuis & Dragt, 2010).

Previous studies (te Nijenhuis & Franssen, 2010; te Nijenhuis & Jongeneel-Grimen, 2007; te Nijenhuis, de Pater, van Bloois, & Geutjes, 2009; te Nijenhuis & van der Flier, submitted; te Nijenhuis, van Vianen, & van der Flier, 2007) used a conservative value of .90 as a basis to limit the risk of overcorrection. However, te Nijenhuis and Dragt (2010) computed a g score based on 24 subtests and various g scores based on combinations of eleven subtests (the most common number of subtests in a battery); this yielded an average correlation between their estimate of “true g” and the other g’s of 0.925. The new method for computing the correction for imperfectly measuring g from te Nijenhuis and Dragt (2010) shows that the correction used before was too strong. In all previous studies a correction of ten percent was applied to compute rho-5, but te Nijenhuis and Dragt (2010) rounded off the value for the correction of the fifth artifact to 7.5% for the computation of rho-5. However, te Nijenhuis and Dragt (2010) worked on test batteries, whereas in the present meta-analysis the focus is on the Raven’s. Based on our reading of the literature, for instance the dataset from the Minnesota Twin Study by Bouchard (Bouchard, Lykken, McGue, Segal, & Tellegen, 1990), we estimate that the total score on the Raven’s has a g loading of about .75. We remind the reader that the correlation of the item score with the Raven’s total score is an estimate of an item’s g loading. So this suggests that for research at the item level a correction of 7.5% has to be combined with a correction of 25%, yielding a total correction of 32.5%.
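The arithmetic behind the 32.5% figure can be written out as follows. The symbols c_battery, c_Raven, and c_total are labels introduced here for illustration only; they do not appear in the thesis. The text states only that the two percentages are combined by addition; how the resulting percentage is applied to rho-4 to obtain rho-5 is not specified in this section, so it is left out.

```latex
\begin{align*}
c_{\text{battery}} &= 1 - 0.925 = 0.075 \\
c_{\text{Raven}}   &= 1 - 0.75  = 0.25 \\
c_{\text{total}}   &= c_{\text{battery}} + c_{\text{Raven}} = 0.075 + 0.25 = 0.325
\end{align*}
```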

Estimating g

The total score on the Raven’s is a good measure of g, so the item-total correlation gives an estimate of the g loadings of the items on the test (Jensen & Weng, 1994). In this paper we used both the g loadings provided by the authors in the various papers, and the g loadings that we computed ourselves using data supplements. It was impossible to compute the g scores of every available study because the necessary data were not reported in many of the articles or the data supplements used in our study, so in many cases we had to use the g scores that were calculated by the authors of the articles.

Correlations between various mental tests range from slightly greater than 0 to slightly less than 1, but they are always positive except for sampling error or statistical artifacts (Jensen, 1987a). Until someone can devise a cognitive test that has a true negative correlation with other mental tests, which no one has yet succeeded in doing, it can be accepted as a fact that all types of mental tests are positively correlated (Jensen, 1992). So, when negative g loadings of items were reported in the articles that have been used for this meta-analysis, we did not exclude these items, but we changed the value of the g loadings to .00.
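As a rough illustration of the two steps described above (item-total correlation as an estimate of an item's g loading, and negative loadings set to .00), here is a minimal Python sketch. The function names and the toy response matrix are hypothetical, the item is left in the total score (an uncorrected item-total correlation, which is how the text reads), and treating a zero-variance item as a loading of 0 is an assumption made here, not something the thesis states.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Assumption: an item everyone passes (or fails) has no variance,
        # so its correlation is undefined; we report 0.0 here.
        return 0.0
    return sxy / sqrt(sxx * syy)


def item_g_loadings(responses):
    """Estimate item g loadings as item-total correlations.

    responses: one list of 0/1 item scores per test taker.
    Negative correlations are set to 0.0, as described in the text.
    """
    totals = [sum(person) for person in responses]
    n_items = len(responses[0])
    loadings = []
    for j in range(n_items):
        item_scores = [person[j] for person in responses]
        r = pearson(item_scores, totals)
        loadings.append(max(r, 0.0))
    return loadings


# Toy data (hypothetical): four test takers, four items.
toy_responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
toy_loadings = item_g_loadings(toy_responses)
```

In the toy data the last item correlates negatively with the total score, so its loading is floored at zero rather than dropped.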

We used in the large majority of the cases the g loadings of the White group. Exceptions were made for the following cases: the g of the Roma was used to compute the correlation between the Roma sample of Rushton, Čvorović, and Bons (2007) and the Black sample of Rushton and Skuy (2000). For the correlation between the Whites of Lynn, Allik, and Irwing (2004) and the Blacks of Rushton and Skuy (2000) we used the g of the White sample from Rushton and Skuy (2000). For the correlation between the Whites of Lynn, Allik, and Irwing (2004) and the Blacks of Rushton, Skuy, and Fridjhon (2002) we used the g of the White sample from Rushton, Skuy, and Fridjhon (2002). The g of the Whites from Rushton (2002) was used to compute the correlation between the Whites of Lynn, Allik, and Irwing (2004) and the Blacks of Rushton (2002).

A general aim of the meta-analysis was to get the best possible estimates of d and g. A general principle of psychometric meta-analysis is that combining a substantial number of datasets reduces error and increases the quality of the estimates. We therefore computed an aggregated g for, respectively, the Standard Progressive Matrices and the Advanced Progressive Matrices by combining all the available data of these tests. However, the g loadings of teenagers and the g loadings of adults yielded a low correlation, so we concluded that they were not comparable. For the SPM we therefore decided to compute both an aggregated g loading for groups with a mean age of 14, and one for groups with the lower bound of the age range being 18 years. For the APM the only data that were available to us were of adults, so we only computed one aggregated g loading for this measurement instrument. The three different aggregated gs based on several datasets were used to compute the correlation between d and g, in addition to the gs that were based upon individual datasets. So for every d we calculated two correlations, one with the g of the White group (unless, as stated, we used a different g), and one with the matched aggregated g.

The weighted average g loadings were computed, matching the age range of the participants to the age range of the g loadings as closely as possible. The only exception was the g of the Roma taking the SPM (Rushton, Čvorović, & Bons, 2007), a group with a minimum age of 17. This g was found to be incomparable to other samples with an age of above 17, but highly comparable to the g of the group with an average age of 14 years old. Therefore we decided to add the g of the Roma to the SPM group with an average age of 14 years old. Aggregated g loadings were obtained by combining all the available g loadings to form one aggregated g for every age group. This was done by multiplying the g loadings of every group by the number of participants in that group, adding up the multiplied scores of the groups, and dividing that number by the total number of participants used to calculate the specific aggregated g. This way, the largest datasets were weighted most strongly. This was done for every single item.
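The sample-size weighting described above can be sketched in a few lines of Python. The function name and the example numbers are hypothetical; the computation itself (multiply each group's loading by its N, sum, divide by the total N, item by item) follows the description in the text.

```python
def aggregate_g(loadings_by_sample, n_by_sample):
    """Sample-size-weighted mean g loading for every item.

    loadings_by_sample: one list of item g loadings per sample
                        (all lists must have the same length).
    n_by_sample:        number of participants in each sample.
    """
    total_n = sum(n_by_sample)
    n_items = len(loadings_by_sample[0])
    aggregated = []
    for j in range(n_items):
        weighted_sum = sum(
            loadings[j] * n
            for loadings, n in zip(loadings_by_sample, n_by_sample)
        )
        aggregated.append(weighted_sum / total_n)
    return aggregated


# Hypothetical example: two samples, two items.
example_g = aggregate_g([[0.2, 0.6], [0.4, 0.6]], [100, 300])
```

With N = 100 and N = 300, the first item's aggregated loading is (0.2 × 100 + 0.4 × 300) / 400 = 0.35, so the larger sample pulls the estimate toward its own value.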

Estimating d

Following Rushton (Rushton & Skuy, 2000; Rushton, Skuy, & Fridjhon, 2002, 2003; Rushton, Skuy, & Bons, 2004), the pass rates from every group were first changed to standard scores by use of a Z-transformation before being subtracted from each other. It should be noted that the descriptions in Rushton’s Method section do not always match the outcomes. However, our computations are in line with Rushton’s outcomes. Score differences between two groups (d) were computed by subtracting the mean proportion correct of the lower scoring group from the mean proportion correct of the higher scoring group (to generally obtain positive scores). Groups were matched based on average age and education.
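To make the d × g computation concrete, here is a minimal Python sketch. It rests on a loudly stated assumption: the text does not say which Z-transformation was used, so the sketch takes one common reading, the inverse normal CDF of each item's pass rate. The clamping of pass rates of exactly 0 or 1, all function names, and the example numbers are likewise hypothetical and are not taken from the thesis or from Rushton's papers.

```python
from math import sqrt
from statistics import NormalDist


def z_from_pass_rate(p, eps=1e-3):
    """Convert a pass rate to a z score via the inverse normal CDF.

    Assumption: this is one possible reading of "Z-transformation".
    Pass rates are clamped to [eps, 1 - eps] because the inverse CDF
    is undefined at exactly 0 and 1 (the clamp is also an assumption).
    """
    p = min(max(p, eps), 1 - eps)
    return NormalDist().inv_cdf(p)


def d_vector(pass_high, pass_low):
    """Item-level differences: higher-scoring group minus lower-scoring group."""
    return [
        z_from_pass_rate(h) - z_from_pass_rate(l)
        for h, l in zip(pass_high, pass_low)
    ]


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical pass rates for four items in two groups, plus g loadings.
pass_high = [0.90, 0.80, 0.60, 0.40]
pass_low = [0.85, 0.70, 0.40, 0.20]
g_loadings = [0.2, 0.3, 0.5, 0.6]

d_values = d_vector(pass_high, pass_low)
r_dg = pearson(d_values, g_loadings)  # the r(d x g) for this one comparison
```

Each group comparison yields one such r(d × g); those correlations are the data points that enter the meta-analysis.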

**Results**

We conducted a series of seven meta-analyses with the data that we gathered. The first four meta-analyses were conducted using the original g loadings from individual datasets. The last three meta-analyses were conducted using the aggregated g loadings matching the age range of the comparisons of d.

Because the SPM and APM are very similar tests and the values of the SD of the g loadings are nearly the same (.120 for the SPM and .114 for the APM), we combined the data points from the two tests. For our analyses we used the meta-analytical program from Schmidt and Le (2004) to calculate the meta-analytical correlations between d and g, based upon the corrections for the five statistical artifacts mentioned in Method.

For the first analysis we included all the data that we gathered. Figure 4.1 shows all the 34 combinations of r(d x g) that were used in this meta-analysis.

The outcomes of the first analysis are reported in Table 1. K stands for the number of studies, N_{H} stands for the total harmonic N, r stands for the bare-bones correlation between d and g, and SD_{r} is the standard deviation of this correlation. Rho-4 stands for the correlation between d and g corrected for the first four statistical artifacts, and SD_{rho-4} is the standard deviation of this correlation. Rho-5 stands for the correlation between d and g corrected for all five of the statistical artifacts, and %VE is the percentage of the variance between the K studies that is explained by the artifacts. The 80% CI stands for the credibility interval. The correlation between d x g is 1.27 and the percentage explained variance is only 12.
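For readers unfamiliar with these quantities, the following Python sketch shows a bare-bones Hunter and Schmidt style computation of the weighted mean r, SD_{r}, %VE, and an 80% credibility interval. It is not the Schmidt and Le (2004) program and applies none of the five artifact corrections, so it cannot reproduce rho-4 or rho-5; it corrects for sampling error only. The function name and the example inputs are hypothetical.

```python
from math import sqrt


def bare_bones(rs, ns):
    """Bare-bones meta-analysis of correlations (sampling error only).

    rs: observed correlations, one per data point.
    ns: sample size attached to each correlation.
    """
    k = len(rs)
    total_n = sum(ns)
    # Sample-size-weighted mean correlation.
    r_bar = sum(n * r for r, n in zip(rs, ns)) / total_n
    # Sample-size-weighted observed variance of the correlations.
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / total_n
    # Variance expected from sampling error alone.
    n_bar = total_n / k
    var_err = (1 - r_bar ** 2) ** 2 / (n_bar - 1)
    # Percentage of observed variance explained by sampling error;
    # it can exceed 100 when the observed variance is smaller than expected.
    pct_ve = 100 * var_err / var_obs if var_obs > 0 else float("inf")
    # Residual SD and 80% credibility interval around the mean.
    sd_res = sqrt(max(var_obs - var_err, 0.0))
    ci80 = (r_bar - 1.28 * sd_res, r_bar + 1.28 * sd_res)
    return {
        "k": k,
        "r_bar": r_bar,
        "sd_r": sqrt(var_obs),
        "pct_ve": pct_ve,
        "ci80": ci80,
    }


# Hypothetical example: two data points with N = 100 each.
result = bare_bones([0.4, 0.6], [100, 100])
```

A low %VE means the data points differ from each other more than sampling error would predict, which is why it is read as a sign of moderators or outliers.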

The very low percentage of variance explained suggests that there are strong moderators. The first meta-analysis consisted of data points of r(d x g) in the age range 14-30. Since we had just four data points with an average age of 14, and 30 data points with an age range of 17-30, we decided to exclude these four data points (see Table 8) from our second meta-analysis (Figure 4.2). In other words, we tested age as a moderator.

The meta-analytical outcomes in Table 2 show a correlation between d x g of .48 and a percentage explained variance of 54. The increase in percentage variance explained clearly shows that age acted as a moderator.

As the next step, we decided to exclude the data points using the study of Vigneau and Bors (2005) (see Table 8) for d x g, g x g, and d x d because the sample size of the White group from Vigneau and Bors (2005) deviated too much from the Black samples with which it was compared. The reason we did not exclude these data points in the first place is the general principle of a Schmidt and Hunter-style psychometric meta-analysis to include all data, and to exclude extreme outliers and outliers at a later stage. Excluding the data points using the Vigneau and Bors (2005) study left 28 data points for the third meta-analysis, and these data points are shown in Figure 4.3.

The meta-analytical outcomes in Table 3 show a correlation between d x g of .58 and a percentage explained variance of 81.

For the fourth and last meta-analysis using the original g loadings we excluded two further outliers (see Table 8). The data points of the Whites and Indians, and Indians and Blacks from Rushton, Skuy, and Bons (2004) correlated negatively with the g loadings of the White sample from Rushton et al. (2004). Since both the negative correlations of d x g came from the same sample we decided to exclude that sample from our fourth meta-analysis. This left a total of 26 combinations of d x g, as displayed in Figure 4.4.

The meta-analytical outcomes in Table 4 show a correlation between d x g of .62 and a percentage explained variance of 104. Leaving out the outliers strongly increased the amount of variance explained.

The next series of three meta-analyses was conducted using aggregated g loadings. We did not correct for the second statistical artifact, the reliability of the vector of g loadings, because the aggregated g that we use for both the SPM and the APM is based on such a large sample that it has virtually perfect reliability.

For the first of these three meta-analyses we included all the combinations of d x aggregated g. This resulted in the 34 data points shown in Figure 5.1.

The meta-analytical outcomes in Table 5 show a correlation between d x aggregated g of .93 and a percentage explained variance of 12.

Since the percentage explained variance of the 34 data points was quite low, we decided to test for moderators. For the second of this series of meta-analyses we excluded the samples where the age of the participants was lower than 17. This excluded four data points of d x aggregated g (see Table 8), leaving the 30 data points displayed in Figure 5.2.

The meta-analytical outcomes in Table 6 show a correlation between d x aggregated g of .62 and a percentage explained variance of 13.

The percentage explained variance for these 30 data points is still very low. Inspection of the scatter plot shows there are outliers that cause the low amount of explained variance. The two data points based upon data from the study from Vigneau and Bors (2005) were identified as the extreme outliers. For the third and final meta-analysis of this series we excluded these outliers (see Table 8). This resulted in a total of 28 combinations of d x aggregated g displayed in Figure 5.3.

The meta-analytical outcomes in Table 7 show a correlation between d x aggregated g of .83 and a percentage explained variance of 67.

The value of the correlation calculated with the aggregated g is substantially higher than the value of the correlation calculated with the original g (.83 versus .62, respectively). We will elaborate on this in the following section.

**Discussion**

Spearman’s Hypothesis states that group differences in IQ scores are a function of the cognitive complexity of these IQ scores. In order to answer the question whether group differences in intelligence are on the g factor, we conducted meta-analyses using data of both Raven’s Standard Progressive Matrices and Raven’s Advanced Progressive Matrices. Both tests measure the construct of general intelligence quite well and are thus well suited to answer our question. We tested Spearman’s Hypothesis by correlating g loadings of items with the group differences on the items; a high correlation indicates that group differences can be explained by the cognitive complexity of the items. To test if there are differences we used groups of people from different ethnic backgrounds.

Differences in IQ for individuals within one homogeneous group have been shown to be strongly heritable (Jensen, 1998). This is an important finding, because general intelligence is the key predictor of job and educational performance. However, there is no consensus on the degree to which differences between groups are heritable. Te Nijenhuis and Grimen (2007) showed there is a meta-analytical correlation of 1 between g loadings and heritabilities of subtests of an IQ battery. So, a high meta-analytical correlation between g loadings and group differences would strongly suggest there may be a substantial genetic component to group differences.

Based on a large number of studies we found a very strong meta-analytical correlation between g loadings of items and group differences on items. Also, the amount of variance explained in the data points in the meta-analysis is quite high. As a previous meta-analysis has already shown that g loadings and heritabilities are virtually interchangeable, the high correlation between g loadings and group differences in the present meta-analysis is in line with the hypothesis of a substantial genetic component in group differences.

First, we calculated the g loadings based upon individual studies with the d scores. This yielded a correlation, after correcting for statistical artifacts, between d x g of .62. However, when we used an aggregated g based on all relevant data we found a much higher correlation of .83. So, there is a difference in meta-analytical outcomes between using the g loadings from individual studies and the aggregated g loadings. When we take the aggregated g loadings based on a very large total sample as the best estimate, it suggests our estimates based on individual studies are systematically too low.

The correlation between d and our aggregated g of .83 is strong, indicating that the difference found in item scores between Blacks, Indians, Roma, and Whites can be strongly explained by the factor of general intelligence. This leaves little room for alternative interpretations, such as cultural bias.

Jensen (1998) has shown that when Black and White children are matched on g, Black children outscore White children on subtests of short-term memory and White children outscore Black children on certain spatial subtests. When there are several of these subtests in a test of Spearman’s hypothesis, they can be considered outliers, and outliers have the well-known effect of lowering the correlation. Te Nijenhuis and Dragt (2010) reported a meta-analytical correlation of .91, but taking into consideration that many of the datasets in their meta-analysis included subtests measuring short-term memory and specific spatial abilities, the correlation between true d and true g without these two outliers is virtually indistinguishable from a correlation of 1.00. So, instead of concluding there is only little room left for cultural explanations of group differences, one could argue there is no room or virtually no room left for cultural explanations.

The same reasoning could be applied to the present psychometric meta-analysis at the item level. Lynn, Allik, and Irwing (2004) convincingly show that the Raven’s, besides g, also measures 1) Gestalt continuation, 2) Verbal-analytical reasoning, and 3) Visuospatial ability. Insofar as there are group differences in the scores on these factors, in tests of Spearman’s hypothesis the relevant items will function as outliers, and will lower the meta-analytical correlation. Follow-up research is necessary to test this hypothesis.

During this project we came across a lot of challenges, mostly of a statistical nature. We consulted renowned researchers in our effort to find the optimal ways to calculate g loadings. Eventually we came up with a technique, as specified in Method, that we strongly believe is the correct way to calculate these g loadings. A further statistical issue was how researchers normalized their data. In most of the articles it is simply stated that the data were normalized to standard scores, not specifying how this was done. In our opinion this confusion ends when researchers clearly describe this normalization. The accessibility of this statistically complex research will increase by explaining every single step.

You are on a roll. I notice you’ve been mapped on the Dark Enlightenment. Not that I would have anything to do with those guys.

There are a lot of studies I like. I post this one so that in the future I can rely on it if I need to replicate one analysis or two using meta-analysis. I think the test of Spearman’s hypothesis is really important, as the confirmation of such a hypothesis would discredit the cultural hypothesis, because it means that BW gaps have a common cause, and not different causes.

This aside, I don’t know much about the Dark Enlightenment. I have read only a few articles. But I know it has something to do with HBD stuff. Oh, and speaking of ethnic diversity, I find the “Richwine story” very intriguing, and fascinating. Someone must have cast a curse on him. Everyone wants his head.

Mainstream journalists are not allowed to publicly discuss these possibilities in America. That’s why I support an across-the-board moratorium on immigration from all countries until we can assimilate the 70 million or so foreign-born minorities and their children who are already here.

Are you familiar with the late Lawrence Auster’s classic essay on immigration and multiculturalism? http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CDAQFjAA&url=http%3A%2F%2Fjtl.org%2Fauster%2FPNS.pdf&ei=NqesUdDqJIHo8gSh-oGgDw&usg=AFQjCNECViZc6_tWimUvZNXSJRAvjb-ewQ&sig2=UEpJgECpmYa9FepHK5Okjw