David A. Cole

Vanderbilt University

Scott E. Maxwell

University of Notre Dame

R. M. Baron and D. A. Kenny (1986) provided clarion conceptual and methodological guidelines for testing mediational models with cross-sectional data. Graduating from cross-sectional to longitudinal designs enables researchers to make more rigorous inferences about the causal relations implied by such models. In this transition, misconceptions and erroneous assumptions are the norm. First, we describe some of the questions that arise (and misconceptions that sometimes emerge) in longitudinal tests of mediational models. We also provide a collection of tips for structural equation modeling (SEM) of mediational processes. Finally, we suggest a series of 5 steps when using SEM to test mediational processes in longitudinal designs: testing the measurement model, testing for added components, testing for omitted paths, testing the stationarity assumption, and estimating the mediational effects.

Tests of mediational models have been an integral component of research in the behavioral sciences for decades. Perhaps the prototypical example of mediation was Woodworth’s (1928) S-O-R model, which suggested that active organismic processes are responsible for the connection between stimulus and response. Since then, propositions about mechanisms of action have pervaded the social sciences. Psychopathology theory and research are no exceptions. For example, Kanner, Coyne, Schaefer, and Lazarus (1981) proposed that minor life events or hassles mediate the effect of major negative life events on physical and psychological illness. A second example was the proposition that norepinephrine depletion mediates the relation between inescapable shock and learned helplessness (e.g., Anisman, Suissa, & Sklar, 1980; cf. Seligman & Maier, 1967). A third example is the drug gateway model, which suggested that the connection between initial legal drug use and hard drug use is mediated by regular legal drug use and exposure to a drug subculture (Ellickson, Hays, & Bell, 1992). A fourth example involves an elaboration of the transgenerational model of child maltreatment, positing that parents who were abused as children are more likely to abuse their own children because of the parents’ unmet needs and unrealistic expectations (Steele & Pollack, 1968; Twentyman & Plotkin, 1982). The list goes on.

In all of these examples, a mediator is a mechanism of action, a vehicle whereby a putative cause has its putative effect. The causal chain that must pertain for a variable to be a mediator is most easily described with the aid of path diagrams. For cross-sectional data, Model 1 (in Figure 1) depicts the hypothetical situation in which the mediator variable M at least partially explains the causal relation between the exogenous variable X and the outcome variable Y. As such, M is endogenous relative to X, but exogenous relative to Y. This path diagram is a pictorial representation of two related regression equations. When X, M, and Y are standardized, these equations are M_t = a’X_t + e_M and Y_t = b’M_t + c’X_t + e_Y, where all data are collected at the same time (t). Path coefficients a’, b’, and c’ are standardized regression weights. (We use primes after the path coefficients to indicate that they derive from a cross-sectional research design and to distinguish them from a, b, and c, which are their counterparts in longitudinal designs.) Finally, e_M and e_Y are the residuals (i.e., that part of M and Y not explained by their respective sets of upstream variables). **[1]**
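To make the two regression equations concrete, here is a minimal sketch that simulates standardized data for Model 1 and recovers the three cross-sectional paths by least squares. The population values (a’ = .5, b’ = .4, c’ = .2) are assumed purely for illustration and do not come from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed (illustrative) population path values for Model 1.
a_true, b_true, c_true = 0.5, 0.4, 0.2

# Generate standardized X, then M and Y from the two structural equations,
# scaling each residual so that every variable has unit variance.
X = rng.standard_normal(n)
M = a_true * X + np.sqrt(1 - a_true**2) * rng.standard_normal(n)
resid_var = 1 - (b_true**2 + c_true**2 + 2 * a_true * b_true * c_true)
Y = b_true * M + c_true * X + np.sqrt(resid_var) * rng.standard_normal(n)

# Path a': regress M on X.  Paths b' and c': regress Y on M and X jointly.
a_hat = np.linalg.lstsq(X[:, None], M, rcond=None)[0][0]
b_hat, c_hat = np.linalg.lstsq(np.column_stack([M, X]), Y, rcond=None)[0]
```

With a large sample, the estimates land close to the assumed population values, mirroring the two-equation formulation above.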

In path analysis, we commonly refer to three types of effects: total effects, direct effects, and indirect effects. The total effect is the degree to which a change in an upstream (exogenous) variable such as X has an effect on a downstream (endogenous) variable such as Y. A direct effect is the degree to which a change in an upstream (exogenous) variable produces a change in a downstream (endogenous) variable without “going through” any other variable. Thus the direct effect of X on M is represented by Path Coefficient a’; the direct effect of M on Y is Path b’; and the direct effect of X on Y is Path c’. An indirect effect is the degree to which a change in an exogenous variable produces a change in an endogenous variable by means of an intervening variable. In Model 1, X has an indirect effect on Y through M. Given that the variables are standardized, the indirect effect of X on Y through M is equal to the product of associated paths, a’b’. In the current example, there is only one indirect effect; however, if there were more than one route (or tracing) through intervening variables, the overall indirect effect would equal the sum of the product terms representing each of the tracings (see Kenny, 1979, for an excellent review of the tracing rules in path analysis). Finally, the total effect is simply the overall effect that an exogenous variable has on an endogenous variable whether or not the effect runs through an intervening variable. The total effect equals the direct effect plus all indirect effects. In Model 1 the total effect of X on Y equals c’ + a’b’.
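The tracing rule reduces to sums of products, which a few lines of code make explicit. The path values below are assumed, not taken from the article.

```python
from math import prod

def total_effect(direct, tracings):
    """Total effect = direct effect + the sum, over all tracings through
    intervening variables, of the product of the path coefficients."""
    return direct + sum(prod(t) for t in tracings)

# Model 1 with assumed paths a' = .5, b' = .4, c' = .2:
indirect = prod((0.5, 0.4))              # a'b' = .20
total = total_effect(0.2, [(0.5, 0.4)])  # c' + a'b' = .40
```

With more than one tracing, each tracing contributes one product term to the sum, exactly as the text describes.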

According to Kenny et al. (Baron & Kenny, 1986; Judd & Kenny, 1981; Kenny, Kashy, & Bolger, 1998; see also MacKinnon & Dwyer, 1993), a variable serves as a mediator under the following conditions. First, X has a direct effect on M (i.e., a’ ≠ 0). Second, M has a direct effect on Y, controlling for X (i.e., b’ ≠ 0). Third, if M completely mediates the X-Y relation, the direct effect of X on Y (controlling for M) must approach zero (i.e., c’ -> 0). Alternatively, if M only partially mediates the relation, c’ may not approach zero. Nevertheless, an indirect effect of X on Y through M must be present (i.e., a’b’ ≠ 0). In the social sciences, c’ never completely disappears, leading MacKinnon and Dwyer (1993) to recommend computing the proportion of the total effect that is explained by the mediator (in this case, a’b’/(c’ + a’b’)). **[2]**
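MacKinnon and Dwyer’s proportion can be computed directly. A sketch, with assumed path values:

```python
def proportion_mediated(a, b, c):
    """a*b / (c + a*b): the share of the total effect carried by the
    mediator.  As Shrout and Bolger (2002) note, the ratio can exceed
    1.0 when c is negative."""
    indirect = a * b
    return indirect / (c + indirect)

# Assumed paths: indirect effect .20 out of a total effect of .40,
# so half of the total effect runs through the mediator.
share = proportion_mediated(0.5, 0.4, 0.2)
```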

These formulations and their associated statistical tests (Baron & Kenny, 1986; Sobel, 1982) have been enormously helpful in guiding researchers who study mediational models. These procedures are limited, however, in that they do not provide explicit extensions to longitudinal designs. This is unfortunate, in that mediation is a causal chain involving at least two causal relations (e.g., X->M and M->Y), and a fundamental requirement for one variable to cause another is that the cause must precede the outcome in time (Holland, 1986; Hume, 1978; Sobel, 1990).

Inferences about causation (and hence mediation) that derive from cross-sectional data teeter on (often fallacious) assumptions about stability, nonspuriousness, and stationarity (Kenny, 1979; Sobel, 1990). We submit that without clear methodological guidelines for longitudinal tests of mediational hypotheses, problematic procedures will emerge and potentially erroneous conclusions will be drawn. Consequently, the goals of the current article are to review some of the more common problems, to highlight some of the possible consequences, and to propose a procedure that extends Kenny et al.’s cross-sectional methodology to the longitudinal case.

In this article, we limit ourselves to the examination of changes in individual differences over time. Thus our focus is on traditional regression-based designs (following in the Baron & Kenny, 1986, tradition) in general, and on linear relations in particular. We acknowledge alternative conceptualizations that focus on latent growth curves (Curran & Hussong, 2003; Rogosa & Willett, 1985) and individual differences in change (Bryk & Raudenbush, 1992). For an example, see Chassin, Curran, Hussong, and Colder’s (1996) study on the effect of parent alcoholism on growth curves in adolescent substance use. We also acknowledge longitudinal models that combine latent state and latent trait variables, as described by Kenny and Zautra (1995) and by Windle and Dumenci (1998). Finally, we note the importance of randomized experimental designs in testing mediational and other causal relations. In the current article, however, our focus is on observational data, such as might be gathered when the actual manipulation of variables is impractical or unethical.

**[1]** For heuristic purposes, we have frequently used examples in which the measures are standardized. In general, investigators should use unstandardized variables when conducting structural equation modeling (Tomarken & Waller, 2003). This is especially true when one is testing the equality of various parameters, as occurs when testing the stationarity assumption.

**[2]** Shrout and Bolger (2002) point out that c’ can be negative, in which case the proportion can exceed 1.0, either in a sample or in the population. Previous work by MacKinnon, Warsi, and Dwyer (1995) suggests that quite large samples (e.g., > 500) are often needed to obtain estimates of this proportion with acceptably small standard errors.

**Some Common Questions**

In our review of the clinical and psychopathology literatures, we found a substantial number of studies that purport to test mediational models with longitudinal data. Interestingly, we found almost as many methods as articles and almost as many problems as methods. In this section, we highlight some of the more common questions and problems that arise. Our goal is not to cast aspersions on particular research groups; rather, we seek to provide better guidelines for future research. (Indeed, some of our own studies exemplify one of the problem areas.) In fact, we found no study that avoided all of the potential pitfalls.

Here we use three ostensibly similar terms that refer to different types of change over time: stability, stationarity, and equilibrium. First, Kenny (1979) stated that stability “refers to unchanging levels of a variable over time” (pp. 231–232). For example, children’s vertical physical growth ceases at about age 20, at which point the variable height exhibits stability. Second, Kenny noted that stationarity “refers to an unchanging causal structure” (p. 232). In other words, stationarity implies that the degree to which one set of variables produces change in another set remains the same over time. For example, if nutrition were more important for physical growth at some ages than at others, the causal structure would not exhibit stationarity. Researchers interested in causal modeling often assume stationarity; however, we should note that changes in causal relations over time can have substantive implications in their own right. Third, Dwyer (1983) stated that equilibrium refers to a causal system that displays “temporal stability (or constancy) of patterns of covariance and variance” (p. 353). In the previous example, the system is at equilibrium when the cross-sectional variances and covariance (and thus the correlation) of nutrition and physical growth are the same at every point in time. **[3]**

Somewhat surprisingly, a system that exhibits stationarity is not necessarily at equilibrium. For example, in Figure 2 the reciprocal effects between X and Y have just begun. (Note that there is no Time 1 correlation.) Nevertheless, the system instantly manifests stationarity, in that the magnitudes of the causal paths are identical at every lag. The system is not at equilibrium during the early waves, however, because the within-wave correlation continues to increase. At about Wave 5, the within-wave correlation converges to a relatively constant value, .40, from which it will never change (unless there is a change in the causal parameters). At this point, the system has reached equilibrium. **[4]**
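The distinction can be verified with a small covariance recursion. In this sketch the cross-lagged path values are assumed (chosen so that the limiting correlation happens to be .40, as in Figure 2); the causal structure is identical at every lag, yet the within-wave correlation keeps rising before leveling off.

```python
import numpy as np

# Assumed cross-lagged system: autoregressive path s and cross path c
# are identical at every lag (stationarity).
s, c = 0.6, 0.2
A = np.array([[s, c],
              [c, s]])      # one-lag transition matrix for (X, Y)
Q = np.eye(2)               # independent unit-variance disturbances

cov = np.eye(2)             # Wave 1: unit variances, zero correlation
corrs = []
for wave in range(50):
    corrs.append(cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1]))
    cov = A @ cov @ A.T + Q  # propagate the covariance matrix one wave

# corrs rises from 0 and converges to .40: the system is stationary
# throughout, but not at equilibrium until the correlation stops changing.
```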

**[3]** In the time-series literature, these three concepts are combined under a single term, “stationarity,” referring to constancy of means, causal processes, and variances– covariances over time.

**[4]** It is also possible for a nonstationary system to be in equilibrium, although such processes are more difficult to exemplify.

**Question 1: What Do Cross-Sectional Studies Tell Us About Longitudinal Effects?**

We might all agree that longitudinal designs enable us to test for mediation effects in a more rigorous manner than do cross-sectional designs; however, most of us would like to think that cross-sectional evidence of mediation surely tells us something about the longitudinal case. In fact, cross-sectional designs are often used to justify the added time and expense of longitudinal studies. In reality, testing mediational hypotheses with cross-sectional data will be accurate only under fairly restrictive conditions. Furthermore, estimating mediational effect sizes will be accurate only under even more restrictive circumstances. When these conditions do not pertain, cross-sectional studies provide biased and potentially very misleading estimates of mediational processes (see Gollob & Reichardt, 1985).

Let us imagine the longitudinal variable system depicted by Model 2 in Figure 1. In this model, X at time t is a function only of X at time t-1 and error: X_t = xX_{t-1} + ε_{Xt}. The mediator M is a function of prior M and prior X: M_t = mM_{t-1} + aX_{t-1} + ε_{Mt}. Similarly, the outcome Y is a function of prior Y and prior M: Y_t = yY_{t-1} + bM_{t-1} + ε_{Yt}. Thus, Path a represents the effect of X_{t-1} on M_t, controlling for M_{t-1}. Likewise, Path b represents the effect of M_{t-1} on Y_t, controlling for Y_{t-1}. As there is no direct effect of X on Y, M completely mediates the X->Y relation. In this model, we assume that two simplifying conditions pertain: (1) the processes are stationary (i.e., the causal parameters are constant for all time intervals of equal duration), and (2) the system has reached equilibrium (i.e., the cross-sectional variances and covariances of X_t, M_t, and Y_t are constant for all values of t).

Now let us imagine that we have data only at time t. In other words, we have three cross-sectional correlations with which to address the mediational hypothesis. The critical question is: When complete longitudinal mediation truly exists (as in Model 2), can we ever expect cross-sectional data to show that M completely mediates the relation between X and Y? Complete cross-sectional mediation occurs when c’ = 0 in Model 1, in which case ρ_{XY} equals a’b’, which implies that ρ_{XY} = ρ_{XM}ρ_{MY}. If complete longitudinal mediation truly exists and the system has reached equilibrium, the equation for complete cross-sectional mediation (i.e., ρ_{XY} = ρ_{XM}ρ_{MY}) holds under only three circumstances: (1) the trivial case when a = 0 or b = 0, (2) the unlikely case when x = 0, or (3) the peculiar case when X and M are equally stable over time (i.e., ρ_{X_tX_{t-1}} = ρ_{M_tM_{t-1}}). (Proof of these conditions is available from the first author upon request.) Under all other circumstances, the XY correlation will not be explained by the XM and MY correlations, even when M completely mediates the X-Y relation in Model 2.

Even when these (very) special conditions do hold, there is no guarantee that the cross-sectional paths a’ and b’ will accurately represent their longitudinal counterparts, a and b. Given the same assumptions described above, a’ will equal a and b’ will equal b only when m=y=(1-x)/x. (Proof of these conditions is also available from the first author upon request.) To show how limiting this condition is, we graphed this function in Figure 3. Only when the values of x, m, and y fall exactly on the plotted line will a’=a and b’=b. When the values for x, m, and y fall above the line in the vertically shaded area, the cross-sectional correlations will overestimate the longitudinal effects a and b. When the values for x, m, and y fall below the line in the horizontally shaded area, the cross-sectional correlations will underestimate Paths a and b. In sum, the conditions under which cross-sectional data accurately reflect longitudinal mediational effects would seem to be highly restrictive and exceedingly rare. **[5]**
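A simulation makes the bias concrete. In this sketch (all parameter values assumed, not taken from the article), M completely mediates the longitudinal X->Y relation, and X is deliberately made more stable than M so that none of the three special circumstances holds. A cross-sectional Model 1 fit at equilibrium then yields a clearly nonzero c’ even though X has no direct effect on Y at any lag.

```python
import numpy as np

rng = np.random.default_rng(1)
n, waves = 200_000, 60

# Assumed Model 2 parameters: complete longitudinal mediation (no direct
# X -> Y path), with X more stable than M.
x, m, y = 0.8, 0.2, 0.5   # autoregressive paths (assumed values)
a, b = 0.4, 0.4           # longitudinal paths X -> M and M -> Y (assumed)

X = rng.standard_normal(n)
M = rng.standard_normal(n)
Y = rng.standard_normal(n)
for _ in range(waves):     # run the process until it is near equilibrium
    X, M, Y = (x * X + rng.standard_normal(n),
               m * M + a * X + rng.standard_normal(n),
               y * Y + b * M + rng.standard_normal(n))

# Cross-sectional Model 1 fit at the final wave: regress Y on M and X.
b_cs, c_cs = np.linalg.lstsq(np.column_stack([M, X]), Y, rcond=None)[0]
# c_cs is far from zero despite complete longitudinal mediation.
```

Under these assumed values, the spurious cross-sectional "direct effect" c’ comes out comparable in size to b’, illustrating how misleading the cross-sectional fit can be.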

An important aside. Simply allowing a time lag between the predictor and the outcome is not sufficient to make a’ and b’ unbiased estimates of a and b, respectively. One of the most important potential benefits of longitudinal designs is the opportunity to control for an almost ubiquitous “third variable” confound, prior levels of the dependent variable (Gollob & Reichardt, 1991). When predicting a dependent variable at time t from an independent variable at time t-1, we cannot use regression to infer causation if there are any unmeasured and uncontrolled exogenous variables that correlate with the predictor variable and cause the dependent variable. In most longitudinal designs, prior levels of the dependent variable (at time t-1) represent such a variable. Without controlling for such potential confounds, we will obtain spuriously inflated estimates of the causal path of interest. In mediational models, M_{t-1} must be controlled when predicting Mt, and Y_{t-1} must be controlled when predicting Yt.

Modeling tip. Structural equation modeling (SEM) cannot atone for shortcomings in the research design. To make the kinds of causal inferences implied by a mediational model, the researcher should collect data in a fashion that allows time to elapse between the theoretical cause and its anticipated effect. Ideally, the researcher will collect data on the cause, the mediator, and the effect at each of these time points (or waves). Such data allow the investigator to implement statistical controls for prior levels of the dependent variables using SEM.

**[5]** Gollob and Reichardt (1991) described a creative longitudinal model that can be tested with cross-sectional data (see Question 5). Their approach nicely forces the investigator to make explicit a rather large number of assumptions.

**Question 2: When Is a “Third Variable” Really a Mediator?**

The discovery that some third variable, Z, statistically explains the relation between X and Y is not sufficient to make it a mediator. The mere fact that a nonzero correlation between X and Y goes to zero when some variable Z is covaried does not mean that Z “mediates” the X->Y relation. In Models 3 and 4, controlling for Z can completely eliminate the direct effect of X on Y, but in neither situation does Z act as a mediator (see Figure 4). Sobel (1990) pointed out that a mediator must satisfy at least two additional requirements: Z must truly be a dependent variable relative to X, which implies that X must precede Z in time; and Z must truly be an independent variable relative to Y, implying that Z precedes Y in time.

A mediator cannot be concurrent with X. Time must elapse for one variable to have an effect on another. Disentangling the effects of concurrent, correlated predictors is an important enterprise, but it does not represent a test of a mediational model. Such situations frequently arise in comorbidity research. For example, Jolly et al. (1994) attempted to disentangle the effects of depression and anxiety on adolescent somatic complaints. In a series of regressions, they noted that the relation between depression and somatic symptoms was nonsignificant after controlling for anxiety. Concluding that anxiety “mediated” the relation between depression and somatic complaints, Jolly et al. interpreted their results in light of the tripartite model of negative affectivity (e.g., Watson & Clark, 1984). We argue that the use of the term “mediation” is misleading here because it implies that depression causes anxiety, a stipulation that is neither made by the tripartite model nor supported by the data. **[6]** The actual situation is represented by Model 4 (see Figure 4), in which the apparent relation between depression (X) and somatic symptoms (Y) is spurious. Anxiety (Z), which is correlated with X, is the actual cause of Y.

The timing of the measure may differ from the timing of the construct. The distinction between the measurement of the constructs and the constructs themselves is absolutely critical. Sometimes researchers appear to assume that measuring X before Y somehow makes it true that the construct X actually precedes Y in time. In many cases, the measure of Y (or M) actually assesses a condition that began long before the occurrence of X. For example, McGuigan, Vuchinich, and Pratt (2000) examined the degree to which parental attitudes about their infant (M) mediated the relation between spousal abuse (X) and child maltreatment (Y). Because they assessed risk for child maltreatment 6 months after they assessed parental attitudes, they concluded that child abuse must be a consequence of parental attitudes, not a cause (p. 615). The trouble is that their measure of Y (child abuse risk) included some potentially very stable factors such as the parents’ social support network and the stability of the home environment. In other words, their measure of Y may have tapped into a construct that predated both X and M.

Modeling tip. The causal ordering of X and M cannot be determined using cross-sectional data. Models 1, 3, and 4 cannot be empirically distinguished from one another. (Technically speaking, they are all just-identified.) In all three models, the apparent relation between X and Y is explained (to the same extent) by the third variable. Although these models are conceptually very different, they are mathematically equivalent to one another. In the context of a longitudinal design, however, such models are not equivalent. Indeed, tests are possible that at least begin to distinguish among them. For example, the causal effects (a, b, and c) that are implied by mediation are represented by longitudinal Model 5 in Figure 5, whereas alternative causal processes (Paths d, e, and f) can be represented by Model 6 (see Figure 5). Although these models are not hierarchically related to each other, they are nested under the fuller Model 7 (see Figure 5). The comparison of Models 6 and 7 tests the significance of Paths a, b, and c. The comparison of Models 5 and 7 tests the significance of Paths d, e, and f. Such comparisons help to clarify causal order and to identify concurrent causal processes in which the mediational model of interest may be embedded. (Later we present a more general framework for model comparisons, which allows for the testing and estimation of an even wider variety of effects.)
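When the models are fit with SEM software, such nested comparisons reduce to a chi-square difference (likelihood-ratio) test. A generic sketch follows; the fit statistics passed in are invented placeholders, not results from any model in the article.

```python
from scipy.stats import chi2

def chi_square_difference(chisq_nested, df_nested, chisq_full, df_full):
    """Likelihood-ratio test of a nested model against a fuller model:
    the chi-square difference is referred to a chi-square distribution
    with df equal to the difference in degrees of freedom."""
    diff = chisq_nested - chisq_full
    df_diff = df_nested - df_full
    return diff, df_diff, chi2.sf(diff, df_diff)

# Hypothetical fit values: comparing Model 6 with the fuller Model 7
# tests Paths a, b, and c jointly (a 3-df difference).
diff, df_diff, p = chi_square_difference(61.4, 30, 48.2, 27)
```

A small p value here would indicate that constraining Paths a, b, and c to zero significantly worsens fit.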

**[6]** Indeed, some research actually suggests the reverse may be true (Cole, Peeke, Martin, Truglio, & Seroczynski, 1997; Brady & Kendall, 1992).

**Question 3: What Are the Strengths and Weaknesses of the “Half-Longitudinal” Design?**

Many longitudinal studies rigorously test the prospective relation between M and Y, but examine only the contemporaneous relation between X and M. This is one manifestation of what we call a “half-longitudinal design.” For example, Tein, Sandler, and Zautra (2000) examined the capacity of psychological distress to mediate the effect of undesirable life events on various parenting behaviors. The study has much to commend it, including the control for prior parenting behavior and the rigorous examination of indirect effects. A weak point, however, is that the measures of negative life events (X) and perceived distress (M) were obtained concurrently. Consequently, in the assessment of the X->M relation, prior levels of M could not be controlled. In cases such as this, the effect of X on M will be biased, in part because X and M coincide in time and in part because prior levels of M were not controlled. Indeed, the nature of this bias is identical to that depicted in Figure 3.

Another manifestation of the “half-longitudinal design” tests the prospective relation between X and M, but examines only the contemporaneous relation between M and Y. Measures of M and Y are obtained concurrently. In one of our own studies (Cole, Martin, & Powers, 1997), we hypothesized that self-perceived competence (M) mediated the relation between appraisals by others (X) and the emergence of children’s depressive symptoms (Y). Testing the effect of others’ appraisals (X) on self-perceived competence (M) was longitudinal and rigorous, but the examination of the relation between self-perceived competence (M) and depression (Y) was cross-sectional. The mediator and the outcome variable were measured at the same time. In such cases (even though prior depression was statistically controlled), the estimate of the effect of M on Y will be biased.

Modeling tip. When the design has only two waves, all is not lost. Let us assume that X, M, and Y are measured at both times, as in the first two waves of Model 5 (in Figure 5). In such designs, we recommend a pair of longitudinal tests: (1) estimate Path a in the regression of M2 onto X1 controlling for M1 and (2) estimate Path b in the regression of Y2 onto M1 controlling for Y1. If we can assume stationarity, Path b between M1 and Y2 would be equal to Path b between M2 and Y3. Under this assumption, the Product ab provides an estimate of the mediational effect of X on Y through M. We submit that this approach is superior to the biased approaches typically applied to the half-longitudinal design. **[7]** Nevertheless, two shortcomings do emerge. First, although we can estimate ab, we cannot directly test the significance of Path c. In other words, we can test whether M is a partial mediator, but we cannot test whether M completely mediates the X-Y relation. Second, the stationarity assumption may not hold. If the stationarity assumption is false, the ab estimate will be biased. And without at least three waves of data, the stationarity assumption cannot be tested. Despite these shortcomings, we suspect that failing to control for prior levels of the dependent variables typically creates much greater problems than does failing to take into account violations of stationarity.
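The two recommended regressions can be sketched as follows. The population values are assumed purely for illustration, and for simplicity the Wave 1 scores are generated as independent.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Assumed population values for the first two waves of Model 5.
m, y = 0.5, 0.5           # autoregressive paths (assumed)
a, b = 0.4, 0.3           # mediational paths (assumed)

X1 = rng.standard_normal(n)
M1 = rng.standard_normal(n)   # Wave 1 scores: independent for simplicity
Y1 = rng.standard_normal(n)
M2 = m * M1 + a * X1 + rng.standard_normal(n)
Y2 = y * Y1 + b * M1 + rng.standard_normal(n)

# Step 1: Path a from regressing M2 on X1, controlling for M1.
a_hat = np.linalg.lstsq(np.column_stack([X1, M1]), M2, rcond=None)[0][0]
# Step 2: Path b from regressing Y2 on M1, controlling for Y1.
b_hat = np.linalg.lstsq(np.column_stack([M1, Y1]), Y2, rcond=None)[0][0]

ab = a_hat * b_hat   # estimate of the mediated effect, assuming stationarity
```

Note that each regression controls for the prior level of its dependent variable, which is the feature that distinguishes this approach from the biased half-longitudinal analyses described above.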

**[7]** Collins, Graham, and Flaherty (1998) stated that three waves are necessary to test mediation. However, they are dealing with transitions from one stage (category) to another, such as in stages of smoking. They essentially assume that the X variable influences M at the same time for everyone, and then at some later time M influences Y. They argued that three waves are needed, because a pair of waves is needed to assess the X to M influence, and a separate pair of waves is needed to assess the M to Y influence. This is sensible given their initial assumption that X starts the causal chain at the same time for everyone, such as in a randomized treatment design. Our perspective is more general in that we focus on ongoing process, whereas they assume that the process has not yet begun until X has been measured. As a consequence, a second wave is needed to observe the effect of X on M and then a third wave to observe the effect of M on Y.

**Question 4: How Important Is the Timing of the Assessments?**

In other disciplines (e.g., biology, chemistry, physics), researchers engage in substantial pilot research designed to assess the optimal time interval between assessments. In the social sciences, the timing of assessments seems to be determined more by convenience or tradition than by theory or careful research.

How the timing of assessments affects mediational research depends on the nature of the underlying causal relation. For the purposes of this discussion, we make two assumptions. First, we assume that a certain time interval (I) must elapse for one variable to have an effect on another. This implies that the causal relation will not be evident when the assessment interval is less than I. Second, for a specific causal relation (e.g., X->M or M->Y), we assume that the causal effect is the same for all such time intervals throughout the duration of the study (although the interval for one causal relation, I_{XM}, need not be the same for another, I_{MY}). In other words, the processes are stationary. These assumptions may not be appropriate for certain kinds of causal relations. Nevertheless, they are the assumptions that underlie most regression-based studies of causal modeling.

Under these assumptions, a rather counterintuitive situation arises: The assessment time interval that maximizes the magnitude of a simple causal relation (e.g., X->M or M->Y) is not typically the proper interval for estimating the mediational relation, X->M->Y, even when I_{XM} = I_{MY} = I. Furthermore, use of intervals other than I_{XM} and I_{MY} can seriously affect the estimation of mediational relations. To see how this is true, consider the models depicted in Figure 6. In the first case (in the upper panel), the interassessment time interval is exactly I, and the estimated effect of X on M is represented by X1->M2 (and X2->M3, X3->M4, etc.), which is exactly a. In the second case (middle panel of Figure 6), the interval is twice as long (2I). Consequently, the researcher cannot assess X1->M2 (one lag), but is compelled to evaluate two-lag relations (e.g., X1->M3). At first glance, this relation appears to be a single tracing; however, from the upper panel of Figure 6, we see that it actually consists of two tracings: am and ax. Hence, the X1->M3 relation is equal to am + ax. In the third case, the interval is 3I, and the causal effect X1->M4 becomes even more complex, am² + amx + ax², as shown in the lower panel of Figure 6. In general, the estimated causal effect of X on M will equal

a(Σ^{T}_{i=1} m^{i-1}x^{T-i}),

where T is the number of intervals of duration I separating the two assessments. The practical effect of timing can be remarkable. For example, let us assume that the top panel of Figure 6 is the correct model. In that model, we let the causal effect a = .2. To simplify the math, we also let x = m. In Figure 7, we consider the effect of increasing the time lag between assessments (from 1I to 10I) on our estimate of the X->M relation. There we see that the effect of timing on these estimates varies as a function of x (and m). The time interval that yields the largest X->M effect might be 1I, 2I, 3I, 4I, or 10I, depending on the stability of X and M. **[8]** For example, when x = m = .2, the maximum effect of X on M is found for a time interval of 1I; however, when X and M are more stable (.8), the maximum effect is not found until the time interval is four or five times longer! We hasten to add that we do not mean to imply that the interval that maximizes the effect is the “correct” interval and that all other intervals are incorrect. Instead, our intent is simply to demonstrate that the magnitude of the effect can vary greatly depending on the chosen interval. For this reason, researchers would often be well advised to report such effects for a variety of time intervals.
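The expression above is easy to tabulate. This sketch (with a = .2 and x = m, as in the Figure 7 discussion) shows that the lag maximizing the estimated X->M effect depends on the stability of the variables:

```python
def xm_effect(a, x, m, lag):
    """Estimated X -> M effect when assessments are `lag` intervals of I
    apart: a * sum over i = 1..lag of m**(i-1) * x**(lag-i)."""
    return a * sum(m ** (i - 1) * x ** (lag - i) for i in range(1, lag + 1))

a = 0.2
weak = [xm_effect(a, 0.2, 0.2, lag) for lag in range(1, 11)]    # x = m = .2
strong = [xm_effect(a, 0.8, 0.8, lag) for lag in range(1, 11)]  # x = m = .8

# With unstable variables the effect peaks at a lag of 1I; with stable
# variables the peak does not occur until a lag of 4I-5I (a tie, since
# the effect reduces to a * lag * m**(lag - 1) when x = m).
```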

When estimating mediational effects, however, the proper interval between assessments is always 1I. Gollob and Reichardt (1991) showed that estimates of mediational effects are almost always wrong when intervals larger than 1I are used. To demonstrate this, let us consider the upper panel of Figure 8, which contains a three-variable extension of Figure 6. In Figure 8, the five-wave model represents a completely mediational model in which assessments occur at intervals of 1I. The mediational effect from X1 to M2 to Y3 is simply ab; however, this is probably not the effect of greatest interest, especially if the researchers took the pains to continue the study for two more waves. Gollob and Reichardt defined this time-specific indirect effect as the degree to which M at exactly Time 2 mediates the effect of X at exactly Time 1 on Y at exactly Time 3. Most researchers would suggest, however, that mediation does not occur at a discrete point in time but unfolds over the course of the study. Therefore, most researchers will be more interested in the degree to which M at any time between Wave 1 and Wave 5 mediates the effect of X1 on Y5. Gollob and Reichardt dubbed this the overall indirect effect. The concept of overall effects is critical in longitudinal designs, and yet this methodology remains conspicuously underutilized in mediational studies. For these reasons, we digress to reintroduce Gollob and Reichardt’s terms and to describe their calculation in the Appendix.

In the five-wave model of Figure 8, we see that the overall indirect effect consists of six time-specific effects,

1. X1 -> X2 -> X3 -> M4 -> Y5 (abx²)

2. X1 -> X2 -> M3 -> Y4 -> Y5 (abxy)

3. X1 -> X2 -> M3 -> M4 -> Y5 (abmx)

4. X1 -> M2 -> M3 -> M4 -> Y5 (abm²)

5. X1 -> M2 -> M3 -> Y4 -> Y5 (abmy)

6. X1 -> M2 -> Y3 -> Y4 -> Y5 (aby²)

which must be summed:

abx² + abxy + abmx + abm² + abmy + aby²

= ab (x² + xy + mx + m² + my + y²). (1)

In other words, the overall indirect effect of X1->Y5 will be ab(x² + xy + mx + m² + my + y²) when the assessment interval is 1t.
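The six tracings can also be generated mechanically. The sketch below (ours, with arbitrary illustrative coefficient values) enumerates every path from X1 to Y5 in the five-wave model and confirms that the sum of the time-specific effects matches Equation 1:

```python
# Arbitrary illustrative path coefficients (not estimates from any real data).
a, b, x, m, y = 0.3, 0.4, 0.5, 0.6, 0.7

# Allowed one-interval steps in the completely mediational five-wave model
# of Figure 8 (upper panel), with their path coefficients.
step = {("X", "X"): x, ("X", "M"): a, ("M", "M"): m, ("M", "Y"): b, ("Y", "Y"): y}

def tracings(state, wave, T):
    """Products of coefficients along every path from `state` at `wave` to Y at wave T."""
    if wave == T:
        return [1.0] if state == "Y" else []
    out = []
    for (frm, to), coef in step.items():
        if frm == state:
            out.extend(coef * p for p in tracings(to, wave + 1, T))
    return out

effects = tracings("X", 1, 5)
print(len(effects))                                    # 6 time-specific indirect effects
print(sum(effects))                                    # overall indirect effect
print(a * b * (x**2 + x*y + m*x + m**2 + m*y + y**2))  # Equation 1 gives the same value
```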

The three-wave model (in the lower panel of Figure 8) represents the same variables, assessed at intervals that are twice as long as those in the five-wave model (i.e., 2t, not 1t). Our estimate of the overall indirect effect in this design is the product of the X1->M3 path (am+ax) and the M3->Y5 path (bm+by). Multiplying the terms reveals that

(am + ax) (bm + by) = abm² + abmx + abmy + abxy

= ab(m² + xy + mx + my). (2)

In other words, our estimate of the overall indirect effect of X1->Y5 will be ab(m² + xy + mx + my) when the assessment interval is 2t.

Comparing Equation 1 to Equation 2, we see that Equation 2 falls short of Equation 1 by exactly ab(x² + y²); the two are identical only in the unlikely case in which x = y = 0 and in the trivial case in which ab = 0. In summary, when the assessment interval is longer than 1t, the calculation of the overall indirect (or mediational) effect will misrepresent the true overall indirect effect to the extent that x and y are nonzero. Only by assessing X, M, and Y at intervals of 1t (no longer and no shorter) can we obtain accurate estimates of the mediational effect of interest.
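A small numeric comparison (ours; the coefficient values are arbitrary) makes the size of this bias concrete:

```python
def overall_indirect_1t(a, b, x, m, y):
    # Equation 1: assessments every 1t
    return a * b * (x**2 + x*y + m*x + m**2 + m*y + y**2)

def overall_indirect_2t(a, b, x, m, y):
    # Equation 2: assessments every 2t
    return a * b * (m**2 + x*y + m*x + m*y)

a, b = 0.3, 0.4
for x, m, y in [(0.2, 0.2, 0.2), (0.8, 0.5, 0.8)]:
    e1 = overall_indirect_1t(a, b, x, m, y)
    e2 = overall_indirect_2t(a, b, x, m, y)
    print(f"x={x}, m={m}, y={y}: 1t estimate {e1:.4f}, 2t estimate {e2:.4f}, "
          f"shortfall {e1 - e2:.4f} = ab(x^2 + y^2) = {a * b * (x**2 + y**2):.4f}")
```

As the stabilities x and y grow, the 2t design misses an increasingly large share of the true overall indirect effect.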

A related question. If the constructs represented by M or Y do not exist at the beginning of the study, can we safely assume they do not need to be statistically controlled? As an example of this situation, Emery, Waldron, Kitzmann, and Aaron (1999) examined the effect of parent marital variables (e.g., married–not married when having children) on child externalizing behavior. In this context, controlling for child externalizing behavior at Time 1 would require the ludicrous task of assessing externalizing symptoms at birth! Nevertheless, we must bear in mind that the dependent variable does exist at time points prior to the end of the study, albeit not as early as the beginning of the study. Every such time point generates another possible tracing between Time 1 marital status and the child outcome variable at time t. To ignore these points results in a test of the time-specific effect of X1 on Yt, not the overall effect over this time interval.

Modeling tip. Three specific recommendations for SEM emerge from this set of concerns. First, researchers interested in estimating mediational effects in longitudinal designs should first conduct studies designed to detect the time interval that must elapse for X to have an effect on M (time interval t_{MX}) and for M to have an effect on Y (time interval t_{YM}). Waves of assessment should be separated by these empirically determined time intervals. The second recommendation seems self-evident but is implemented only rarely: researchers should specify the developmental time frame over which the mediation supposedly unfolds. Furthermore, their studies and analyses should be designed to represent this period of time in its entirety. Third, we echo Gollob and Reichardt’s (1991) recommendation that researchers use the overall indirect effects, not just the time-specific indirect effects, to represent the mediational effect of interest.

**[8]** We should note that the optimal time interval between assessments is not necessarily the same as the overall duration of the study. The overall duration should reflect the time period of theoretical interest and may consist of multiple iterations of the optimal assessment interval.

**Question 5: When Are Retrospective Data Good Proxies for Longitudinal Data?**

In many attempts to test mediational models, researchers use retrospective measures of the exogenous variables. That is, researchers use data gathered at one point in time to represent the construct of interest at a prior point in time. For example, Andrews (1995) attempted to show that feelings of bodily shame mediated the effect of childhood physical and sexual abuse on the emergence of depression in adulthood. The measure of childhood abuse (X), however, was a retrospective self-report obtained at the end of the study, after the assessment of depression (Y). Given the likelihood that depression will affect memory, a retrospective measure of abuse history may well be biased by the same variable it purportedly predicts (Monroe & Simons, 1991).

The often cantankerous assumptions associated with this procedure are similar to those described for cross-sectional tests of mediation. First, the retrospective measure must be a remarkable proxy for the original construct. To the degree to which the retrospective measure (R) imperfectly represents the true exogenous variable X, the effect of X on either M or Y will be underestimated. (Such problems can be partially alleviated with the use of multiple retrospective measures of X.) Second, the retrospective measure cannot be directly affected by current levels of the underlying construct; that is, any relation between R and current levels of X must be due to the stability of X. Any direct relation between R and current levels of X will lead to the overestimation of the relation between exogenous X and other downstream variables. Third, the retrospective measure must not be affected by prior or concurrent levels of M or Y. As many retrospective measures rely on the participants’ memories (and knowing that many factors affect memory), such assumptions are frequently questionable. On the one hand, if the retrospective measure is positively affected by other variables in the study, the effect of X on either M or Y will be overestimated. On the other hand, if other variables impair the efficacy of the retrospective measure, the effect of X on M or Y may be underestimated. Faced with threats of both over- and underestimation, the investigator may retain little confidence in the mediational tests of interest.

Modeling tip. Our most fervent recommendation is that researchers avoid relying on retrospective measures. If this is impossible, however, the researcher should review Gollob and Reichardt’s (1987, 1991) procedures for fitting longitudinal models with cross-sectional data. Such procedures are especially helpful in clarifying the various assumptions that the researcher must make. Some (but not all) of these problems can be diminished with the use of multiple measures. Other problems involve assumptions that are themselves eminently testable; however, testing such assumptions requires longitudinal data.

**Question 6: What Are the Effects of Random Measurement Error?**

Most researchers are aware of the attenuating effects of random measurement error on the estimation of correlations. As the reliability of one’s measures diminishes, uncorrected correlations (between manifest variables) will systematically underestimate the true correlations (between latent variables). Some researchers may be tempted to rationalize such problems as errors of overconservatism: By underestimating true correlations, one may occasionally fail to reject the null hypothesis, but at least one is unlikely to commit the more egregious Type I error by rejecting the null hypothesis inappropriately. In tests of mediational models, however, the problems are more complex and insidious. Not only does measurement error contribute to the underestimation of some parameters, but it systematically results in the overestimation of others. Depending on the parameter in question, measurement error can increase the likelihood of both Type I and Type II errors in ways that render the mediational question almost unanswerable. This occurs, in part, because measurement errors in X, M, and Y have different effects on estimates of Paths a’, b’, and c’.

To present these effects most clearly, we use the hypothetical example depicted by Model 1 (see Figure 1). Let us imagine that in this example, where M completely accounts for the X-Y relation, the true correlations are ρXM = .80, ρMY = .80, and ρXY = .64. If we had perfect measures of these constructs, the standardized path coefficients would be a’ = .80, b’ = .80, and c’ = 0. In most cases, however, our measures will contain random error. Consequently, our observed correlations will not be as large as the true correlations, and our estimates of a’, b’, and c’ will be distorted. We examine the effects of measurement error one variable at a time in the context of this (perfect) mediational model.

When X is measured with error (but M and Y are not), Path a’ will be systematically biased. If ρXX is the reliability of our measure of X, Path a’ will be underestimated by a factor of √ρXX. As shown in Table 1, the resulting bias in Path a’ can be considerable, whereas the estimates of Paths b’ and c’ are utterly unaffected. **[9]** When Y is measured with error (but X and M are not), Path b’ becomes the underestimated path; however, Paths a’ and c’ remain unaffected (see Table 1). Taken together, the effects of unreliability in X and Y combine to underestimate dramatically the indirect effect a’b’ and diminish our power to detect its statistical significance.

Perhaps most interesting (or disturbing) is the effect of unreliability in the measurement of the mediator. When M is measured with error, Paths a’ and b’ are underestimated, but Path c’ is actually overestimated (see Kahneman, 1965). As shown in Table 1, unreliability in the measure of M creates a downward bias in our estimate of Path a’, an even larger downward bias in Path b’, and a substantial upward bias in Path c’; that is, the indirect effect a’b’ will be spuriously underestimated, and the direct effect c’ will be spuriously inflated (artificially increasing the chance of rejecting the null hypothesis). When the mediator is measured with error, all path estimates in even the simplest mediational model are biased. In more complex longitudinal tests of mediational models, investigators typically seek to control not just the mediator but prior levels of the dependent variable as well. Under such circumstances, the biasing effects of measurement error become utterly baffling.
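These biases can be reproduced directly from the classical attenuation formula. In the sketch below (our own illustration of the argument, not a reproduction of Table 1), observed correlations equal the true correlations of Model 1 multiplied by the square roots of the relevant reliabilities, and a’, b’, and c’ are the resulting standardized regression estimates:

```python
from math import sqrt

RHO_XM, RHO_MY, RHO_XY = 0.80, 0.80, 0.64   # true latent correlations (Model 1)

def path_estimates(rel_x=1.0, rel_m=1.0, rel_y=1.0):
    """a', b', c' when X, M, Y are measured with the given reliabilities."""
    # Attenuation: r_observed = rho_true * sqrt(rel_1 * rel_2)
    r_xm = RHO_XM * sqrt(rel_x * rel_m)
    r_my = RHO_MY * sqrt(rel_m * rel_y)
    r_xy = RHO_XY * sqrt(rel_x * rel_y)
    a = r_xm                                        # X -> M
    b = (r_my - r_xm * r_xy) / (1 - r_xm**2)        # M -> Y, controlling X
    c = (r_xy - r_xm * r_my) / (1 - r_xm**2)        # X -> Y, controlling M
    return a, b, c

print(path_estimates())             # perfect measurement: a' = b' = .80, c' = 0
print(path_estimates(rel_m=0.64))   # error in M alone: a' and b' shrink; c' inflates above 0
```

With reliability of .64 for M alone, a’ shrinks, b’ shrinks even more, and the truly null direct path c’ is inflated well above zero, exactly the pattern described in the text.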

Modeling tip. Few (if any) psychological measures are completely without error. Consequently, psychological researchers who are interested in mediational models almost always face problems such as these (Fincham, Harold, & Gano-Phillips, 2000; Tram & Cole, 2000). Fortunately, as various authors have noted, the judicious use of latent variable modeling represents a potential solution (Bollen, 1989; Kenny, 1979; Loehlin, 1998). When each variable is represented by multiple, carefully selected measures, the investigator can extract latent variables with which to test the mediational model. Assuming that each set of manifest variables contains congeneric measures of the intended underlying construct, these latent variables are without error. Therefore, estimates of Paths a’, b’, and c’ between latent X, M, and Y will not be biased by measurement error. Indeed, researchers are turning to latent variable SEM with increasing regularity in order to test their mediational hypotheses (e.g., Dodge, Pettit, Bates, & Valente, 1995).

Never is the selection of measures more important than in longitudinal designs, if only because the researcher must live with these instruments wave after wave. We urge researchers to obtain multiple measures of X, M, and Y (but especially M), for then the investigator can examine the relations among latent variables, not among manifest variables, using SEM. In the ideal case, such measures will involve the use of maximally dissimilar methods of assessment, reducing the likelihood that the extracted factor will contain nuisance variance. Of course, simply obtaining multiple measures is not sufficient (Cook, 1985). The measures must converge if they are to be used to extract a latent variable.

**[9]** In general, Path c will also be biased toward zero. In the current example of complete mediation, Path c is already zero and cannot be biased any further in that direction. In examples of partial mediation, however, unreliability in X will cause Path c to be underestimated. See Greene and Ernhart (1991) for examples.

**Question 7: How Important Is Shared Method Variance in Longitudinal Designs?**

Successful control of shared method variance begins with the careful selection of measures; it does not begin with data analysis. Investigators who implement post hoc corrections for such problems will at best incompletely control for shared method variance and at worst render their results hopelessly uninterpretable. In the ideal study, shared method variance would not exist, either because the measures contain no method variance or because every construct is measured by methods that do not correlate with one another. Unfortunately, neither of these scenarios is particularly likely in the social sciences. As an alternative, Campbell and Fiske (1959) proposed the multitrait–multimethod (MTMM) design, in which researchers strategically assess each construct with the same collection of methods. In its complete form, the design is fully crossed, with every trait assessed by every method. Several papers have described latent variable modeling procedures for the analysis of MTMM data (Cole, 1987; Kenny & Kashy, 1992; Widaman, 1985).

In longitudinal research, problems with shared method variance are almost inevitable. When assessing the same constructs at several points in time, researchers typically use the same measures. Therefore, the covariation between data gathered at two points in time will reflect both the substantive relations of interest and some degree of shared method variance. Marsh (1993) noted that failure to account for shared method variance in longitudinal designs can result in substantial overestimation of the cross-wave path coefficients of interest. On the one hand, if the investigator measured each construct with a set of methodologically distinct measures, advanced latent variable SEM (allowing correlated errors) provides a possible solution. On the other hand, if the investigator used measures that were methodologically similar to one another, shared method variance may be inextricably entwined with the constructs of interest. Even the most sophisticated data analytic strategy may be unable to sift the wheat from the chaff.

For example, Trull (2001) proposed that impulsivity and negative affectivity mediate the relation between family history of abuse (X) and the emergence of borderline personality disorder features (Y). By necessity, most of the constructs were assessed by various forms of self-report (e.g., interview, paper-and-pencil questionnaire). As a consequence, relatively few opportunities arose in which shared method variance could be modeled (and potentially controlled). Such residual covariance can inflate estimates of key path coefficients.

Modeling tip. Researchers should carefully select measures to allow for the systematic extraction of shared method variance using SEM. Three kinds of shared method variance can be extracted depending on the sophistication of the measurement model. In Figure 9, we represent three types of shared method variance as correlations between the disturbance terms for measures that use the same method (Kenny & Kashy, 1992). The first type consists of within-trait, cross-wave error covariance (e.g., Path s). Whenever the same measure is administered at more than one point in time, this type of shared method variance almost always exists. When a construct such as X, M, or Y is represented by the same set of measures at each time point, such shared method variance can be modeled by allowing correlations between appropriate pairs of disturbance terms.

An even more sophisticated measurement model could involve the addition of a within-wave MTMM structure. In other words, each construct at a given wave is represented by the same set of methods (e.g., clinical interview, behavioral observation, self-report questionnaire, standardized test, physiological assay, significant other reports). An example of such within-wave covariation is path u in Figure 9. For latent variable SEM to control for shared method variance using the correlated errors approach, the methods must be as dissimilar as possible. If the methods are correlated with one another, an approach such as Widaman’s (1985) use of oblique method factors must be implemented. If the MTMM structure is replicated over time, cross-wave/cross-trait error covariance can be extracted (see Cole, Martin, Powers, & Truglio, 1996, for an example). Path v in Figure 9 represents such covariation. In our experience, these path coefficients are often small. We do not suggest, however, that they can be ignored. That decision will vary from study to study and should be based on sound theoretical and empirical arguments.

Sometimes structural limitations of the model or empirical limitations of the data prevent the inclusion of paths that are clearly justifiable and anticipatable. In such cases, the researcher is forced to omit paths that (theoretically) are not negligible. Whenever nonzero paths are omitted, the estimates of other path coefficients will be biased. Indeed, one can argue that no path is ever truly zero and that some degree of bias is inevitable. If the effect sizes of the omitted paths are small (and the model is otherwise correctly specified), the resulting bias is typically small. Unfortunately, we often cannot truly know the magnitude of paths that could not be included in the model. In such cases, the researcher should frankly and completely discuss the degree to which their omission may have biased the results.

**Summary**

The preceding questions about mediational models bring to light only some of the more common threats to statistical conclusion validity. We hope that our tips suggest possible ways to negotiate these threats. We want to emphasize, however, that this list is far from complete, and the tips are hardly a panacea. The most important ingredients in the application of SEM to questions about mediational processes will always be an awareness of the problems that might exist, the ingenuity to apply the full range of SEM techniques to such problems, and the humility to acknowledge the problems that these methods cannot resolve.

**Structural Equation Modeling Steps**

In this section, we describe five general steps for the use of SEM in testing mediational effects with longitudinal data. Not every research design will allow all five of these steps. For example, Steps 1 and 2 are only possible when X, M, and Y are represented by multiple measures. In Steps 3 and 4, we recommend that there be at least three waves of data (to avoid the problems described in Questions 1 and 3). In all of these tests we require the use of unstandardized data. Standardized data will yield inaccurate parameter estimates, standard errors, and goodness-of-fit indices in many of the following tests (see Cudeck, 1989; Steiger, 2002; Willett, Singer, & Martin, 1998). Sometimes features of the data, the design, or the model make it impossible to conduct some of these steps. Under such circumstances, the investigator tacitly makes one or more assumptions, the violation of which jeopardizes the integrity of the results.

**Step 1: Test of the Measurement Model**

Everything hinges on the supposition that the manifest variables relate to one another in the ways prescribed by the measurement model. If the measurement model does not provide a good fit to the data (see Tomarken & Waller, this issue), we may not have measured what we intended. In such cases, clear interpretation of the structural model is impossible. To conduct this test, we begin with a model in which every latent variable is simply allowed to correlate with every other latent variable. The structural part of this model is completely saturated (i.e., the structural part of this model has zero degrees of freedom; it is just-identified). The overall model, however, is overidentified; it has positive degrees of freedom, which derive from constraints placed on the factor loadings and the covariances between the disturbance terms. Consequently, the test of the overall model tells us nothing about the structural paths, as none of them are constrained. Instead, the test of this model assesses the degree to which either of two types of measurement-related problems might exist.

First, it tests whether the manifest variables relate only to the latent variables they were supposed to represent. If the model provides a poor fit to the data, the problem might be that one or more of the measures loads onto a latent variable it was not anticipated to measure. Second, the model tests whether disturbance terms only relate to one another in the ways that have been anticipated. Whenever two measures share the same method (and certainly when the same measure is administered wave after wave), we urge researchers to allow (and test) correlations among their disturbances. (See Question 7 for examples.) If the model fits the data poorly, the problem might be that some of the manifest variables share nuisance variance in unexpected ways. Sometimes theory-driven modification of the measurement model will improve the fit; however, our opinion is that such post hoc model modification is undertaken far too often. Cavalier, empirically driven modifications frequently result in a model that may provide a good statistical fit at the cost of theoretical meaningfulness and replicability. Ultimately, if the measurement model fails to fit the data, one cannot proceed with tests of structural parameters. Alternatively, if this model fits the data well, it provides a basis (i.e., a full model) against which we can begin to compare more parsimonious structural models. **[10]**

**[10]** When all constructs are measured perfectly (e.g., sex), there may be no need to test a measurement model. In the social sciences, however, perfectly measured constructs are relatively rare, and models containing nothing but perfectly measured constructs are almost nonexistent.
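The overidentification claim in Step 1 can be verified by simple parameter counting. The helper below is our own sketch, assuming simple structure (each indicator loads on exactly one factor, one loading per factor fixed for identification, free error variances, and freely correlated factors):

```python
def measurement_df(n_indicators, n_factors, n_error_covs=0):
    """df = observed variances/covariances minus free parameters."""
    moments = n_indicators * (n_indicators + 1) // 2      # p(p+1)/2 data points
    loadings = n_indicators - n_factors                   # one loading per factor fixed to 1
    error_vars = n_indicators                             # one disturbance variance each
    factor_varcov = n_factors * (n_factors + 1) // 2      # all factors freely correlated
    return moments - (loadings + error_vars + factor_varcov + n_error_covs)

print(measurement_df(3, 1))    # one factor, three indicators: just-identified (df = 0)
print(measurement_df(9, 3))    # X, M, Y with three indicators each: overidentified
```

The positive degrees of freedom are what give the overall test its power to detect unexpected cross-loadings or disturbance covariances; each correlated error allowed under Question 7 spends one of them.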

**Step 2: Tests of Equivalence**

The second phase of analysis involves testing the equality of various parameters across waves. The first of these tests examines the equilibrium assumption. The second tests for factorial invariance across waves.

The latent variable system is in equilibrium if the variances and covariances of the latent variables are invariant from one wave to the next. The equilibrium of a system is testable within the context of the measurement model described in Step 1. This test involves the comparison of a previous (full) model to a reduced model in which the variances and covariances of the Wave 1 latent variables are constrained to equal their counterparts at every subsequent wave. (One must fix one factor loading per latent variable to a nonzero constant in order to identify this model.) If the comparison is significant, the equilibrium hypothesis is rejected: the assumption that the latent variable variances and covariances are constant across waves is not tenable. Either the causal parameters are not stationary across time, or the causal processes began only recently and have not had time to reach equilibrium.

Factorial invariance implies that the relation of the latent variables to the manifest variables is constant over time. Some longitudinal studies follow children across noteworthy developmental periods (e.g., Kokko & Pulkkinen, 2000; Mahoney, 2000; Waters, Hamilton, & Weinfield, 2000) or track families from generation to generation (e.g., Cairns, Cairns, Xie, Leung, & Hearne, 1998; Cohen, Kasen, Brook, & Hartmark, 1998; Hardy, Astone, Brooks-Gunn, Shapiro, & Miller, 1998; Serbin & Stack, 1998). Over such developmental or historical spans, the very meaning of the original variables can change. Such changes can complicate (if not completely confound) the interpretation of longitudinal results. One way to test for such shifts in meaning is to compare the preceding model with one in which the Wave 1 factor loadings are constrained to equal their counterparts at subsequent waves. (One must standardize the latent variables and release the previous constraint that selected loadings equal a nonzero constant in order to test this model.) When this comparison is significant, some or all of the factor loadings are not constant across waves. The meaning of either the manifest or the latent variables changes over the course of the study. (See Byrne, Shavelson, & Muthen, 1989, for approaches to coping with partial invariance.)
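Both comparisons in Step 2 are nested-model chi-square difference tests. A minimal sketch follows; the fit statistics are hypothetical, and the closed-form survival function used here is exact only for even degrees of freedom:

```python
from math import exp, factorial

def chi2_sf_even_df(stat, df):
    """P(chi-square with even df exceeds stat), via the exact Poisson series."""
    assert df % 2 == 0 and df > 0
    k = df // 2
    return exp(-stat / 2) * sum((stat / 2) ** i / factorial(i) for i in range(k))

def chi2_diff_test(chisq_reduced, df_reduced, chisq_full, df_full):
    """Compare a constrained (reduced) model with the full model it is nested in."""
    d_stat = chisq_reduced - chisq_full
    d_df = df_reduced - df_full
    return d_stat, d_df, chi2_sf_even_df(d_stat, d_df)

# Hypothetical fit statistics for the unconstrained measurement model and the
# equality-constrained (e.g., equilibrium) model:
d_stat, d_df, p = chi2_diff_test(chisq_reduced=112.4, df_reduced=60,
                                 chisq_full=98.1, df_full=48)
print(f"delta chi2({d_df}) = {d_stat:.1f}, p = {p:.3f}")
```

A nonsignificant difference means the equality constraints are tenable; a significant one rejects equilibrium (or invariance, depending on which parameters were constrained).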

**Step 3: Test of Added Components**

This step tests the possibility that variables not in the model could help explain some of the relations among variables that are in the model. This test involves the comparison of a full model to a reduced model. In the full model, three sets of structural paths are included: (1) Every upstream variable has a direct effect on every downstream latent variable, (2) all exogenous latent variables are allowed to correlate with one another, and (3) all residuals of latent downstream variables are allowed to correlate with one another within each wave. (Although this model looks quite different from the measurement model described in Step 1, the two models are actually equivalent; see Kline, 1998.) The reduced model is identical to the full model except that the residuals of the downstream latent variables are no longer allowed to correlate. Specifically, we compare the full model to a reduced model in which all correlations between residual terms for the endogenous latent variables are constrained to be zero. If this comparison is significant, some of the covariation among the downstream latent variables remains unexplained by the exogenous variables in the model. The existence of such unexplained covariation suggests that potentially important variables are missing from the model. These variables may be confounds such as those depicted in Figure 4. Ignoring such confounds may produce biased estimates of the causal parameters of interest. Identifying and controlling for such variables should become a focus for future research.

In reality, all models are incomplete. The finite collection of predictors included in any given study will never completely explain the covariation among the downstream variables. In other words, some omitted variable inevitably exists that will explain more of the residual covariation. Any failure to detect significant residual covariation is simply a function of power and effect size. The detection of significant (and sizable) residual covariance should lead investigators to search for and incorporate additional causal constructs in their models. Failure to detect significant residual covariation, however, should not be taken as license to abandon this search, nor should it lead investigators simply to drop all correlated disturbances from their models. Such practices perpetuate bias, in the long run as well as in the short run. We recommend that these covariances be estimated, tested, and retained in the model even if they appear to be nonsignificant.

**Step 4: Test of Omitted Paths**

This step examines the causal paths that are not construed as part of the mediational model of interest. This test involves the comparison of a full model to a reduced model in which selected causal paths are restricted to zero. The reduced model is identical to the full model except that all paths that are not part of a longitudinal mediational model have been eliminated. The structural part of the reduced model is the same as that in Model 2 (see Figure 1). If this comparison is significant, we learn that the reduced model is too parsimonious, implying that some of the paths that distinguish it from the full model are significant. Careful, theory-driven follow-up tests may be possible to determine which of these paths must be reinstated. In other words, the mediational model of interest (if it pertains) exists in a system of other causal relations that cannot be ignored without potentially biasing estimates of the mediational paths.

At least three specific follow-up tests are often of particular theoretical interest. One is the test for the possible existence of direct effects of X on Y. The existence of paths that connect X at one time to Y at some subsequent time, without going through M, suggests that M is only a partial mediator of the X->Y relation (at best). Given the complexities of psychological phenomena, we suspect that any single construct M will completely mediate a given relation only rarely, making this follow-up test particularly compelling. The second is the test for the presence–absence of wave-skipping paths. In Model 2, we have included only Lag 1 auto-correlational relations. However, more-complex models are often needed even to represent the relation of a variable to itself over time. The presence of Lag 2 (or greater) paths suggests the existence of potentially interesting nonlinear relations: The system may not be stationary, causal relations may be accelerating or decelerating, or the selected time lag between waves might not be optimal to represent the full causal effect of one variable on another. The third is to test for the presence–absence of “theoretically backward” effects. By this we do not mean effects that go backward in time, but effects that are backward relative to the theory that compelled the study (e.g., Y1->M2, M2->X3, Y1->X3). For decades, psychologists speculated about the causal effect of stressful life events on depression, sometimes to the exclusion of other possibilities, until the emergence of the stress-generation hypothesis by Hammen (1992). Compelling cases can often be made for “reverse” causal models. When the data are at our fingertips, why not conduct the test?

**Step 5: Estimating Mediational (and Direct) Effects**

Estimates and tests of specific mediational parameters can be conducted in the context of any of the preceding models. The optimal choice is the most parsimonious model that provides a good fit to the data. We recommend several steps.

1. Estimate the total effect of X1 on Y_{T}. As Baron and Kenny (1986) pointed out, the very idea that M mediates the X-Y relation is based on the premise that the X-Y relation exists in the first place. Thus a logical place to start is with the estimation of the total effect of X at Time 1 on Y at time T (where Time 1 and time T represent the beginning and end, respectively, of the time period covered by the study). This effect represents the sum of all nonspurious, time-specific effects of X1 on Y_{T}. This estimate represents the effect a one-unit change in X1 will have on Y_{T} over the course of the study. (One might also be interested in the total effect over only one part of the study, especially if the effect waxes or wanes over time. In such cases, time T represents the end of the interval of interest and not necessarily the end of the study.)

2. Estimate the overall indirect effect. The overall indirect effect of X1 on Y_{T} through M provides a good estimate of the degree to which M mediates the X-Y relation over the entire interval from Time 1 to time T, provided that the waves are optimally spaced and sufficient in number (see Question 4).

This overall indirect effect consists of the sum of all time-specific indirect effects that start with X1, pass through M_{i}, and end with Y_{T}, where 1 < i < T. (Such time-specific effects can also pass through X_{i} or Y_{i}, as long as 1 < i < T.) The number of such time-specific effects depends on the number of waves in the study and the number of added processes that exist (see Step 3 above). A five-wave model with no “added processes” has six time-specific indirect effects, as described under Question 4. We can interpret the overall indirect effect as that part of the total effect of X1 on Y_{T} that would disappear if we were to control for M at all points between Time 1 and time T.

3. Estimate the overall direct effect. The overall direct effect of X1 on Y_{T} is that part of the total effect of X1 on Y_{T} that is not mediated by M. The overall direct effect consists of the sum of all time-specific effects that start with X1 and end with Y_{T}, but never pass through M. Alternatively, the overall direct effect can be computed as the partial correlation between X1 and Y_{T} after controlling for all measures of M that fall between Time 1 and time T. The magnitude of this effect reflects the degree to which M fails to explain completely the X-Y relation. The sum of the overall direct effect and the overall indirect effect will equal the total effect of X1 on Y_{T}.

4. Tests of statistical significance. Statistical tests for the overall indirect effect and the overall direct effect (described above) have not yet been developed. Although Sobel (1982) and Baron and Kenny (1986) have described tests for the indirect effect in cross-sectional designs, they do not extend to the overall indirect effect in multiwave designs. Under most circumstances, however, certain necessary conditions for longitudinal mediation can be tested, even though none of these tests addresses the overall indirect effect itself. For the overall direct effect, there is no necessary condition that can be tested.

Does ab = 0? For mediation to exist, the product ab must be nonzero. In the case where only three waves of data exist, the test of ab = 0 is both necessary and sufficient for mediation, as there is only a single tracing (ab) whereby X1 has an impact on Y3: through M2. **[11]** When more than three waves exist (i.e., when T > 3), the overall indirect effect of X1 on Y_{T} through M is also a function of paths x, m, and y. If x, m, and y all equal zero, the overall indirect effect of X1 on Y_{T} through M will be zero, regardless of the value of ab (even though three-wave mediation can exist). Tests of x, m, and y are possible by comparing a mediational model in which x, m, and y are free (e.g., the model described in Step 4 above) to a reduced model in which x, m, and y are constrained to zero. If this comparison is significant, at least one of the three parameters is nonzero (and one is enough if ab is also nonzero).

The test of ab may be accomplished in two ways. The first involves testing a and b separately. For example, the model that derives from Step 4 can be compared to a reduced model in which a = 0 (or b = 0). If a and b are nonzero, their product is also nonzero. A potential downside to this approach may be its relatively low power. Although Monte Carlo studies have not been published on the power of longitudinal mediational designs, MacKinnon, Lockwood, Hoffman, West, and Sheets’s (2002) examination of the cross-sectional case suggests that stepwise approaches tend to have problems with power (in part because the joint probability of rejecting two null hypotheses can be small). For general information on estimating power in SEM, see MacCallum, Browne, and Sugawara (1996) and Muthen and Curran (1997).

Alternatively, one can test the ab product directly. According to Baron and Kenny’s (1986) expansion of Sobel’s (1982) **[12]** calculations, the standard error for the indirect effect ab is (b²s_{a}² + a²s_{b}² + s_{a}²s_{b}²)½ (assuming multivariate normality), where s_{a} and s_{b} are the standard errors for a and b, respectively (see Holmbeck, 2002, for examples). The test is relatively straightforward because estimates of a, b, s_{a}, and s_{b} all derive from traditional SEM procedures. If ab is nonzero, the X->Y relation is at least partially mediated by M (under the stability assumptions described above). Even though the test of ab tends to be more powerful than the tests of a and b separately, we must emphasize that ab is not an estimate of the overall mediational effect per se. For that calculation, multiple effects must be summed, as demonstrated under Question 4.
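The direct test of ab can be carried out by hand from SEM output. A minimal sketch, using the standard-error formula quoted above (Baron and Kenny's expansion adds the s_{a}²s_{b}² term to Sobel's original expression); the coefficient and standard-error values here are hypothetical:

```python
import math

# Hypothetical estimates taken from SEM output:
a, s_a = 0.40, 0.10   # path X -> M and its standard error
b, s_b = 0.30, 0.08   # path M -> Y and its standard error

# Baron & Kenny's (1986) standard error for the product ab:
se_ab = math.sqrt(b**2 * s_a**2 + a**2 * s_b**2 + s_a**2 * s_b**2)

# z-test of ab = 0 (assuming multivariate normality); |z| > 1.96
# rejects the null at the .05 level.
z = (a * b) / se_ab
print(se_ab, z)
```

Note that, as stated above, a significant ab establishes partial mediation but is not itself the overall mediational effect in a multiwave design.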

Interestingly, all of these tests are possible when only two waves of data are available. Under certain assumptions, a two-wave model yields eminently testable estimates of all five parameters: a, b, x, m, and y. The trouble is, however, that these assumptions are not testable with only two waves of data. One is the stationarity assumption: All paths connecting Wave 1 to Wave 2 must be identical to those that would have connected future waves, had such data been collected. The other assumption is that the optimal time lag for X to affect M is the same as the time lag for M to affect Y, a test that also requires many waves of data.

Does c = 0? In the cross-sectional case, Baron and Kenny (1986) noted that c’ must equal zero for mediation to be complete. In the longitudinal case, however, no single path must be nonzero for an overall direct effect of X1 on Y_{T} to exist. In Model 5 (see Figure 5), Path c would seem to be such a path. In reality, however, X1 could directly affect Y2 or Y4 or Y_{T} without ever directly affecting Y3. For mediation to be complete, all direct effects of X1 on Yj (where 1 < j ≤ T) must be zero, assuming x and y are nonzero. Testing this hypothesis is possible using methods described under Step 3 (above).

**[11]** This assumes that the waves are separated by the optimal time interval, a requisite condition outlined in Question 4.

**[12]** Most SEM programs that provide a test of the indirect effect use Sobel’s (1982) test, not Baron and Kenny’s (1986) expansion. Sobel’s test assumes that coefficients a and b are uncorrelated, whereas Baron and Kenny’s formula allows for such covariation.

**A Caveat About Parsimonious Models**

The preceding steps are designed to lead us to the most parsimonious model that still provides a good fit to the data. Such parsimony, however, comes at a price. To attain parsimony, we fix paths to zero or constrain parameters to equal one another. In reality, no path is ever exactly zero, and no two paths are ever exactly equal. When we place such constraints, we guarantee that our parameter estimates (and not just those that are constrained) will be biased. We hope that our goodness-of-fit tests prevent us from settling on a model in which the bias is large, but this hope hinges on having powerful goodness-of-fit tests in the first place. An alternative is to remove these constraints, estimate fuller models, and examine the confidence intervals around key parameter estimates.

**Conclusions and Future Directions**

In this article, we have described a number of methodological problems that arise in longitudinal studies of mediational processes. We have also recommended five steps or procedures designed to improve such studies. These steps require that previous research has already revealed the optimal time lag for the longitudinal design. The steps include (1) the careful selection of multiple measures of each construct and the subsequent testing of the intended measurement model, (2) testing for the existence of unmeasured variables that impinge on the mediational causal model, (3) testing for the existence of causal processes not anticipated by the mediational model, (4) testing the assumption of stationarity, and (5) estimating the overall (not time-specific) direct and indirect effects in the appraisal of mediational processes. We recognize that every researcher will not be able to implement all of these procedures in every study. We present these procedures as methodological goals or guidelines, not as absolute requirements. Nevertheless, we recommend that researchers carefully acknowledge the methodological limitations of their studies and formally describe the potentially serious consequences that can result from such limitations. Such candor paves a smoother path for future research.

Still more work remains to be done in the refinement of methodologies for testing mediational hypotheses. One critical area concerns the use of parsimonious versus fuller models for parameter estimation. Estimates derived from parsimonious models typically have smaller variances (a desirable characteristic in a statistic); however, they are also biased. Estimates derived from saturated models will be unbiased; however, they often have larger variances. Substantial work is needed to examine the relative efficiency of parameter estimates based on parsimonious versus saturated models. A second area pertains to the testing of overall indirect effects. A general formula for the standard error of the overall mediational effect has not been developed. The standard error for the product ab, developed by Sobel (1982) and refined by Baron and Kenny (1986), only tests the overall mediational effect in three-wave designs. Although we can estimate overall indirect effects in designs with more than three waves, their statistical significance cannot be tested directly. Shrout and Bolger (2002) demonstrated the virtues of bootstrapping for estimating standard errors and testing mediational effects in cross-sectional designs. Based on the success of the bootstrap in cross-sectional designs, it may hold promise as a method for estimating standard errors and testing hypotheses regarding overall indirect effects in longitudinal designs.

A third area in need of research pertains to the development of methods for determining the optimal frequency with which waves of longitudinal data should be collected. The optimal time lag will no doubt vary from mediational model to mediational model. Indeed, the optimal lag may vary from one part of the mediational model (e.g., X->M) to another part of the same model (e.g., M->Y). Clear procedures are needed to guide researchers as they address this essential preliminary question. A fourth area pertains to the robustness of mediational tests to violations of methodological recommendations and statistical assumptions. For example, how serious is the bias that results from violations of the assumption of multivariate normality (Finch, West, & MacKinnon, 1997)? Clearly, SEM procedures for testing mediational processes are still being refined. Nevertheless, most researchers have neglected the methodological advances in this area that have already been made. For this, the potential consequences may be considerable. A fifth area pertains to the sample size needed to obtain unbiased and efficient estimates of mediational effects. Preliminary work on cross-sectional mediational designs (e.g., the N needed to estimate ab) has only recently been completed (MacKinnon, Warsi, & Dwyer, 1995). Analogous work on longitudinal mediational designs (e.g., the N needed to estimate overall indirect effects) has not yet been conducted.

Finally, we should point out that SEM methods represent only one approach to the study of mediation, and not necessarily even the best approach. Procedures such as multilevel modeling (Krull & MacKinnon, 1999, 2001) and latent growth curve analysis (Duncan, Duncan, Strycker, Li, & Alpert, 1999; Willett & Sayer, 1994) represent alternative methods for the conceptualization and study of change. Recently, a few examples have emerged that apply such methods to questions about causal chains, such as those involved in mediational variable systems. Nevertheless, the most compelling tests of causal and mediational hypotheses derive from randomized experimental designs. In observational–correlational designs, we rely on statistical controls, not on random assignment and experimental manipulation. In randomized experimental approaches to mediation, the investigator may seek to disable or counteract the mediator rather than allowing it to covary naturally as a function of variables, only some of which are measured in the study. In the interest of methodological multiplism, we urge investigators to tackle questions of mediation using a variety of research designs, each of which has its own strengths and limitations.

**1. Introduction.**

According to Jensen (**1998**, chap. 14) the IQ g factor “is causally related to many real-life conditions, both personal and social. All these relationships form a complex correlational network, or nexus, in which g is the major node” (p. 544). There could be some mediation through g between past economic status (e.g., familial SES) and future status. In a longitudinal study, for example, g would be the most meaningful predictor (or one of the most meaningful predictors) of a future outcome such as social mobility (Schmidt & Hunter, **2004**), even controlling for SES, as evidenced by sibling studies of within-family differences in IQ/g (Murray, **1998**). On the other hand, the non-g components of IQ tests do not have meaningful predictive validity (Jensen, **1998**, pp. 284-285) in terms of the difference in incremental R² (when non-g is added after g versus when g is added after non-g).

The finding that occupational complexity mediates the relationship between IQ and income (Ganzach et al., **2013**), or that job complexity mediates the IQ-performance correlation (Gottfredson, **1997**, p. 82), is in line with the suggestion that cognitive complexity is the main ingredient in many g-correlates (Gottfredson, **1997**, **2002**, **2004**; Gottfredson & Deary, **2004**). The decline in the standard deviation (SD) of IQ scores with increasing occupational level also supports this interpretation (Schmidt & Hunter, **2004**, p. 163), as it suggests that success depends on a minimum cognitive threshold that tends to rise at higher occupational levels.

Opponents of this interpretation usually argue that IQ tests measure nothing more than knowledge, mostly based on school and experience, and that the question of complexity is irrelevant. This assumption cannot readily explain individual and group differences on tests that have proven impervious to learning and practice and that, instead of depending on knowledge, merely reflect the capacity for mental transformation or manipulation of an item’s elements in relatively culture-free tests (e.g., Jensen, **1980**, pp. 662-677, 686-706; **2006**, pp. 167-168, 182-183). Because more complex jobs depend less on job knowledge, their demands are less automatable, which implicates fluid g rather than crystallized g (Gottfredson, **1997**, pp. 84, 97; **2002**, p. 341). Additionally, the declining correlation between experience and job performance over the years (Schmidt & Hunter, **2004**, p. 168) suggests that lengthy experience (e.g., through accumulated knowledge) does not compensate for low IQ. Knowledge simply does not appear to be the active ingredient underlying the IQ-success association.

There have been suggestions that the covariation between g and social status works through an active gene-environment correlation, as individuals increasingly shape their life niches over time (Gottfredson, **2010**). To illustrate this, Jensen (**1980**, p. 322) explains that success itself (due in part to higher g) acts as a motivational factor magnifying the advantage of a higher IQ (g) level; the reverse occurs when failures accumulate over time.

Gottfredson (**1997**, p. 121) and Herrnstein & Murray (**1994**, ch. 21-22) expected that, over time, the cognitive load of everyday life would tend to increase. Gottfredson believed this to be the inevitable outcome of societal modernization, and Herrnstein & Murray proposed a complementary mechanism: government regulation makes life more cognitively demanding because of the necessity to cope with newly established laws. Theoretically, this sounds fairly reasonable. Given this, we should have expected IQ-correlates (notably with the main variables of economic success) to have increased over the past few decades. Strenze (**2007**), however, found no trend at all. One explanation is range restriction in years (1960s-1990s in Strenze’s data): there was probably no drastic change in everyday life within this “short” period of time. Another is that, over time, all other things are not equal; alternative factors fluctuate in both directions and could have masked the relationship between cognitive load and time.

**2. Method.**

I will demonstrate presently that parents’ SES predicts children’s achievement (GPA + educational years) independently of g but also through g (ASVAB or ACT), using a latent variable approach. Next, I use a latent GPA as the independent variable.

**2.1 Data.**

The present analysis uses the NLSY97, available **here** (a free account is needed). The variables included in the CFA-SEM model are parental income (square-root transformed in order to respect distributional normality), mother’s and father’s education, GPA in English, foreign languages, math, social science, and life science, and the 12 ASVAB subtests. All of these variables have been age/gender adjusted, except parental grade/income. Refer to the syntax **here**.

To be clear about the ASVAB subtests:

GS. General Science. (Science/Technical) Knowledge of physical and biological sciences.

AR. Arithmetic Reasoning. (Math) Ability to solve arithmetic word problems.

WK. Word Knowledge. (Verbal) Ability to select the correct meaning of words presented in context and to identify best synonym for a given word.

PC. Paragraph Comprehension. (Verbal) Ability to obtain information from written passages.

NO. Numerical Operations. (Speed) Ability to perform arithmetic computations.

CS. Coding Speed. (Speed) Ability to use a key in assigning code numbers to words.

AI. Auto Information. (Science/Technical) Knowledge of automobile technology.

SI. Shop Information. (Science/Technical) Knowledge of tools and shop terminology and practices.

MK. Math Knowledge. (Math) Knowledge of high school mathematics principles.

MC. Mechanical Comprehension. (Science/Technical) Knowledge of mechanical and physical principles.

EI. Electronics Information. (Science/Technical) Knowledge of electricity and electronics.

AO. Assembling Objects. (Spatial) Ability to determine how an object will look when its parts are put together.

More information on ASVAB (**here**) and transcript (**here**).

**2.2. Statistical analysis.**

SPSS is used for EFA. AMOS is used for the CFA and SEM analyses. All of these analyses carry assumptions: they work best with normally distributed (univariate and multivariate), continuous variables. AMOS can still perform SEM with categorical variables using a Bayesian approach. Read Byrne (**2010**, pp. 151-160) for an application of Bayesian estimation.

Structural equation modeling is a kind of multiple regression that allows us to decompose correlations into direct and indirect paths among construct (i.e., latent) variables. We can see it as a combination of CFA and path analysis. There is a crucial difference between path analysis and SEM in that a latent variable approach (e.g., SEM) has the advantage of removing measurement error when estimating the regression paths. This results in higher (disattenuated) correlations. There is no firm rule for minimum N. A common guideline is N > 200, but some examples with N of 100 or 150 work well (convergence and a proper solution).

CFA requires continuous and normally distributed data. When the variables are of **ordinal type** (e.g., ‘completely agree’, ‘mostly agree’, ‘mostly disagree’, ‘completely disagree’), we must rely on polychoric correlations when both variables are categorical, or polyserial correlations when one variable is categorical and the other is continuous. CFA normally assumes no cross-loadings; that is, subtests composing the latent factor “Verbal” are not allowed to have significant loadings on another latent factor, e.g., Math. Even so, forcing a (CFA) measurement model to have zero cross-loadings when EFA reveals just the opposite will lead to misspecification and distortion of factor correlations as well as factor loadings (Asparouhov & Muthén, **2008**).

But when there are cross-loadings, there is said to be a lack of unidimensionality or of construct validity, in which case, for instance, parceling (i.e., averaging) items into indicators for building a latent factor is not recommended (Little et al., **2002**). Having two (or more) latent factors is theoretically pointless if the observed variables load on every factor, in which case we should not have assumed the existence of different factors in the first place.

It is possible to build a latent factor in SEM with only one observed variable. This is done by fixing the error term of that observed variable to a constant (e.g., 0.00, 0.10, 0.20, …). But even fixing the (un)reliability of that ‘latent factor’ does not mean it should be treated or interpreted as a truly latent factor, because in this case it is a “construct” in name only.

By definition, a latent variable is an unobserved variable assumed to cause the observed (indicator) variables. In this sense, it is called a reflective construct, which is why the indicators have arrows going into them, each with an error term. The opposite is a formative construct, where the indicators are assumed to cause the construct (which becomes a weighted sum of these variables), which is why the construct itself has an arrow going into it with a residual (error term, or residual variance). This residual “represents the impact of all remaining causes other than those represented by the indicators included in the model” (Diamantopoulos et al., **2007**, p. 16). Hence, the absence of a residual term indicates that the formative indicators are modeled as accounting for 100% of the variance in the formative construct, a very unlikely assumption.

If we want to fit a formative model, a special procedure is needed to achieve identification (MacKenzie et al., **2005**, p. 726; Diamantopoulos et al., **2007**, pp. 20-24). For example, instead of having three verbal indicators caused by a latent verbal factor, we must draw the arrows from those (formative) indicators to the latent variable and fix one of these paths to 1, with each formative construct emitting paths to at least 2 (unrelated) observed variables and/or reflective latent variables. Finally, the formative indicators causing the construct should be allowed to covary (double-headed arrows) among themselves. Given the controversy and ongoing debate surrounding formative models and, by extension, MIMIC (multiple indicator multiple cause) models, I will simply ignore this approach. The latent variables used presently are of the reflective kind.

**2.2.1. Fundamentals.**

**a) Practice.**

The AMOS output usually lists 3 models: 1) the default model, 2) the saturated model, and 3) the independence model. The first is the model we have specified (it could also be called the reduced, or over-identified, model); this is the one we are interested in. The second is a (full) model that assumes everything is related to everything (just-identified), with a direct path from each variable to each other variable; it has a perfect fit, but this in itself is meaningless. The third assumes that nothing is related to anything; it is the null model (no parameters estimated).

In SEM, the model needs to be well specified. Otherwise the model is said to be misspecified (a disparity between real-world relationships and model relationships), either through over-specification (including irrelevant explanatory variables) or under-specification (omitting relevant explanatory variables). This can occur when the model specifies no relationship (fixed to zero) between two variables whose correlation is in fact non-zero.

Some problems in modeling can even prevent the program from calculating parameter estimates. One such problem is “under-identification,” which appears when there are more unknown parameters than known pieces of information, resulting in negative degrees of freedom (df). Page 103 of the **Amos 20 User’s Guide** displays a structural path diagram with an explanation of how to make the model identified.

To better understand the problem of (non)identification, consider an example with 3 variables, X, Y, and Z. This gives us 6 pieces of information: the variance of each of X, Y, and Z (3 variances), plus the covariance of each variable with every other (X with Y, Y with Z, X with Z; 3 covariances). Identification problems emerge when the number of freely estimated parameters exceeds the pieces of information available in the sample variance-covariance matrix.

Concretely, a model with 3 knowns and 4 unknowns (to be estimated) yields 3 − 4 = −1 df and is clearly no good, because calculation of the model parameters cannot proceed. More constraints must be imposed in order to achieve identification. If, on the other hand, the df equals zero, the model is said to be “just-identified,” meaning there is a unique solution for the parameters while goodness of fit cannot be computed; the model’s fit to the data is perfect by construction. And if df is greater than zero (more knowns than unknowns), the model is said to be “over-identified,” which means the parameters are over-determined: no set of estimates reproduces the data exactly, and the best solution is chosen by maximizing the likelihood function. However, there is no need to worry much about model identification: AMOS will not run or calculate the estimates if df is negative. This is seen in the panel to the left of the “Graphics” space, where the default model shows an “XX” instead of an “OK”.
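The counting logic above is easy to mechanize. A small sketch (the helper names are mine, not AMOS terminology): with p observed variables there are p(p+1)/2 knowns, and df is knowns minus freely estimated parameters.

```python
def model_df(n_observed, n_free_params):
    """df = knowns - unknowns, where knowns = p*(p+1)/2
    (p variances plus p*(p-1)/2 covariances)."""
    knowns = n_observed * (n_observed + 1) // 2
    return knowns - n_free_params

def identification_status(df):
    """Classify a model by its degrees of freedom."""
    if df < 0:
        return "under-identified"   # cannot be estimated
    if df == 0:
        return "just-identified"    # unique solution, perfect fit
    return "over-identified"        # fit can be tested

# Three observed variables -> 3 variances + 3 covariances = 6 knowns.
print(model_df(3, 4), identification_status(model_df(3, 4)))  # 4 free parameters
print(model_df(3, 6), identification_status(model_df(3, 6)))  # 6 free parameters
print(model_df(3, 7), identification_status(model_df(3, 7)))  # 7 free parameters
```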

Remembering what df is matters for the concepts of model parsimony and complexity. The more degrees of freedom a model has, the more parsimonious it is. The opposite is model complexity, which can be thought of as a greater number of free parameters (i.e., unknowns to be estimated) in the model; lower df indicates a more complex model. Complexity can be increased simply by adding more (free) parameters, e.g., paths, covariances, and even cross-loadings. To be sure, freeing a parameter means we allow that unknown (free) parameter to be estimated from the data. Free parameters are the added paths, covariances, or variables (exogenous or endogenous) that do not have their mean and variance fixed at a given value. But doing so has consequences for model fit, since fit indices penalize higher model complexity (see below).

In SEM, identification is achieved by, among other possibilities, fixing one of the factor loadings of the indicator (observed) variables at 1. The (observed, not latent) variable receiving this fixed loading of 1 is usually called the marker variable. This puts the variables on the same scale, so they can be treated as standardized variables (mean of 0, variance of 1). It does not matter which variable is chosen as the marker; the choice affects neither the estimates nor model fit. Alternatively, we can select the latent factor and fix its variance at 1; a problem emerges, however, when this latent variable has an arrow going into it, because AMOS will not allow the path to be defined unless we remove the variance constraint of 1, as AMOS requires.

Among other identification problems is so-called empirical under-identification, occurring when data-related problems make the model under-identified even though it is theoretically identified (Bentler & Chou, **1987**, pp. 101-102). For example, latent factors normally need at least 3 indicators, and yet the model will not be identified if one factor loading approaches zero. Similarly, a two-factor model requires the correlation/covariance between the factors to be nonzero.

In SEM, the parameters are the regression coefficients for paths between variables, but also the variances/covariances of independent variables. In structural diagrams, the dependent or criterion variables are those receiving a single-headed arrow (going into them, not starting from them). They are called endogenous variables, as opposed to exogenous (i.e., predictor) variables. Covariances (non-causal relationships) are drawn as curved double-headed arrows and, importantly, they cannot be modeled among endogenous variables or between endogenous and exogenous variables: covariances involve only exogenous variables. These exogenous variables should not have error terms because they are assumed to be measured without error (even though this assumption may seem unrealistic). AMOS assumes no covariance between exogenous variables if there is no curved double-headed arrow linking them, or if that arrow has a covariance value of zero (see the **AMOS user guide**, p. 61). Either way, this is interpreted as a constraint (not estimated by the program). Likewise, AMOS assumes a null effect of one variable on another if no single-headed arrow links the two.

Here, the large circles are latent variables; the small circles are called errors (e1, e2, …) or residuals in the case of observed (manifest) variables, and disturbances (D) in the case of latent variables; and the rectangles are the observed variables (what we have in our raw data set).

In the above picture, we see the annotation ‘0,’ next to the disturbance of the latent SES, where the number to the left of the comma designates the mean of the error term, which is fixed at zero by default. The number to the right of the comma designates the variance of the error term. The absence of any number indicates that we want the error variance freely estimated; otherwise we can fix it at 0 or 1, or any other value. And if we want to constrain the error/variable variances to be equal, we can do so by assigning them a single label (instead of a number), identical for all of them (**AMOS user guide**, pp. 43-45).

Again, if we decide not to fix the mean, there should be no number associated with the error term or latent factor. To do this, click on the circle of interest and specify the number (0 or 1) to be fixed, or remove the number if we decide not to fix it at a constant.

Above we see the two predictor variables (ASVAB and SES) sharing a curved double-headed arrow; that is, they covary. In this way, we get the independent impact of X1 on Y controlling for the influence of X2, and the impact of X2 on Y controlling for X1. When reporting the significance of a mediation, it is still possible to report the standardized coefficient with the p-value of the unstandardized coefficient.

**b) Methodological problems.**

In assessing fit in SEM, we must always proceed in two steps: first, look at the fit of the measurement model (e.g., ASVAB models of g) and, second, look at the structural model connecting the latent variables. In doing so, when a bad fit is detected, we can locate where it originates: the measurement or the structural portion of the SEM model.

When it comes to model comparisons (target versus alternative), Tomarken & Waller (**2003**, p. 583) noticed that most researchers do not explicitly consider the plausibility of the alternative model(s) they test. If we had two predictors (X) affecting one outcome (Y) variable, and in the real world the two predictors are very likely to be related (say, scholastic tests and years of schooling), a model that assumes the two predictors are uncorrelated is not plausible in real life, and therefore the alternative model has no meaning. The best approach is to compare two (theoretically) plausible models, because a model is, by definition, supposed to represent a theory with real-life implications.

In SEM, the amount of mediation can be defined as the reduction of the effect of X on Y when M is added as a mediator. Say X->Y is labeled path C in the model without M, and X->Y is labeled path C’ in the model with M, whereas X->M is labeled A and M->Y is labeled B. The amount of mediation is then the difference between C and C’ (C-C’), and the mediation path AB is equal to C-C’. The same applies to moderator variables; say, variables A, B, AxB (the interaction) and Y in model 1, and variables A, B, AxB, M (the mediator) and Y in model 2, where the difference in the direct path AxB->Y between models 1 and 2 indicates the attenuation due to the mediator M. See Baron & Kenny (**1986**) for a discussion of more complex models.
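The decomposition just described can be checked with a toy numerical example: in a simple three-variable path model, the total effect C equals the direct effect C' plus the indirect effect A*B. The path values below are hypothetical population coefficients (standardized, no measurement error).

```python
# Hypothetical standardized path coefficients:
A = 0.5        # X -> M
B = 0.4        # M -> Y (controlling for X)
C_prime = 0.2  # direct X -> Y with M in the model

# Total effect of X on Y in the model without M:
C = C_prime + A * B

# The amount of mediation is C - C', which equals the product A*B.
amount_of_mediation = C - C_prime
print(C, amount_of_mediation)
```

This equality (C − C' = AB) holds exactly in linear models without latent variables; with latent variables or nonlinearity the two quantities can diverge, which is one reason to report both.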

The pattern of a mediation can be affected by what Hopwood (2007, p. 266) calls proximal mediation. If variable X causes Y through the mediator M, X will be more strongly related to M than M is to Y when the time elapsed between X and M is shorter (say, 2 days) than between M and Y (say, 2 months). Conversely, we speak of distal mediation when M is closer (in time) to Y, resulting in overestimation of M->Y and underestimation of X->M.

Hopwood (2007, p. 264) also explains how moderators (interaction effects) and curvilinear functions (effects varying at certain levels of M) can be included in a model. For X and M causing Y, we add a variable labeled XM (obtained by simply multiplying X by M, which can be done in SPSS, for instance) causing Y. We then add two further variables, squared M and squared XM. The regression model is simply illustrated as follows: X+M+XM+M²+XM²=Y. Here, the XM² term represents the curvilinear moderation. Gaskin has a series of videos describing the procedure in AMOS (**Part_7**, **Part_8**, **Part_9**, **Part_10**). See also Kline (**2011**, pp. 333-335).

One traditional way of testing moderation is to compare the effect of X on Y separately in different groups (gender, race, …); if the relationship differs across groups, an interaction effect may be thought to be operating. Nonetheless, this method is problematic because the variance of X may be unequal across levels of M. In that case, differing effects of X on Y across levels of M may be attributable to differences in range restriction among the groups, not to moderation. Likewise, if the measurement error of X differs across levels of M, this alone can produce apparent moderation (X->Y varying across levels of M). A latent variable approach (e.g., SEM) tends to attenuate these concerns. Note that an unstandardized regression coefficient, unlike a correlation, is not affected by the variance of the independent variable or by measurement error in the dependent variable, and Baron & Kenny (**1986**, p. 1175) recommend its use in this case.
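The regression X+M+XM+M²+XM²=Y described above can be sketched as follows (a minimal simulation with made-up coefficients; the data are noiseless, so least squares recovers the coefficients exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
m = rng.normal(size=n)

# Product and quadratic terms, as in the model X + M + XM + M^2 + XM^2 = Y
xm, m2, xm2 = x * m, m ** 2, x * m ** 2

# Generate Y from known (made-up) coefficients: intercept, X, M, XM, M^2, XM^2
true = np.array([1.0, 2.0, 3.0, 0.5, -0.7, 0.25])
design = np.column_stack([np.ones(n), x, m, xm, m2, xm2])
y = design @ true

est = np.linalg.lstsq(design, y, rcond=None)[0]
print(np.round(est, 3))  # the XM^2 coefficient captures the curvilinear moderation
```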

The construction of moderator variables in SEM has the advantage of attenuating the measurement error usually associated with the interaction term (Tomarken & Waller, **2005**, pp. 45-46), although their use can bring complications (e.g., specification and convergence problems). The typical way is to multiply all of the indicators (a1, a2, …) of the latent X by the indicators (b1, b2, …) of the latent M, such as a1*b1, a1*b2, a2*b1, a2*b2, and then to create the latent interaction variable from all of these product indicators (see **here**). The same procedure applies when modeling a quadratic effect, e.g., creating a latent variable X² with a1², a2², and so on. Alternatively, Hopwood (2007, p. 268) proposes conducting a factor analysis with Varimax rotation, saving the standardized factor scores, and creating the interaction variable by multiplying these factor-score variables.
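The product-indicator construction can be sketched mechanically (indicator names a1, a2, b1, b2 follow the text; the data here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
# Hypothetical indicators: columns of a_ind load on latent X, columns of b_ind on latent M
a_ind = rng.normal(size=(n, 2))
b_ind = rng.normal(size=(n, 2))

# All pairwise products a_i*b_j serve as indicators of the latent interaction XM
products = np.column_stack([a_ind[:, i] * b_ind[:, j]
                            for i in range(2) for j in range(2)])
print(products.shape)  # (200, 4): a1*b1, a1*b2, a2*b1, a2*b2
```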

In some scenarios, measurement error can be problematic (Cole & Maxwell, **2003**, pp. 567-568). When X is measured with error but M and Y are not, the path X->M will be downwardly biased while M->Y and X->Y are unaffected. For example, with a reliability of 0.8 for X, a path that would be 0.8 under perfect reliability becomes 0.8*SQRT(0.8)=0.72. When only Y is measured with error, only the path M->Y is biased. But when M is measured with error, the (direct) path X->Y is upwardly biased while the other two (indirect) paths are underestimated. Again, the latent variable approach is an appropriate solution to this problem, and this is one of the great advantages of SEM over simple path analysis.
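The attenuation arithmetic can be verified directly (observed path = true path × √reliability, following the example in the text):

```python
import math

def attenuated_path(true_path, reliability_x):
    """Expected observed X->M path when X is measured with the given reliability
    (observed path = true path * sqrt(reliability of X))."""
    return true_path * math.sqrt(reliability_x)

print(round(attenuated_path(0.8, 0.8), 2))  # 0.72, matching the example above
```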

In a longitudinal design, shared method variance can upwardly bias the correlations between factors when the factors are measured by indicators assessed with the same method (or measure) rather than with different methods (e.g., child self-report vs. parent self-report). In the field of intelligence, the method is equivalent to the IQ test measuring a particular construct (verbal, spatial, …) when the same measure is administered several times at different points in time. But repeated measures (or tests) over time are virtually impossible to avoid in longitudinal data (e.g., CNLSY79), so such effects must be controlled by allowing correlated errors of same-method indicators across factors. In all likelihood, the correlation (or path) between factors will be diminished because the upwardly biasing influence of method artifacts on the correlation has been removed. See Cole & Maxwell (**2003**, pp. 568-569). Several data analysts (Little et al., **2007**; Kenny, **2013**) add that we should test for measurement invariance of factor loadings, intercepts, and error variances if we want to claim that the latent variable remains the “same” variable at each assessment. Weak invariance (loadings) is sufficient for examining covariance relations, but strong invariance (intercepts) is needed to examine mean structures.

The same authors also address the crucial question of longitudinal mediation. Most studies use cross-sectional mediation, with all data gathered at the same time point. No time elapses between the predictor, mediator, and outcome (dependent) variables, so causal inferences are not possible. The choice of variables matters too: retrospective measures are not recommended, for example. Background measures (e.g., education level) can hardly be viewed as causal effects because they don’t move over time, but some others (e.g., income and job level) can move up and down over the years.

In some cases, researchers proceed with a half-longitudinal approach, e.g., the predictor (X) and mediator (M) measured at time 1 and the outcome (Y) at time 2. They argue such an approach is not appropriate either: X->M will be biased because X and M coincide in time and the prior level of M is not controlled. Likewise, when M and Y are assessed at the same time but X is not, M->Y will be biased. Ideally, causal models must control for previous levels of M when predicting M, and for previous levels of Y when predicting Y. Using a half-longitudinal design with 2 waves, we can still use X1 to predict M2 controlling for M1, and use M1 to predict Y2 controlling for Y1. See Little et al. (**2007**, Figure 3) for a graphical illustration. Although this practice is less bias-prone, we can only test whether M is a partial mediator, not whether it could be a full mediator. Furthermore, we cannot test for violation of the stationarity assumption. Stationarity denotes an unchanging causal structure, that is, the degree to which one set of variables produces change in another set remains the same over time. For example, if longitudinal measurement invariance (e.g., constraining a given parameter to equal its counterparts at subsequent waves) is violated, the stationarity assumption would be rejected.
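The two half-longitudinal regressions can be sketched as follows (a noiseless toy simulation with made-up path values, so the regressions recover them exactly; this illustrates only the regression structure, not a full latent-variable SEM):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
x1 = rng.normal(size=n)
m1 = 0.4 * x1 + rng.normal(size=n)
y1 = 0.3 * m1 + rng.normal(size=n)

# Wave-2 scores generated without residual noise so OLS recovers the paths exactly
m2 = 0.5 * x1 + 0.6 * m1     # X1 -> M2, controlling for M1
y2 = 0.35 * m1 + 0.55 * y1   # M1 -> Y2, controlling for Y1

def slopes(predictors, outcome):
    design = np.column_stack([np.ones(len(outcome))] + list(predictors))
    return np.linalg.lstsq(design, outcome, rcond=None)[0][1:]

a = slopes([x1, m1], m2)[0]   # X1 -> M2 path, prior M controlled
b = slopes([m1, y1], y2)[0]   # M1 -> Y2 path, prior Y controlled
print(round(a, 3), round(b, 3), round(a * b, 3))  # a*b estimates the mediated effect
```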

Of course, the point raised by Cole & Maxwell (**2003**) is not incorrect, but what they demand is no less than: predictor(s) at time 1, mediator(s) at time 2, outcome at time 3. In other words, 3 assessment occasions. Needless to say, this goal is, in practice, virtually impossible to achieve for an independent, small team with limited resources.

**2.2.2. CFA-SEM model fit indices.**

A key purpose of SEM is to evaluate the plausibility of the hypotheses being modeled. Concretely, it allows model comparison by way of model fit indices, with the best-fitting model selected on these indices. A good fit indicates that the model reproduces the observed covariances accurately, but not necessarily that the exogenous and mediator variables explain a great portion of the variance in the endogenous variables; even an incorrect model can be made to fit the data just by adding parameters to the point where df falls to zero. Associations and effects can be small despite good model fit if the residual variances (% unexplained; specificity+error) are sufficiently high to generate equality between the implied and observed variances (Tomarken & Waller, **2003**, p. 586).

Browne et al. (**2002**) argue that very small residual variances in the observed variables, while suggesting good fit, can easily result in poor model fit as indicated by fit indices, in factor analysis and SEM. The explanation is that fit indices are more sensitive to misfit (misspecifications) when unique variances are small than when they are large. Hence, “one should carefully inspect the correlation residuals in E (i.e., the residual matrix, E=S-Σ, where S is the sample covariance matrix and Σ the fitted or implied covariance matrix). If many or some of these are large, then the model may be inadequate, and the large residuals in E, or correspondingly in E*, can serve to account for the high values of misfit indices and help to identify sources of misfit.” (p. 418). The picture is thus complicated: all this means we should not rely exclusively on fit indices (with the exception of RMR/SRMR, computed from these same residual correlation matrices) because they can sometimes be very inaccurate. Fortunately, the situation where error and specificity are so small that the observed variables are highly accurate measures of the latent variables is likely to be rare.

Unfortunately, the reverse phenomenon also occurs. In a test of CFA models, Heene et al. (**2011**) found that decreasing the factor loadings (which consequently increases the unique variances) under either of two kinds of misspecified models, 1) a simple one incorrectly assuming no factor correlation, or 2) a complex one incorrectly assuming no factor correlation and no cross-loadings, improves apparent model fit, by lowering the statistical power to detect model misfit. In this way, a model can be validated erroneously by fit indices (e.g., χ²). Even the popular RMSEA, SRMR and CFI are affected by this phenomenon. Another gloomy finding is the lack of sensitivity to misspecification (i.e., capability of rejecting models) for RMSEA when the number of indicators per factor (i.e., free parameters) increases, irrespective of factor loadings. For properly specified models, it seemed that χ² and RMSEA are not affected by loading size, while SRMR and CFI can be.

Similarly, Tomarken & Waller (**2003**, p. 592, Fig. 8) noted that, in typical SEM models, the power to detect misspecification, holding sample size constant, depends on 1) the magnitude of the factor loadings and 2) the number of indicators per latent factor. Put otherwise, the power to detect any difference in fit between models improves when 1) or 2) increases, but this improvement depends on sample size (their range of N was between 100 and 1000).

There are three classes of goodness-of-fit indices: absolute fit (χ², RMR, SRMR, RMSEA, GFI, AGFI, ECVI, NCP), which considers the model in isolation; incremental fit (CFI, NFI, TLI/NNFI, RFI, IFI), which compares the actual model to the independence model where nothing is correlated with anything; and parsimonious fit (PGFI, PNFI, AIC, CAIC, BCC, BIC), which adjusts for model parsimony (or df). An ideal fit index is one that is sensitive to misspecification but not to sample size (and other artifacts). What must be avoided is not necessarily trivial misspecification but severe misspecification. Here is a description of the statistics commonly used:

The χ²/df (CMIN/DF in AMOS), also called relative chi-square, is a badness-of-fit measure, with higher values denoting worse fit because it evaluates the difference between the observed and expected covariance matrices. However, χ² is too sensitive: it increases with sample size (e.g., like NFI) and its value diminishes with model complexity due to the reduction in degrees of freedom, even if dividing χ² by the degrees of freedom overcomes some of these shortcomings. This index evaluates the discrepancy between the actual (and independence) model and the saturated model. The lower, the better. AMOS also displays p-values, but we never see a value like 0.000; instead we have ***, which means it is highly significant. The p-value here is the probability of getting as large a discrepancy as occurred with the present sample and thus tests the hypothesis that the model fits perfectly in the population.

In AMOS, PRATIO is the ratio of how many paths you dropped to how many you could have dropped (all of them), in other words, the df of your model divided by the df of the independence model. The Parsimony Normed Fit Index (PNFI) is the product of NFI and PRATIO, and PCFI is the product of CFI and PRATIO. The PNFI and PCFI are intended to reward models that are parsimonious (contain few paths), that is, they adjust for df. A parsimonious model is one that is less complex (less parameterized) and therefore makes more restrictive assumptions (think of Occam’s Razor).

The AIC/CAIC and BCC/BIC are somewhat similar to the χ² in the sense that they evaluate the difference between observed and expected covariances; low values are therefore indicative of better fit. Like the χ²/df, they adjust for model complexity. Absolute values for these indices are not meaningful; what matters is the relative value when comparing models. Despite AIC being commonly reported, it has been criticized for its tendency to favor more complex models as sample size (N) increases, because the rate of increase in the badness-of-fit term grows with N even though the penalty term remains the same (Preacher et al., **2013**, p. 39). AIC also requires a minimum sample size of 200 for reliable use (Hooper et al., **2008**). Unlike AIC, CAIC adjusts for sample size. For small or moderate samples, BIC/BCC often choose models that are too simple, because of their heavier penalty (more than AIC) on complexity. It is sometimes said that BCC should be preferred over AIC.
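For reference, the commonly cited forms of these indices can be computed from the model chi-square. A sketch, using the usual textbook formulas with Ĉ = model chi-square and q = number of free parameters (AMOS's exact formulas are in its Appendix C; the input values below are hypothetical):

```python
import math

def aic(chi2, q):
    """AIC as commonly defined in SEM output: model chi-square + 2 * free parameters."""
    return chi2 + 2 * q

def bic(chi2, q, n):
    """BIC: a complexity penalty that grows with sample size."""
    return chi2 + q * math.log(n)

# Illustrative (hypothetical) model: chi-square = 100, q = 20 parameters, N = 200
print(aic(100, 20), round(bic(100, 20, 200), 2))  # 140 205.97
```

Note how the BIC penalty (q·ln N) exceeds the AIC penalty (2q) whenever N > e² ≈ 7.4, which is why BIC prefers simpler models in realistic samples.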

The Goodness of Fit Index and its adjusted version (GFI, AGFI) should not be lower than 0.90; the higher, the better. The difference between them is that AGFI adjusts for the downward bias resulting from model complexity. These indices evaluate the relative amount of variance/covariance in the data predicted by the model’s covariance matrix. GFI is downwardly biased with a larger df/N ratio, and can also be upwardly biased with a larger number of parameters and a larger sample size (Hooper et al., **2008**). GFI is not sensitive to misspecification, and Sharma et al. (2005) concluded that it should be eliminated. Note also that AGFI remains sensitive to N.

The PGFI and PNFI, respectively the Parsimony Goodness-of-Fit Index and the Parsimony Normed Fit Index, are used for choosing between models, so we look especially at their relative values; larger values indicate better fit. PGFI is based on GFI and PNFI on NFI, but both adjust for df (Hooper et al., **2008**). Neither NFI nor GFI is recommended for use (Kenny, **2013**), and there are no particular recommended threshold (cut-off) values for PGFI and PNFI.

The comparative fit index (CFI) should not be lower than 0.90. This incremental fit index shows the improvement of a given model over the baseline (or null) model in which all variables are uncorrelated. It declines slightly in more complex models and is one of the measures least affected by sample size. By way of comparison, the TLI (sometimes called NNFI, the non-normed fit index) is similar but yields a lower value than CFI. Kenny (**2013**) recommends TLI over CFI, the former giving a greater penalty for model complexity (CFI adds a penalty of 1 for every parameter estimated). MacCallum et al. (**1992**, p. 496) and Sharma et al. (2005) found, however, that NNFI (TLI) is quite sensitive to sample size, as a function of the number of indicators. Hooper et al. (**2008**) conclude the same: NNFI is higher in larger samples, just like the NFI. Also, being non-normed, it can exceed 1.0 and consequently generate outliers when N is small (i.e., ~100) or when factor loadings are small (i.e., ~0.30).

RMSEA estimates the discrepancy due to approximation, that is, the amount of unexplained (residual) variance, or the lack of fit compared to the saturated model. Unlike many other indices, RMSEA comes with its own confidence interval (narrower lower/upper limits reflecting higher precision), which measures the sampling error associated with RMSEA. The RMSEA should not be higher than 0.05, though some authors recommend cut-offs of 0.06, 0.07, 0.08, or even 0.10. The lower, the better. In principle, this index directly corrects for model complexity: of two models that explain the data equally well, the simpler one has the better RMSEA. RMSEA is affected by low df and by sample size (greater values for smaller N), but not substantially, and may become insensitive to sample size when N>200 (Sharma et al., 2005, pp. 938-939). These authors also indicate that the index is not greatly affected by the number of observed variables in the model, but Fan & Sivo (**2007**, pp. 519-520) and Heene et al. (**2011**, p. 327) report that RMSEA declines considerably with a higher number of observed variables (i.e., larger model size), the exact opposite of CFI and TLI, which tend (although not always) to indicate worse fit as the number of indicators increases. In fact, it seems that most fit indices are affected to some extent by model type (e.g., CFA models vs. models with both exogenous and endogenous latent variables). For this reason, Fan & Sivo (**2007**) argue that it is difficult to establish a flawless cut-off criterion for good model fit. The same conclusion was reached by Savalei (2012), who found that RMSEA becomes less sensitive to misfit as the number of latent factors increases, holding constant the number of indicators or indicators per factor.
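One common form of the RMSEA formula can be computed directly from the model chi-square (a sketch; note that some programs use N rather than N-1 in the denominator, so check your software's documentation; the input values are hypothetical):

```python
import math

def rmsea(chi2, df, n):
    """RMSEA from the model chi-square: sqrt(max(chi2 - df, 0) / (df * (N - 1))).
    When chi2 <= df, RMSEA is 0 (no detectable lack of fit)."""
    return math.sqrt(max(chi2 - df, 0) / (df * (n - 1)))

# Hypothetical example: chi-square = 100 on df = 50 with N = 200
print(round(rmsea(100, 50, 200), 3))  # 0.071
```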

The RMR (Root Mean Square Residual) and SRMR (standardized RMR) again express the discrepancy between the sample covariance matrix and the hypothesized covariance model. They should not be higher than 0.10. The index is an average residual value calculated from the residual covariance matrix (the sample variance-covariance matrix minus the fitted variance-covariance matrix, displayed in AMOS through the “Residual moments” option), where absolute standardized values >2.58 are considered large (Byrne, **2010**, pp. 77, 86). An SRMR of 0.05 is interpreted as meaning that the model explains the correlations to within an average error of 0.05. Because RMR and SRMR are based on squared residuals, they give no information about the direction of the discrepancy. RMR is difficult to interpret because it depends on the scale of the observed variables, but SRMR corrects for this defect. The SRMR is lower with more parameters (or as df decreases) and with larger sample sizes (Hooper et al., **2008**). In AMOS, the SRMR is not displayed in the usual fit-index output but through the “Standardized RMR” plugin: an empty window opens; leave it open and click “Calculate estimates”, and the window will display the SRMR value. This works only when no data are missing, so we must use imputation.
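The SRMR computation can be sketched directly from its usual definition: the root mean square of the standardized residuals over the p(p+1)/2 unique elements of the covariance matrix (the matrices below are made-up toy values):

```python
import numpy as np

def srmr(S, Sigma):
    """SRMR: root mean square of the standardized residuals (S - Sigma),
    averaged over the unique (lower-triangular) elements of the covariance matrix."""
    S, Sigma = np.asarray(S, float), np.asarray(Sigma, float)
    d = np.sqrt(np.diag(S))
    std_resid = (S - Sigma) / np.outer(d, d)   # standardize by the observed SDs
    idx = np.tril_indices_from(S)              # unique elements incl. the diagonal
    return np.sqrt(np.mean(std_resid[idx] ** 2))

S = np.array([[1.0, 0.5], [0.5, 1.0]])       # toy sample covariance matrix
Sigma = np.array([[1.0, 0.4], [0.4, 1.0]])   # toy fitted (implied) matrix
print(round(srmr(S, Sigma), 4))  # 0.0577
```

A perfectly fitting model (Sigma equal to S) yields an SRMR of exactly 0.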

The ECVI (Expected Cross-Validation Index) is similar to AIC. It measures the discrepancy between the fitted covariance matrix in the analyzed sample and the expected covariance matrix in another sample of the same size (Byrne, 2010, p. 82). Smaller values denote better fit. ECVI is used for comparing models, hence the absence of threshold (cut-off) values for an acceptable model. Like AIC, the ECVI tends to favor complex models as N increases (Preacher, **2006**, p. 235), because more information accrues with larger samples and models of higher complexity can be selected with greater confidence, whereas at small sample sizes these criteria are more conservative.

That being said, most statisticians and researchers would affirm that it is necessary to report the χ², df, and associated p-value, even if we don’t trust the χ². In my opinion, it is better to report all the indices we can: different indices reflect different aspects of the model (Kline, 2011, p. 225). We must rely on fit indices but also on the interpretability of the parameter estimates and the theoretical plausibility of the model.

Rule-of-thumb values are arbitrary and vary across authors, so it is not necessary to follow them in a very strict manner. Cheung & Rensvold (2001, p. 249) rightly point out that fit indices are affected by sample size, number of factors, indicators per factor, magnitude of factor loadings, and model parsimony or complexity, leading them to conclude that “the commonly used cutoff values do not work equally well for measurement models with different characteristics and samples with different sample sizes”. Fan & Sivo (**2007**) add: “Any success in finding the cut-off criteria of fit indices, however, hinges on the validity of the assumption that the resultant cut-off criteria are generalizable to different types of models. For this assumption to be valid, a fit index should be sensitive to model misspecification, but not to types of models.” (p. 527). But if we are concerned with these threshold values, CFI seems the most robust index; to the best of my knowledge, it is the least criticized one.

Finally, keep in mind that fit indices can be affected by missing-data rates when a model is misspecified (Davey & Savla, **2010**), although this varies with the nature of the misspecification.

The **AMOS user’s guide** (Appendix C, pp. 597-617) gives the formulas for the fit indices displayed in the output, with their description.

**2.2.3. Dealing with missing values.**

Traditional methods such as pairwise and listwise deletion are now considered non-optimal ways to deal with missing data in some situations; in other cases, however, listwise deletion yields unbiased estimates in regression analyses, namely when missingness on any of the independent variables does not depend on the values of the dependent variable (Allison, **2002**, pp. 10-12). “Suppose the goal is to estimate mean income for some population. In the sample, 85% of women report their income but only 60% of men (a violation of MCAR, missing completely at random), but within each gender missingness on income does not depend on income (MAR, missing at random). Assuming that men, on average, make more than women, listwise deletion would produce a downwardly biased estimate of mean income for the whole population.” (Allison, **2009**, p. 75). MCAR is a strong assumption and a condition rarely met in practice.

Generally, maximum likelihood (ML) and multiple imputation (MI) are among the most popular and recommended methods. In the MI process, multiple versions of a given dataset are produced, each containing its own set of imputed values. When performing statistical analyses, in SPSS at least, the estimates from all of these imputed datasets are pooled (though some statistics, e.g., standard deviations, are not). This produces more accurate estimates than a single imputation would. The advantage of MI over ML is its generality, which makes it usable for all kinds of models and data. A minor disadvantage of MI is that it produces different results each time we use it. Another difference is that standard errors must be (slightly) larger in MI than in ML because MI involves a random component between the imputed data sets. The MAR assumption, according to Heymans et al. (**2007**, p. 8) and many other data analysts, cannot be tested, but these authors cite several studies showing that models incompatible with MAR are not seriously affected (e.g., in their estimates and standard errors) when multiple imputation is applied. MI appears to minimize biases and, in general, is more robust to violation of assumptions than ML.

A distinction worth bearing in mind is between univariate imputation (a single imputed variable) and multivariate imputation (multiple imputed variables). The univariate version fills missing values for each variable independently. The multivariate version fills missing values while preserving the relationships between variables, and we are mostly concerned with this method because in most cases data have missing values on multiple variables. We are told that “Univariate imputation is used to impute a single variable. It can be used repeatedly to impute multiple variables only when the variables are independent and will be used in separate analyses. … The situations in which only one variable needs to be imputed or in which multiple incomplete variables can be imputed independently are rare in real-data applications.” See **here**.

A point worth recalling is that using imputed values of the dependent variable has generally been **not recommended** for multiple regression. This does not mean that the dependent variable should be excluded from the imputation procedure: if it is omitted, the imputation model assumes zero correlation between the dependent and the independent variables. The dependent variable should probably be included in the final analysis as well.

On the best practices of multiple imputation, one recommendation is the inclusion of auxiliary variables, which do not appear in the model to be estimated but serve to make the MAR assumption more plausible (Hardt et al., **2012**). They are used only in the imputation process, included along with the other incomplete variables. Because imputation is the process of guessing the missing values from the available values, it makes sense that adding more information would make the “data guessing” more accurate. These variables must not be too numerous, and their correlations with the other variables must be reasonably high (data analysts tend to suggest correlations around 0.40) to be useful for predicting missing values; the higher the correlations, the better. Some recommend that the ratio of subjects to variables (in the imputation) should never fall below 10:1. Alternatively, Hardt et al. (**2012**) recommend a maximum ratio of 1:3 for variables (with or without auxiliaries) against complete cases, that is, for 60 people with complete data, up to 20 variables could be used. An auxiliary variable is more effective when it has no (or fewer) missing values. This is another advantage of MI over ML.

The best auxiliary variables are identical variables measured at different points in time. Probably the best way to use auxiliaries is to find variables that measure roughly the same thing as the variables involved in the final structural models.

Early MI theorists once believed that a set of 5 imputations was enough to provide a stable final estimate, but after more research, others now argue that 20-100 imputations would be even better (Hardt et al., **2012**). Allison (Nov 9, **2012**) affirms that it depends on the percentage of missing values: with 10% to 30% missing, 20 imputations would be recommended, and 40 imputations for 50% missing. More precisely, the number of imputations should be roughly similar to the percentage of missing values; say, for 27% missing, 30 imputations would be reasonable. I don’t have the energy to run the analysis 20 or 40 times for each racial group, so I will limit my imputations to 5 data sets. Thus, in total, the SEM mediation analysis is conducted 18 times (original data + 5 imputed data sets, for each of the 3 racial groups) for the ASVAB.

A caveat is that some variables are restricted to a particular group of subjects. For example, a variable “age at pregnancy” should be available only for women and must be missing for all men; a variable “number of cigarettes per day” is available only to people who answered “yes” to the question “do you currently smoke”. Imputation could fill in data for men and non-smokers, which makes no sense. A solution is so-called “conditional imputation”, which allows us to impute variables defined within a particular subset of the data; outside this subset, the variables are held constant. See **here**.

Still another problem arises when we use categorical variables in the imputation model. Say a variable can only take 5 values, 1 through 5, like a Likert-type scale (e.g., a personality questionnaire); it cannot “legally” have a value of, say, 1.6, 3.3, or 4.8, and the linear regression method of imputation may not be appropriate in that case. Allison (**2002**, p. 56) recommends rounding the imputed values: if we impute a binary variable coded 0 and 1, imputed values above 0.5 can be rounded to 1, and those below 0.5 rounded to 0. Sometimes the values fall outside this range (below 0, above 1), but rounding is still possible. With continuous variables, there is no such problem, and a categorical variable having many values, say education or grade level going from 1 to 20, can be (and usually is) treated as continuous. On the other hand, hot-deck and PMM imputation can easily deal with categorical variables. If, instead of PMM, we use the SPSS default “Linear Regression” method, we will end up with illegal values everywhere, whereas PMM gives only actual values present in the data set: if your binary variable has only 0 and 1, PMM gives either 0 or 1.
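Allison's rounding rule for imputed binary variables is trivial to apply; a sketch with made-up imputed values, including out-of-range ones:

```python
import numpy as np

# Hypothetical imputed values for a 0/1 variable from a linear-regression imputer
imputed = np.array([0.3, 0.7, 1.2, -0.1])

# Rounding rule: above 0.5 -> 1, otherwise -> 0 (out-of-range values handled too)
rounded = np.where(imputed > 0.5, 1, 0)
print(rounded.tolist())  # [0, 1, 1, 0]
```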

It is also recommended to transform non-normal variables before imputation: Lee & Carlin (**2009**) note that a symmetric distribution in the data avoids potential biases in the imputation process.

An important question that is not generally treated is the possible between-group difference in the correlations of the variables of interest. Some variables can be strongly related in a given racial/sex/cohort group but less so in another. Because the aim of the present article is to compare the SEM mediation across racial groups, and given the possibility of racial differences in the correlations, it is probably safer to impute the data for each racial group separately. This is easily done through the FILTER function, where we specify the value of the race variable to be retained. This method ensures that the correlations won’t be distorted by any race or group-interaction effects, even if obvious drawbacks emerge when working with small samples. Another possibility is to create an interaction variable between the two variables in question and include it in the imputation model (Graham, **2009**, p. 562), but this would be of no interest if the interaction between the two relevant variables is not meaningful.

When we compute estimates from the different imputed data sets, we can (and should) average them. However, we should not average t-, z-, or F-statistics, or the chi-square (Allison, **2009**, p. 84). For pooling standard errors, we have to apply Rubin’s (1987) formula, because simple averaging fails to take into account the between-imputation variability, so those statistics (standard errors, p-values, confidence intervals, …) would be somewhat downwardly biased. For this analysis, I have applied Rubin’s formula for pooling the SE, CI, and C.R. See the attachment at the end of the post.
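Rubin's (1987) pooling rules can be sketched as follows: the pooled estimate is the mean of the per-imputation estimates, and the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance (the input values below are made up):

```python
import math

def pool_rubin(estimates, std_errors):
    """Pool point estimates and standard errors across m imputed datasets
    using Rubin's (1987) rules."""
    m = len(estimates)
    qbar = sum(estimates) / m                              # pooled estimate
    w = sum(se ** 2 for se in std_errors) / m              # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = w + (1 + 1 / m) * b                            # total variance
    return qbar, math.sqrt(total)

# Hypothetical estimates and SEs from m = 3 imputed datasets
est, se = pool_rubin([1.0, 1.2, 0.8], [0.2, 0.2, 0.2])
print(round(est, 3), round(se, 3))  # 1.0 0.306
```

Note that the pooled SE (0.306) exceeds the naive average of the SEs (0.2), reflecting the extra uncertainty carried by the between-imputation spread.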

**a) Multiple imputation on SPSS.**

We go to the “Multiple Imputation” option, then “Analyze Patterns”, and include the variables to be used in the SEM analysis. We get a chart that reveals the patterns of missing values in the selected variables. My window looks like this.

Initially, the minimum % of missing values is 10, but I chose to lower this value to 0.01. All of the variables to be included in the CFA-SEM models should be selected; grouping variables like “race” and “gender” are not needed here.

The chart we are interested in is the “Missing Value Patterns”. As we can see, there are clumps or islands of missing and non-missing cells, which means that the missingness displays monotonicity (see **IBM SPSS Missing Values 20**, pp. 48-49). In a monotone pattern, individuals with a missing value on a given variable also have missing values on all subsequent variables; conversely, when a variable is observed for an individual, all previous variables are also observed for that individual. Say we have 6 variables: var1 has no missing values, 10% of people are missing on var2 and on all subsequent variables, an additional 10% are missing on var3 and on the subsequent variables, an additional 10% are missing on var4 and thus on var5 and var6, and so on. The opposite is an arbitrary missing pattern, in which no reordering of the variables can produce a monotone pattern. To be sure, here’s an overview of missing data patterns, from Huisman (**2008**) :
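The distinction is easy to operationalize. A small Python sketch (hypothetical data; `None` marks a missing cell), assuming the variables are already in the intended order:

```python
def is_monotone(rows):
    """A pattern is monotone if, within each row (case), once a
    missing cell appears, every later cell is also missing."""
    for row in rows:
        seen_missing = False
        for cell in row:
            if cell is None:
                seen_missing = True
            elif seen_missing:
                return False   # observed value after a missing one
    return True

# the 6-variable example from the text, shortened to 3 variables
monotone = [[1, 2, 3], [1, 2, None], [1, None, None]]
arbitrary = [[1, 2, 3], [1, None, 3], [None, 2, None]]
```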

Compare with what I get, below. At the bottom of this picture, there seems to be a tendency for monotonicity. But from a global perspective, the justification for a monotone imputation method is not clear to me.

SPSS offers the Fully Conditional Specification (FCS) method, also known as chained equations (MICE) or sequential regression (SRMI), which fits a univariate (single dependent variable) model for each incomplete variable, using all other available variables as predictors, and then imputes the missing values of that variable. FCS is very flexible and does not rely on the assumption of multivariate normality: it is a variable-by-variable imputation that specifies a model appropriate to each variable, i.e., linear regression for continuous variables, logistic regression for binary variables, ordinal logistic regression for ordinal variables. In SPSS, FCS employs a Markov Chain Monte Carlo (MCMC) method (for further reading, see Huisman, **2011**), more precisely the **Gibbs Sampler**. According to van Buuren et al. (**2006**), this method, and FCS in general, is quite robust even when the MAR assumption is violated, although, as the authors note, their study needs replication. The MCMC uses simulation from a Bayesian predictive distribution for normal data. Allison (**2009**) describes it as follows : “After generating predicted values based on the linear regressions, random draws are made from the (simulated) error distribution for each regression equation. These random ‘errors’ are added to the predicted values for each individual to produce the imputed values. The addition of this random variation compensates for the downward bias in variance estimates that usually results from deterministic imputation methods.” (p. 82).
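To make the variable-by-variable logic concrete, here is a deliberately tiny two-variable chained-equations sketch in Python. This is not SPSS’s implementation: it is plain linear regression with a stochastic residual draw, `None` marking missing values, and made-up data:

```python
import random

def ols(xs, ys):
    """Least-squares intercept, slope and residual SD for y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    sd = (sum((yv - (a + b * xv)) ** 2 for xv, yv in zip(xs, ys)) / (n - 2)) ** 0.5
    return a, b, sd

def fcs_impute(x, y, iters=10, seed=1):
    """Cycle through the variables, regressing each on the other and
    replacing its missing values with predicted value + random error
    (stochastic regression), as in chained equations."""
    rng = random.Random(seed)
    x, y = list(x), list(y)
    miss_x = [i for i, v in enumerate(x) if v is None]
    miss_y = [i for i, v in enumerate(y) if v is None]
    for var, miss in ((x, miss_x), (y, miss_y)):
        obs = [v for v in var if v is not None]
        for i in miss:
            var[i] = sum(obs) / len(obs)     # crude start: mean imputation
    for _ in range(iters):
        a, b, sd = ols(x, y)                 # model for y given x
        for i in miss_y:
            y[i] = a + b * x[i] + rng.gauss(0, sd)
        a, b, sd = ols(y, x)                 # model for x given y
        for i in miss_x:
            x[i] = a + b * y[i] + rng.gauss(0, sd)
    return x, y

x = [1, 2, 3, 4, 5, None, 7, 8, 9, 10]
y = [2.1, 4.0, None, 8.2, 9.9, 12.1, None, 16.3, 18.0, 20.2]
xi, yi = fcs_impute(x, y)
```

Observed values are never touched; only the `None` cells are filled, and each run of the cycle redraws them with fresh random errors.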

FCS differs from the multivariate normal imputation method (MVN or MVNI), also known as joint modeling (JM), in which all variables in the imputation model are assumed to jointly follow a multivariate normal distribution. MVN uses a common model for all variables (multivariate normal for continuous variables, multinomial/loglinear for categorical variables, general location for a mixture of the two), but it encounters difficulty when the included variables have different scale types (continuous, binary, …), whereas FCS can accommodate them. Although MVN is more restrictive than FCS (e.g., the inclusion of binary and categorical variables makes the normality assumption even less plausible), Lee & Carlin (**2009**) found both methods perform equally well even though their imputation model included binary and categorical variables. Unfortunately, van Buuren (**2007**) found evidence of bias in the JM approach under MVN when normality does not hold. It is not at all clear under which conditions MVN is robust.

In the present data, I noticed that selecting the monotone method instead of MCMC does not work: SPSS returns an error message such as “the missing value pattern in the data is not monotone in the specified variable order”. In this way, we can easily determine which method we really need. See Starkweather (**2012**) and Howell (**2012**) for the procedure in SPSS.

When using MCMC, **Starkweather** and **Ludwig-Mayerhofer** have suggested increasing the maximum number of iterations from the default value of 10 to 100, so as to increase the likelihood of attaining convergence (when the MCMC chain reaches stability, meaning the estimates no longer fluctuate by more than some arbitrarily small amount). That means 100 iterations will be run for each imputation. Because the default value is too low, I wouldn’t recommend the “auto” option. Allison (**2002**, p. 53) stated, however, that the marginal return is likely to be too small to be of concern in most applications.

Then, we are given two model types for scale variables: Linear Regression (the default) or Predictive Mean Matching. PMM still uses regression, but the predicted value for a missing case is matched to the closest predicted values among the observed cases, and one of their actual, existing values is imputed. This helps ensure that the imputed values are reasonable (within the range of the original data). Given the apparent advantage of PMM (such as avoiding extreme outliers in imputed values) over the default option, I select PMM with my MCMC imputation method.
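The matching step can be sketched in a few lines of Python. The predicted values below are hypothetical, and this is only one of several PMM flavours (matching on predicted values of the observed cases, with a random draw among the k nearest donors):

```python
import random

def pmm(pred_missing, pred_obs, y_obs, k=3, seed=0):
    """For each missing case, find the k observed cases whose
    predicted values are closest to the missing case's predicted
    value, then impute one of their actually observed y values."""
    rng = random.Random(seed)
    imputed = []
    for p in pred_missing:
        donors = sorted(zip(pred_obs, y_obs), key=lambda t: abs(t[0] - p))[:k]
        imputed.append(rng.choice(donors)[1])
    return imputed

# hypothetical regression predictions and observed outcomes
pred_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [1.1, 2.3, 2.9, 4.2, 5.1]
filled = pmm([2.1, 4.9], pred_obs, y_obs, k=1)
```

Because the imputations are drawn from `y_obs` itself, they can never fall outside the range of the observed data, which is the property the text describes.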

The next step is to go to “Impute Missing Data Values” in order to create the imputed dataset. All the variables in the model have to be selected. If the missing pattern is arbitrary, we go to Method, Custom, and click on “Fully conditional specification (MCMC)”, preferably with maximum iterations at 100. Nonetheless, MCMC works well for both monotone and arbitrary patterns. If we don’t know the missing pattern for sure, we can let SPSS decide: it will choose the monotone method for a monotone pattern and the MCMC method otherwise (in which case the default number of iterations would be too low). Concretely, SPSS (like AMOS) will create a new dataset (with a new dataset name). The entire procedure is explained in this **webpage** and this **video**. See also **IBM SPSS Missing Values 20** (pp. 17-23).

When using the FCS method, we must also request a dataset for the iteration history. Afterwards, we should plot the mean and standard deviation of the imputed values against the iterations, for each variable separately, split by imputation number. See the manual **SPSS missing values 20** (pp. 64-67) for the procedure for FCS convergence charts. The purpose of this plot is to look for patterns in the lines. There should not be any; they should look suitably “random”.

One last thing: it is extremely important not to forget to change the measurement level of the variables. For example, mom_educ and dad_educ were initially configured as “nominal”, whereas imputation in SPSS only worked once these were declared as “scale”, not “nominal” or “ordinal”. We simply have to re-configure them in the SPSS data editor window. Otherwise, SPSS gives a message such as “The imputation model for [variable] contains more than 100 parameters. No missing values will be imputed.” See **here**.

In the data editor, at the upper right, we see a list of numbers under “Original Data” going from 1 to 5. They represent the imputation numbers (5 being the default in SPSS), because it is a multiple (not single) imputation. The data have been imputed 5 times, each time giving us different imputed values. If we had 50 cases (i.e., 50 rows), clicking on 1 would bring us to the 51st row and the 50 following rows, clicking on 2 would bring us to the 101st row and the 50 following rows, and so on. Finally, the yellow-shaded cells show the newly created values.

This new dataset is very special. If we perform an analysis using it, for example comparing mean GPA scores across gender groups, we will be given a table divided into 7 parts: first the mean value (and any other statistics requested) for the original data, then the mean for each of the five imputed datasets, and finally the mean pooled over the five imputed datasets. The pooled (i.e., aggregated) result is the only one we are interested in when performing analyses in SPSS, unless we want to apply Rubin’s formula ourselves. We can also use the SPLIT FILE function with the variable “Imputation_” in order to get separate results for each stacked dataset.
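The structure of this stacked file is easy to mimic. A small Python sketch with hypothetical GPA values, assuming imputation_ = 0 marks the original data and the pooled result averages over the imputed datasets only:

```python
def pooled_mean(rows):
    """rows = (imputation_, value) pairs from a stacked MI file.
    Returns the per-imputation means and the pooled mean, which
    averages over the imputed datasets (imputation_ >= 1) only."""
    by_imp = {}
    for imp, value in rows:
        by_imp.setdefault(imp, []).append(value)
    means = {imp: sum(v) / len(v) for imp, v in by_imp.items()}
    imputed = [m for imp, m in means.items() if imp >= 1]
    return means, sum(imputed) / len(imputed)

rows = [(0, 2.0), (0, 4.0),   # original data
        (1, 3.0), (1, 5.0),   # imputation 1
        (2, 3.0), (2, 3.0)]   # imputation 2
means, pooled = pooled_mean(rows)
```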

There is another method, called Expectation-Maximization (EM), which overcomes some of the limitations of mean and regression substitution, such as failing to preserve the relationships between variables; it also underestimates standard errors less, even if the underestimation is still present. EM proceeds in 2 steps. The E-step computes the expected values of the missing data, given the observed data and the current estimates of the parameters; the M-step re-estimates the parameters from the completed data. This two-step process iterates (default value = 25 in SPSS) until the changes in expected values from iteration to iteration become negligible. EM assumes a multivariate normal imputation model and tends to underestimate standard errors because the procedure lacks a random component. Unfortunately, in SPSS there is no way to create a stacked data file with the EM method; we can only create several separate files.
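The iteration is easy to illustrate for the simplest possible case, a single incomplete variable regressed on a complete one. This is a Python sketch of the idea with made-up data, not SPSS’s multivariate implementation:

```python
def ols(xs, ys):
    """Least-squares intercept and slope for y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def em_impute(x, y, iters=200, tol=1e-10):
    """E-step: replace each missing y with its expected value under
    the current regression of y on x. M-step: re-estimate the
    regression from the completed data. Iterate to convergence."""
    y = list(y)
    miss = [i for i, v in enumerate(y) if v is None]
    obs = [v for v in y if v is not None]
    for i in miss:
        y[i] = sum(obs) / len(obs)             # crude starting values
    for _ in range(iters):
        a, b = ols(x, y)                       # M-step
        new = [a + b * x[i] for i in miss]     # E-step
        delta = max(abs(nv - y[i]) for nv, i in zip(new, miss))
        for nv, i in zip(new, miss):
            y[i] = nv
        if delta < tol:
            break
    return y

x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, None, 8.0, None, 12.1]
completed = em_impute(x, y)
```

At convergence the imputed values lie exactly on the regression line fitted to the observed cases, which illustrates the point in the text: there is no random component, so the variability of the completed data is understated.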

To perform EM, go to “Missing Value Analysis”, put the ID variable in Case Labels, put the variables of interest in Quantitative or Categorical Variables depending on their nature (e.g., scale or nominal), and select the “EM” box. There is also an EM blue button to look at, where we click on “save completed data” and name the new dataset. When we click on OK, the output provides several tables, notably “EM means”, “EM covariances” and “EM correlations”, with an associated p-value from a χ² statistic. This is Little’s Missing Completely at Random (MCAR) test. Because the null hypothesis is that the data are missing completely at random, a p-value less than 0.05 indicates a violation of MCAR. As always, the χ² is sensitive to sample size, and with large N we are likely to get a p-value easily below 0.05, due to the high power of detecting even small deviations from the null hypothesis; when I performed it, the test appeared completely worthless.

Another method, not initially included in the SPSS package, is the hot deck. This is the best-known approach to nonparametric imputation (although it does not work well with continuous/quantitative variables), which means it avoids the assumption of data normality. The hot deck procedure sorts the rows (i.e., respondents) of a data file within a set of variables, called the “deck” (or adjustment cells). It replaces the missing values of one or more variables for a nonrespondent (the donee) with observed values from a respondent (the donor) who is similar to the nonrespondent on characteristics observed for both cases. For example, both have values 6, 5, 4, 4 on variables x1, x2, x3, x4, but the donor has a value of 2 on x5 while the donee has no value on x5, so the hot deck assigns the donee a value of 2 as well. In some versions, the donor is selected randomly from a set of potential donors; in others, a single donor is identified and values are imputed from that case, usually the nearest-neighbor hot deck. Among the benefits of the hot deck, the imputations will never be outside the range of possible values. The method performs less well as the ratio of (imputation) variables to sample size becomes larger, and should probably be avoided in small datasets, because it is difficult to find well-matched donors there; it also omits random variation from the imputation process. Even when multiple imputation is performed, the variance in the pooling process is still underestimated, although other procedures (e.g., the Bayesian Bootstrap) can overcome this problem (Andridge & Little, **2010**). Myers (**2011**) gives the “SPSS Hot deck macro” for creating this command in SPSS. Just copy-paste it as it is. See also figure 3 for an illustration of the procedure.
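The donor-matching logic is simple enough to sketch in Python. The records below are hypothetical, mirroring the x1…x5 example above, with exact matching on the deck variables and a random draw when several donors qualify:

```python
import random

def hot_deck(records, donee, deck_vars, target_var, seed=0):
    """Impute target_var for the donee by copying the observed value
    of a randomly chosen donor that matches the donee exactly on
    every deck variable."""
    rng = random.Random(seed)
    donors = [r for r in records
              if r[target_var] is not None
              and all(r[v] == donee[v] for v in deck_vars)]
    if not donors:
        return None                     # no matching donor found
    return rng.choice(donors)[target_var]

records = [
    {"x1": 6, "x2": 5, "x3": 4, "x4": 4, "x5": 2},   # matching donor
    {"x1": 6, "x2": 5, "x3": 4, "x4": 4, "x5": 2},   # another matching donor
    {"x1": 1, "x2": 2, "x3": 1, "x4": 3, "x5": 9},   # does not match
]
donee = {"x1": 6, "x2": 5, "x3": 4, "x4": 4, "x5": None}
value = hot_deck(records, donee, ["x1", "x2", "x3", "x4"], "x5")
```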

The PMM (semi-parametric) method available in SPSS resembles the hot deck in many respects, but because it uses regression to fill in the data, it assumes distributional normality; in exchange, it works better for continuous variables. It still adds a random component to the imputation process, and preserves the data distribution and the multivariate relationships between variables. Andridge & Little (**2010**) discuss the advantages of PMM over hot decks.

Like the hot deck, PMM is recommended for monotone missing data and works less well with an arbitrary missing pattern, which is probably what we have presently. It is possible that the method I used is not optimal, in which case I should have selected MCMC with the linear regression method rather than PMM. Nonetheless, looking at the bivariate correlations among all the variables used, the pooled average does not diverge much from the original data. Most often, the cells differ very little, with correlations differing by between ±0.000 and ±0.015. In some cases, the difference was ±0.020 or ±0.030; exceptionally, some extreme cases were detected, with differences of about ±0.050. Even my SEM analyses using MI look similar to those obtained with ML (see below). Moreover, linear regression and PMM produce the same results (see XLS). So, generally, it seems that PMM does not perform badly in preserving the correlations.

**b) Multiple imputation on AMOS.**

In AMOS Graphics, one needs to select the variables we want to impute. Unlike SPSS, AMOS can impute latent variables: all it requires is that we construct the measurement model with the observed and latent variables.

In AMOS (see the **User’s Guide**, ch. 30, pp. 461-468), three options are proposed : 1) regression imputation, 2) stochastic (i.e., random) regression imputation (also used in SPSS), 3) Bayesian imputation. The first (1) fits the model using ML, sets the model parameters equal to their maximum likelihood estimates, and uses linear regression to predict the unobserved values for each case as a linear combination of the observed values for that same case; the predicted values are then plugged in for the missing values. The second (2) imputes values for each case by drawing randomly from the conditional distribution of the missing values given the observed values, with the unknown model parameters set equal to their maximum likelihood estimates. Because of the random element in stochastic regression imputation, repeating the imputation process will produce a different dataset each time. Thus, in standard regression imputation the operation looks like X̂_i = b_0 + b_1·Z_i, while in the stochastic version we have X̂_i = b_0 + b_1·Z_i + u_i, where u_i is the random variation. The addition of the random error avoids the underestimation of standard errors. The third (3) resembles the stochastic option except that it treats the model parameters as random variables rather than single estimates of unknown constants.
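The difference between the two regression rules can be sketched in Python. The Z/X data below are hypothetical; the deterministic rule returns b_0 + b_1·Z, the stochastic rule adds a residual draw:

```python
import random

def regression_imputations(z, x, seed=42):
    """Fit x on z from the complete pairs, then impute each missing x
    both deterministically (b0 + b1*z) and stochastically
    (b0 + b1*z + u, with u drawn from the residual distribution)."""
    rng = random.Random(seed)
    pairs = [(zi, xi) for zi, xi in zip(z, x) if xi is not None]
    n = len(pairs)
    mz = sum(zi for zi, _ in pairs) / n
    mx = sum(xi for _, xi in pairs) / n
    b1 = sum((zi - mz) * (xi - mx) for zi, xi in pairs) / \
         sum((zi - mz) ** 2 for zi, _ in pairs)
    b0 = mx - b1 * mz
    sd = (sum((xi - (b0 + b1 * zi)) ** 2 for zi, xi in pairs) / (n - 2)) ** 0.5
    missing_z = [zi for zi, xi in zip(z, x) if xi is None]
    deterministic = [b0 + b1 * zi for zi in missing_z]
    stochastic = [b0 + b1 * zi + rng.gauss(0, sd) for zi in missing_z]
    return deterministic, stochastic

z = [1, 2, 3, 4, 5]
x = [2.1, 3.9, None, 8.2, 9.8]
det, stoc = regression_imputations(z, x)
```

Here `det` always lands exactly on the regression line, while `stoc` scatters around it; repeated runs with different seeds reproduce the “different dataset each time” behaviour described above.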

The first option (1) is a single imputation and should not be used. Options (2) and (3) offer multiple imputation and can be configured using the box “number of completed datasets”, which has a default value of 5; it designates the number of imputed datasets we want to generate before averaging them. Make sure the box “single output file” is checked, because checking “multiple output files” will create 5 different datasets, one for each completed imputation. With a single file, the datasets are stacked together with a variable “imputation_”: say we have data for 50 subjects; then rows 1-50 belong to value 0 (the original data), rows 51-100 to imputation 1, rows 101-150 to imputation 2, and so on. While multiple files create 5 files of 50 rows each, the single file stacks everything into one file.

Byrne (**2010**, p. 357), while arguing that the mean imputation method is flawed, notes that regression imputation also has its limitations. Nonetheless, stochastic regression imputation is still superior to regression imputation, and seems to be the best option available in AMOS. But it is probably not recommended to conduct imputation in AMOS when SPSS offers more (and more flexible) options.

Now, we can look at the newly created data, which only contain the variables we need for our CFA-SEM analysis. Obviously, they lack the grouping variables (e.g., “race” in the present case). We can simply copy-paste the variable’s column from the original dataset into this newly created dataset, or merge the datasets (->Data, ->Merge Files, ->Add Variables, move the “excluded” variables into the “key variables” box, click on ‘Match cases on key variables in sorted files’ and ‘Both files provide cases’ and OK). When this is done, save the file in a folder before closing the window. There is no need to be concerned about the possibility that the ID numbers (i.e., rows) could have been rearranged in the process of imputation, that is, that the IDs are no longer ordered in the same way. In fact, SPSS/AMOS keeps the rows in their original order, so copy-pasting does not pose any problem.

**2.2.4. Distribution normality and Data cleaning.**

**a) Checking variables’ normality.**

Violation of data normality, whether univariate or multivariate, is a threat even in SEM. It can cause model misfit, although not systematically so (Hayduk et al., **2007**, p. 847). More specifically, skewness affects tests of means whereas kurtosis affects tests of variances and covariances (Byrne, **2010**, p. 103). We can check univariate normality by looking at skewness and kurtosis in SPSS. Both skewness and kurtosis should be divided by their associated standard errors, and the resulting (positive or negative) ratio should not exceed 1.96 in absolute value in small samples, or 2.58 in large samples (200 or more); in very large samples this test must be ignored (Field, **2009**, pp. 137-139). Indeed, as the sample size increases, the standard error decreases. For example, the distribution of GPA_overall was perfectly normal in the black sample (N=1132, in the original dataset) judging from the histogram, yet skewness and its standard error had values of -0.279 and 0.073, in other words a ratio of -3.82, which clearly exceeds an absolute value of 2.58. In any case, a rule of thumb is always arbitrary and, with samples generally larger than 1000, we do not use this test.
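The sample-size effect on this ratio is easy to demonstrate. Here is a Python sketch using the large-sample approximation SE ≈ √(6/n) for the standard error of skewness (SPSS uses a slightly more exact formula, but the behaviour is the same), with made-up data:

```python
def skew_ratio(values):
    """Skewness divided by its approximate standard error sqrt(6/n).
    |ratio| > 1.96 (small samples) or 2.58 (large samples) flags
    non-normality, but the SE shrinks as n grows."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    skew = m3 / m2 ** 1.5
    return skew / (6.0 / n) ** 0.5

data = [1, 2, 2, 3, 3, 3, 4, 10]   # mildly right-skewed toy data
small = skew_ratio(data)           # n = 8
large = skew_ratio(data * 50)      # identical shape, n = 400
```

Replicating the data 50 times leaves the skewness itself unchanged but multiplies the ratio by √50, which is exactly why the test flags trivial departures in large samples.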

Now, the concept of multivariate normality is somewhat different. A univariate outlier is an extreme score on a single variable, whereas a multivariate outlier is an extreme score on two or more variables jointly. Byrne (**2010**, p. 103) informs us that univariate normality is a necessary but not sufficient condition for multivariate normality. In AMOS, we can easily display the “tests for normality and outliers” in the “Analysis Properties” box. Unfortunately, AMOS refuses to perform this test when we have missing values, a condition that applies to virtually all (survey) datasets. On his **website**, DeCarlo gives an SPSS macro for the multivariate tests of skew and kurtosis, but I don’t know how it works.

Fortunately, it is still possible to deal with missing data in AMOS by using the multiple imputation method, which consists in replacing missing values with newly created values based on information from the existing data. Afterwards, we can assess multivariate normality.

We are given the univariate skew and kurtosis values (with their C.R.) for each observed variable. An absolute univariate kurtosis value greater than 7.0 is indicative of an early departure from normality (Byrne, **2010**, pp. 103-104); a value of 10 is problematic, and beyond 20 it becomes extreme (Weston & Gore, **2006**, p. 735). An absolute univariate skew value should be no greater than 2 (Newsom, **2012**), and a value larger than 3 is extreme. The C.R. (skewness or kurtosis divided by its standard error) should not be trusted, because the standard error becomes too small in large samples; at best we can compare the C.R. among variables, though it is more meaningful to simply look at the univariate kurtosis value itself. Concerning the multivariate kurtosis critical ratio (C.R.), also called Mardia’s normalized coefficient because it is distributed as a z-test, it should not be greater than 5.0. Concerning the multivariate kurtosis itself, or simply Mardia’s coefficient of multivariate kurtosis, Bollen (**1989**) proposes that if the kurtosis value is lower than p(p+2), where p is the number of observed variables, then there is multivariate normality.

The AMOS test for normality also gives us the Mahalanobis d² values (i.e., each observation’s distance from the centroid) for each individual case. The values are listed from the highest d² down to the lowest. A value that lies far from the next value(s) is likely to be a strong outlier at either end. An illustration worth a thousand words is given in Byrne (**2010**, p. 341). As always, the p-values associated with the d² values are of no use: with large sample sizes, they will always be 0.000. Gao et al. (**2008**) mention the possibility of deleting outliers to help achieve multivariate normality, but note that deleting a large number of cases would hurt generalizability.
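For two variables, the computation behind those d² values fits in a few lines of Python (toy data; the last case is deliberately an obvious multivariate outlier):

```python
def mahalanobis_d2(data):
    """d-squared of each (x, y) case from the centroid, scaled by the
    inverse of the 2x2 covariance matrix; returned sorted from the
    highest d2 down, as in the AMOS listing."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy * sxy
    out = []
    for i, (x, y) in enumerate(data):
        dx, dy = x - mx, y - my
        # quadratic form [dx dy] S^-1 [dx dy]'
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        out.append((d2, i))
    return sorted(out, reverse=True)

data = [(1, 1), (2, 2), (3, 3), (4, 4), (10, -5)]
ranked = mahalanobis_d2(data)
```

The last case sits on top of the list with a d² well clear of the next value, which is exactly the “large gap” signature of a strong outlier described above.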

When AMOS lists the number associated with an individual case, this number does not refer to the ID value in the survey data, but to the case ordering from lowest ID to highest. The column in blue, below :

The mismatch between ID and case ordering is probably due to the syntax I used for creating the imputed data (filtering by the race variable). So, for example, the case value (outlier) 1173 flagged by the Mahalanobis distance in my imputed dataset refers in reality to the number in the blue column, not to 7781 under the variable column R0000100 that represents the ID number in NLSY97. Because of this, we need to be careful: delete outliers starting from the highest case value down to the lowest; otherwise the case ordering would be rearranged as we delete.

For the univariate test in SPSS, I have used a complete EXPLORE analysis (in the ‘descriptive statistics’ option) for all the variables. It shows skewness and kurtosis values with their respective standard errors, and as noted earlier, this test can be ignored. It is probably more meaningful to look at the normal Q-Q plot (violation of normality is evidenced by the dots deviating from the reference line) and the detrended normal Q-Q plot (which has the same purpose but lets us check the pattern from another angle). Finally, EXPLORE will display boxplots flagging individual cases that could be outliers. Also displayed as a test of normality is the Kolmogorov-Smirnov, but this test is flawed and should probably never be used (Erceg-Hurn & Mirosevich, **2008**, p. 594).

Regarding the ASVAB subtests, normality holds, although AO departs (not too seriously) from normality. Generally, when the dots depart from the reference line, they do so at the extremities (lower or upper) of the graph. The same applies to Mom_Educ, Dad_Educ and SQRT_parental_income, with no serious violation of normality. Judging from the boxplots, there were no strong outliers in any of these variables, except Mom_Educ and Dad_Educ (in the black sample only), because there were too many people with exactly 12 years of education and few people with less or more, which produced a histogram with an impressive peak in the middle. In any case, this general pattern is quite similar across the black, hispanic and white samples. We can conclude that there is no obvious deviation from normality according to these tests.

Before leaving this topic, always keep in mind that the EXPLORE analysis must be performed with the option “exclude cases pairwise”; otherwise a large number of cases could be dropped from all of the selected variables in the process.

**b) Choosing SEM methods.**

Parameters (i.e., error terms, regression weights or factor loadings, structural (path) coefficients, and variances and covariances of independent variables) are estimated with the maximum likelihood (ML) method, which maximizes the probability that the obtained values of the dependent variables are correctly predicted. For a thorough description, go **here**. ML is chosen because it is known to yield the most accurate results. Other methods include Generalized Least Squares (GLS) and Unweighted Least Squares (ULS; requires that all observed variables have the same scale), which minimize the squared deviations between the values of the criterion variables and those predicted by the model. Both ML and GLS assume (multivariate) normality with continuous variables, whereas other methods, like scale-free least squares and asymptotically distribution-free (ADF), do not assume normality of the data. ML estimates are not seriously biased when multivariate normality is violated (Mueller & Hancock, **2007**, pp. 504-505), if the sample size is large enough, or “(if the proper covariance or correlation matrix is analyzed, that is, Pearson for continuous variables, polychoric, or polyserial correlation when categorical variable is involved) but their estimated standard errors will likely be underestimated and the model chi-square statistic will be inflated.” (Lei & Wu, **2007**, p. 43).

Here we use ML, and then Multiple Imputation (MI) for comparison purposes, because we have incomplete data (the total N of subjects having SES parental income and education is much lower than the total N having ASVAB) and because AMOS does not allow the other methods to run when we have missing values. The ML method, which produces estimates for the parameters of incomplete variables based on the information in the observed data, yields unbiased estimates under the assumption that the data are missing at random (MAR; missingness correlated with observed scores but not with missing scores) or missing completely at random (MCAR; missingness correlated with neither observed nor missing scores). Fortunately, it seems that ML tends to reduce bias even when the MAR condition is not fully satisfied (Byrne, **2010**, pp. 358-359); it is in fact the least biased method when MAR is violated. Note that MAR and MCAR are usually called ignorable missingness, whereas MNAR is called non-ignorable. In AMOS, clicking on “estimate means and intercepts” in “Analysis Properties” gives us full information maximum likelihood, or FIML.

**2.2.5. Bootstrapping in structural equation modeling.**

As far as I know, this technique is rarely used in research papers. The bootstrap consists in creating a sampling distribution that allows us to estimate standard errors and build confidence intervals without the assumption of multivariate normality (note : a path whose confidence interval includes zero cannot be declared significantly different from zero). To be clear, imagine our original sample has n=500 cases. We use the bootstrap to create, say, 1000 samples of n=500 cases each by randomly selecting cases with replacement from the original sample. In these bootstrap samples of the same size, some individuals are selected several times, others less often or not at all; this causes each sample to depart randomly from the original sample. In a normal sample (given n=10) we may have X_mean = (x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10) / 10, whereas in any single bootstrap resample we may have X_mean = (x1 + x2 + x3 + x3 + x3 + x3 + x7 + x8 + x8 + x10) / 10, or any other combination. In the former, each case is drawn exactly once; in the latter, a case can be drawn several times. Given 1000 bootstrap samples, we take the average of the statistic across the 1000 samples.
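The resampling scheme itself takes only a few lines. Here is a Python sketch for the mean of a toy sample; the same logic applies to any SEM parameter:

```python
import random

def bootstrap_mean(sample, n_boot=1000, seed=7):
    """Draw n_boot resamples of the same size with replacement,
    compute the mean in each, and report the average of those means
    together with their spread (the bootstrap standard error)."""
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_boot):
        resample = [rng.choice(sample) for _ in range(n)]  # with replacement
        means.append(sum(resample) / n)
    grand = sum(means) / n_boot
    se = (sum((m - grand) ** 2 for m in means) / (n_boot - 1)) ** 0.5
    return grand, se

data = [4, 8, 15, 16, 23, 42]
boot_mean, boot_se = bootstrap_mean(data)
```

Because `rng.choice` samples with replacement, some cases appear several times in a given resample and others not at all, exactly as described above; the standard deviation of the 1000 resample means serves as an empirical standard error.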

Besides the bootstrap approach, some statistics have been used to evaluate the significance of a mediation effect. The Sobel test is sometimes reported, but because it assumes normality, which is often violated, the test may not be accurate (Cheung & Lau, 2008). Another statistic is the Satorra-Bentler scaled Chi-Square, which controls for non-normality, “which adjusts downward the value of χ² from standard ML estimation by an amount that reflects the degree of kurtosis” (Kline, **2011**, p. 177), although Chi-Square-based statistics are known to be sensitive to N. AMOS does not provide the S-B χ², but the bootstrap is even more effective in dealing with non-normality. See **here**.

The bootstrap’s great advantage is that it is a non-parametric approach, meaning that it copes with non-normality of the data. It is more reliable with large samples and is generally not recommended with N<100. See Byrne (**2010**, ch. 12, pp. 329-352) for a detailed demonstration using AMOS. Note that Byrne informs us that bootstrap standard errors tend to be more biased than those from standard ML when the data are multivariate normal, but less biased when they are not.

There exist two types of bootstrapping: the naïve method, for obtaining parameter estimates, confidence intervals and standard errors, and the Bollen-Stine method, for obtaining the p-value of the model fit statistic. Concerning the p-value for model fit, Sharma & Kim (**2013**, pp. 2-3) note that bootstrap samples sometimes do not represent the population. Under the naïve approach, the bootstrap samples are drawn from a population (the observed sample) for which the null hypothesis (H_0) does not hold, regardless of whether H_0 holds in the unknown population from which the original sample was drawn. Hence the bootstrap values of the test statistic are likely to reject H_0 too often. The Bollen-Stine method circumvents this problem by transforming the data so that the null hypothesis holds exactly in the bootstrap population.

While Bollen-Stine provides correct p-values for the χ² statistic to assess overall model fit, remember that the Bollen-Stine p-value is still affected by larger sample sizes, so we don’t need to take this χ² too seriously. When we use AMOS to perform the B-S bootstrap, we are provided with an output like this :

The model fit better in 1000 bootstrap samples.

It fit about equally well in 0 bootstrap samples.

It fit worse or failed to fit in 0 bootstrap samples.

Testing the null hypothesis that the model is correct, Bollen-Stine bootstrap p = .001

This Bollen-Stine p-value is calculated from how many times the model chi-square for the bootstrap samples is higher (i.e., “fit worse”) than the chi-square for the observed data, or (0+1)/1000=0.001 if we use 1000 samples (see **here**). A non-significant value (>0.05) means good model fit. As always, the p-value is highly sensitive to sample size, so this number in itself is not very useful; we just need to focus on the naïve method.
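The arithmetic can be sketched in Python. Note that conventions differ on whether the denominator is B or B+1, which makes no practical difference here; the χ² values below are hypothetical:

```python
def bollen_stine_p(boot_chisq, observed_chisq):
    """Proportion of bootstrap samples whose chi-square exceeds the
    observed chi-square, with a +1 correction so that p is never
    exactly zero (some sources divide by B rather than B + 1)."""
    worse = sum(c > observed_chisq for c in boot_chisq)
    return (worse + 1) / (len(boot_chisq) + 1)

# the AMOS output above: 1000 samples, none of which fit worse
p = bollen_stine_p([10.0] * 1000, 30.0)
```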

AMOS provides us with bias-corrected confidence intervals; this approach is more accurate for parameters whose estimated sampling distributions are asymmetric. The “bias-corrected CI” and “Bollen-Stine bootstrap” options can both be selected, but AMOS won’t estimate both in the same run; just choose one of them. The main problem is that AMOS won’t run the bootstrap if we have missing values. Therefore, I apply multiple imputation and then bootstrap each of the 5 imputed datasets, which means running the analysis 5 times.

**2.2.6. Modification indices and correlated residuals.**

There is also the possibility of improving model fit through Modification Indices in AMOS. The program produces a list of all the covariances and paths that are likely to improve model fit; adding such a path or covariance will reduce the discrepancy between the model and the data. But generally, with survey data, we have missing values, and AMOS won’t compute modification indices even when the ML method is chosen. Therefore, we can use multiple imputation, or a form of listwise deletion: SPSS can create variables having no missing values, using the command DO IF (NOT MISSING(variable01)) AND (NOT MISSING(variable02)) … END IF. Suppose we want data for all of the 12 ASVAB subtests: we compute a new subtest variable on the condition that the case also contains data for all the remaining subtests. Alternatively, if we want to do SEM analyses using parental income, education, occupation and other demographic variables, we need to add all these variables to the DO IF (NOT MISSING(variable_x)) command. We do this for each variable, placing all the other variables we need under the “not missing” condition.

Whether researchers should allow correlated residuals in SEM models is debated. According to **Hermida et al.**, theoretical justification for the practice is generally not provided, and the decision to correlate residuals seems related to failure to attain sufficient model fit (e.g., RMSEA larger than 0.10 or CFI lower than 0.90). In the present case of the NLSY-ASVAB, Coyle (**2008**, **2011**, **2013**) has correlated measurement errors (residuals), even across different latent factors, arguing that it improves model fit. This is not surprising: making the model more complex by adding covariances or paths moves its fit toward that of a saturated model.

When adding parameters suggested by the modification indices, I indeed notice a non-trivial improvement in model fit, but unless there is a reasonable theoretical justification, I leave the residuals uncorrelated. Otherwise we risk ending up with good model fit but with latent variables that no longer accurately represent the theoretical constructs we were aiming to build. Moreover, even within a single latent factor, correlated residuals can suggest the presence of another (unmeasured) latent factor, and hence multidimensionality, because the key idea behind correlated errors is that the unique variances of the indicators share something in common beyond the latent factor they belong to. This alone can justify correlated errors. Nonetheless, if these correlations have no theoretical or practical significance, they should be ignored. Such secondary factor loadings (or error-term correlations) can be called secondary relationships; when they are excluded from the model, the excluded (non-zero) relationships of no theoretical interest constitute what is called parsimony error. See Cheung & Rensvold (2001, p. 239).

MacCallum et al. (**1992**) explain that most researchers do not offer interpretations for model modifications. They even cite Steiger (1990), who suspected that the percentage of researchers able to provide theoretical justification for freeing a parameter was near zero. They also point out the common problem of allowing covariance among error terms. Apparently, researchers tend to want it both ways: a well-fitting model, but not the responsibility of interpreting the changes made to achieve that fit.

Lei & Wu (**2007**) explain the need to cross-validate such results and, to illustrate, provide some tests based on model modification. As demonstrated by MacCallum et al. (**1992**), results obtained from model modifications tend to be inconsistent across repeated samples, with cross-validation results behaving erratically. Such an outcome is likely driven by the characteristics of the particular sample, so the risk of capitalizing on chance is apparent, and replication becomes very unlikely.

Another danger is ignoring the impact of correlated residuals on the interpretation of the latent factors measured by the set of observed variables whose errors are (or are not) correlated. Cole et al. (2007, pp. 383-387) discuss this issue and present an example with a latent variable named “child depression” measured by one (or two) child-report (C) measures and two (or one) parent-report (P) measures; that is, child depression is assessed via two different methods with unequal numbers of variables per method. Controlling for shared method variance changes the nature of the latent factor, as they wrote: “In Model 1a, the latent variable is child depression per se, as all covariation due to method has been statistically controlled. In Model 2a, self-report method covariation between C1 and C2 has not been controlled. The nature of the latent variable must change to account for this uncontrolled source of covariance. The larger loadings of C1 and C2 suggest that the latent variable might better be called self-reported child depression. In Model 2b, parent-report method covariation between P1 and P2 has not been controlled. Consequently, the loadings of P1 and P2 increase, and the latent variable transforms into a parent-reported child depression factor.” (p. 384). They conclude that “In reality, it is the construct that has changed, not the measures.” (p. 387). In other words, omitting to control for method covariance yields a latent factor biased toward the method contributing relatively more variables. In Cole et al. (2007, p. 386), their Figure 3 (Model 3a versus 3b) shows that the observed variables whose errors are correlated see their loadings on the factor diminished. This is also what I noticed in my ACT-g after correlating the residuals of ACT-english and ACT-reading.

To conclude, Cole et al. (2007) are not saying we should or we should not correlate the error terms, but that we must interpret the latent factor properly, according to the presence or absence of correlated errors.

Given this discussion, I allow the MOM_EDUC and DAD_EDUC residuals to be correlated. But giving more weight to family income in doing so is not necessarily the best move, because income is less reliable, varying much more over time, and education level does not perfectly reflect real-life earnings anyway. With regard to ACT-g, I also assume that the variance of ACT-eng and ACT-read not caused by the latent ACT factor (i.e., the residuals) is not unique to either of these variables, so their residuals have been correlated.

**2.2.7. Heywood anomaly.**

The so-called “Heywood case” manifests itself in out-of-range estimates: for example, a standardized value greater than +1 or lower than -1, or a negative error variance. Outside the SEM context, we can also detect Heywood cases when conducting factor analysis (e.g., maximum likelihood) if communalities exceed 1. In this situation, the solution must be interpreted with caution (SPSS usually gives a warning message). Factor loadings larger than 1 in such analyses can likewise be considered Heywood cases.

Negative (error) variances may occur with smaller samples, for example, because of random sampling variability when the population value is near zero (Newsom, **2012**). When the R² exceeds 1.00, the error variance is negative (the AMOS output “Notes for Model” tells us when it encounters improper solutions with, e.g., negative error variances). This could reflect overcorrection for unreliability, which leaves too little variance to be explained in the construct. In contrast, a zero error variance in an endogenous variable implies that the dependent variable is explained perfectly by the set of predictors. When standardized regression weights are out of bounds, that could be a sign that two variables behave as if they were identical. A very high correlation between factors may occur when many observed variables load on multiple factors (Bentler & Chou, **1987**, p. 106). Generally, improper solutions can result from outliers or influential cases, violations of regression assumptions (e.g., heteroscedasticity), or small sample sizes (Sharma et al., 2005, p. 937), combined (or not) with small factor loadings. See also Bentler & Chou (**1987**, pp. 104-105).
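
A minimal screening sketch for the two symptoms just listed — negative error variances and standardized loadings outside the -1 to +1 range (the parameter names and values below are hypothetical, not AMOS output):

```python
# Flag the classic Heywood-case symptoms in a set of estimates.
def heywood_flags(error_variances, std_loadings):
    flags = []
    for name, v in error_variances.items():
        if v < 0:
            flags.append(f"{name}: negative error variance")
    for name, l in std_loadings.items():
        if abs(l) > 1:
            flags.append(f"{name}: standardized loading out of range")
    return flags

print(heywood_flags({"e1": 0.20, "e2": -0.05},
                    {"GS": 0.84, "school": 1.07}))
# -> ['e2: negative error variance', 'school: standardized loading out of range']
```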

Lei & Wu (**2007**) stated the following: “The estimation of a model may fail to converge or the solutions provided may be improper. In the former case, SEM software programs generally stop the estimation process and issue an error message or warning. In the latter, parameter estimates are provided but they are not interpretable because some estimates are out of range (e.g., correlation greater than 1, negative variance). These problems may result if a model is ill specified (e.g., the model is not identified), the data are problematic (e.g., sample size is too small, variables are highly correlated, etc.), or both. Multicollinearity occurs when some variables are linearly dependent or strongly correlated (e.g., bivariate correlation > .85). It causes similar estimation problems in SEM as in multiple regression. Methods for detecting and solving multicollinearity problems established for multiple regression can also be applied in SEM.” (p. 36). See also Kenny (**2011**).
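
The simple bivariate screening rule quoted there (flag pairs with |r| > .85) can be sketched as follows (the correlation values and variable names are hypothetical):

```python
# Flag pairs of observed variables whose bivariate correlation exceeds the
# cutoff, given a correlation matrix and matching variable names.
def flag_collinear(corr, names, cutoff=0.85):
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i][j]) > cutoff:
                flagged.append((names[i], names[j], corr[i][j]))
    return flagged

corr = [[1.00, 0.90, 0.40],
        [0.90, 1.00, 0.35],
        [0.40, 0.35, 1.00]]
print(flag_collinear(corr, ["mk", "ar", "cs"]))  # -> [('mk', 'ar', 0.9)]
```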

If Heywood cases are caused by multicollinearity, the estimates are still unbiased, but the relative strengths of the independent variables become unreliable. In the standardized solution, variables are assigned a metric of 1, so regression paths should range from -1 to +1; but under multicollinearity the correlation between two collinear variables approaches unity, and if those variables are used to predict another variable, separate regression weights become difficult to compute and we may end up with out-of-range estimates. These paths would also have larger standard errors and covariances, and it is even possible for the variance of the dependent (endogenous) variable to become negative as a result.

Byrne (**2010**, pp. 187-192) shows an empirical example where some standardized path coefficients fall outside the -1 to +1 range. As Byrne argues, the very high factor correlation (0.96) between the latent variables in question could be the explanation. Byrne proposes two solutions: 1) delete one factor entirely, along with its observed variables; 2) instead of having two sets of observed variables loading on two separate factors, form a single new factor on which all these observed variables load.

Nonetheless, if measurement error attenuates correlations (e.g., among exogenous variables) and the latent variable approach of SEM corrects for it, it follows that SEM can make multicollinearity appear where it previously was not a problem (Grewal et al., **2004**).

Grewal et al. (**2004**) also stated that highly reliable variables can tolerate high multicollinearity, and that correction for attenuation (or measurement error) will not increase the correlations between latent constructs to unacceptably high levels. Fortunately, the measures used presently are highly reliable. Furthermore, they add: “It should be noted, however, that high levels of multicollinearity may be less apparent in SEM than regression. Because of the attenuation created by measurement error, the cross-construct correlations among indicators will be lower than the actual level of multicollinearity (i.e., correlations among the exogenous constructs).” (p. 526). More interesting is that the deleterious effect of multicollinearity is offset if the sample size is large and if the variance (R²) of the dependent variable explained by the independent variables is high. Remember that the latent variable approach normally increases R² while at the same time favoring multicollinearity between latent variables.

**2.2.8. Sampling weight.**

We can use the “customized weights” available on this **webpage**: survey years 1997, 1999, 2010, choosing “all” of the selected years for a truly longitudinal weight (choosing “all” and “any or all” produced the same numbers in several test regressions). There is no option in AMOS for including weight variables when estimating parameters in CFA or SEM models. I am aware of only two ways of dealing with this problem:

1) In SPSS, activate your weight, then activate your filter, by race. Then run the SPSS syntax displayed on this **webpage** with the relevant variables from your data set. The syntax command DATA LIST defines a raw data file by assigning names and formats to each variable in the file, and MCONVERT converts a correlation matrix into a covariance matrix or vice versa. Save the file that is created and use it as input for AMOS. The rows must contain means, SDs, variances, covariances, and N.

2) Include the weight variable as a covariate in the imputations with the variables for which missing values have to be imputed (SPSS -> “impute missing data values” -> “analysis weight”). Cases with negative or zero weight are excluded. See Carpenter (**2011**, **2012**), Andridge & Little (**2010**, **2011**) for further discussion.

I haven’t used any weights in the imputation process. That shouldn’t distort the results, given the great similarity of estimates in several multiple regressions I performed in SPSS with and without weights.

**3. Results.**

**3.1. CFA. Measurement model.**

**a) Procedure.**

Before moving to CFA, we need to perform some Exploratory Factor Analyses (EFA). This is simply done using the “dimension reduction” procedure with the “principal axis factoring” option, because PCA would not be optimal with a rotated solution, according to Costello & Osborne (**2005**, p. 2). These authors say that maximum likelihood and PAF generally give the best results, but ML is only recommended if we are certain that multivariate normality is respected, a condition that seems rarely met in real data. Given this, PAF has been selected.

Possible rotations include Varimax, Quartimax, Equamax, Promax, and Direct Oblimin. Field (**2009**, p. 644) says that Quartimax attempts to maximize the spread of factor loadings for a variable across all factors, whereas Varimax attempts to maximize the dispersion of loadings within factors; Equamax is a hybrid of the two. If we need oblique rotation, we can choose Oblimin or Promax. Oblimin finds a rotation of the initial factors that minimizes the cross-products of the factor loadings, driving many of them close to zero. Promax takes a solution previously rotated with an orthogonal method (Varimax, in fact) and adjusts the axes so that variables with small loadings on a dimension move toward zero. If the factors were independent, they would be uncorrelated, and orthogonal rotation would produce results similar to oblique rotation. Field recommends Varimax if we want more interpretable, independent clusters of factors.

But the factors in the ASVAB are assumed to be correlated, and an oblique rotation indeed reveals substantial correlations between the factors (in the Factor Correlation Matrix table). Thus I have opted for the oblique rotation Promax, often recommended as giving a better oblique simple structure (Coffey, **2006**), with SPSS’s default Kappa value of 4, which generally provides a good solution (see **here**).

Usually (at least in SPSS) the default “extraction” option for factor analysis retains factors with eigenvalues greater than 1.0. Costello & Osborne (**2005**, p. 3) argue this is the least accurate of the available methods. They also recommend caution about cross-loadings, in which case we might consider removing the “anomalous” variables. Any variable with a loading lower than 0.30, or a cross-loading larger than 0.30, can be excluded. Again, these cut-offs are arbitrary, but a coefficient of 0.30 for both is what researchers generally propose as exclusion criteria.
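
The 0.30 exclusion rule can be sketched as a small screening pass over a pattern matrix (the subtest names and loadings below are hypothetical, not my EFA output):

```python
# From a pattern matrix (item -> loadings on each factor), flag items with no
# salient loading (none >= .30 in absolute value) and items that cross-load
# (more than one loading >= .30).
def screen_items(pattern, cutoff=0.30):
    flagged = {}
    for item, loadings in pattern.items():
        salient = [l for l in loadings if abs(l) >= cutoff]
        if len(salient) == 0:
            flagged[item] = "no salient loading"
        elif len(salient) > 1:
            flagged[item] = "cross-loading"
    return flagged

pattern = {"WK": [0.75, 0.10], "MC": [0.45, 0.38], "NO": [0.22, 0.15]}
print(screen_items(pattern))
# -> {'MC': 'cross-loading', 'NO': 'no salient loading'}
```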

In the displayed SPSS output, we are provided with several tables, notably a pattern and a structure matrix. Because we want regression (not correlation) coefficients, we choose the pattern matrix (Field, **2009**, pp. 631, 666, 669). We then want to transfer it to AMOS. To ease the process, James Gaskin has created a special plugin: go to his **website**, click on “Amos EFA-CFA Plugin” in the left panel, click on the dll file, select “Unblock”, and finally move the dll file to the AMOS “plugins” folder with the other dlls. This **video** summarizes the procedure if my explanation wasn’t clear enough. When this is done, the “Plugins” menu in AMOS Graphics has a new function labeled “Pattern Matrix Model Builder”, which allows us to copy-paste the pattern matrix directly (e.g., from the SPSS output) and creates the measurement model diagram accordingly.

Because, in certain versions of SPSS, the displayed numbers in the output may have a comma instead of a dot as the decimal separator, we can correct for this using the **following syntax**:

SET LOCALE="en_US.windows-1252".

**b) Results.**

The main problem I ran into is the presence of numerous cross-loadings and, more threateningly, a pattern of loadings on the latent factors that differs somewhat across all between-race comparisons. That means configural invariance (equivalence) is already not very strong, though not clearly violated. Conducting the MGCFA test gives CFI and RMSEA values >0.90 and <0.10, respectively, although intercept invariance is strongly violated for every group comparison. But I will dedicate a later post to this topic.

The XLS file (see below) displays the model fit indices for the first-order g model and the second-order g models with 3 and 4 group factors, with and without cross-loadings, i.e., when one or more subtests load almost equally well on several factors and these loadings are meaningful (e.g., around 0.30).

It has been difficult to choose among the g-models. For instance, the 3-group g model seems to have a slight advantage over the 4-group g model in terms of fit, but one of its group factors, labeled “school”, which comprises the verbal and math subtests (as well as some technical knowledge subtests), has a factor loading on g greater than 1 for all racial groups. The 4-group g model was the more coherent. The first-order g model has the worst fit, perhaps because the multiple subtests of verbal, math, speed and technical skills suggest the presence of factors summarizing these variables below g itself (Little et al., **2002**, p. 170).

If few or no cross-loadings are used, it is possible to follow authors’ recommendations from their own CFAs. Coyle et al. (**2008**, **2011**) and Ree & Carretta (**1994**) produced an ASVAB 3-group g model (what they call a Vernon-like model) with a scholastic factor (AR, PC, WK, MK), speed (NO, CS) and a technical factor (GS, AI, SI, MC, EI), and also a 4-group g model with verbal/technical (GS, WK, PC, EI), technical (AI, SI, MC, EI), math (AR, MK, MC) and speed (NO, CS). Deary et al. (**2007**) also employed a 4-group g model with verbal (GS, PC, WK), math (AR, MK), speed (NO, CS) and technical factors (GS, AI, SI, MC, EI), as did Roberts et al. (**2000**, Table 2).

On the other hand, I am very hesitant to use the models employed by others, in which there are no (or few) cross-loadings. That would not be consistent with my series of EFAs, from which I obtained many cross-loadings everywhere. As noted earlier, incorrectly specifying a non-zero cross-loading to be zero (or the reverse) leads to misspecification, a situation we must avoid. I believe it is safer to simply follow the pattern of loadings produced by my own EFAs on the NLSY97.

Here I present my oblique factor analyses. From left to right: blacks, hispanics, whites. Some numbers are missing because I requested SPSS not to display coefficients lower than 0.10 (in absolute value), to ease readability. I have chosen the ASVAB without the AO subtest because, after many runs, it seems its removal yields the most coherent pattern of all. Furthermore, AO appears to measure a spatial ability, which may not be unambiguously related to any of the factors revealed by the EFAs.

Allowing cross-loadings and/or correlated errors does not really affect the structural path coefficients, only the factor loadings in the ASVAB measurement model, at least in the present case. Also, when indicators load on multiple factors, their standardized loadings are interpreted as beta weights (β) that control for the correlated factors (Kline, **2011**, p. 231). Because beta weights are not correlations, one cannot generally square their values to derive proportions of explained variance.

**3.2. SEM. Structural model.**

When deciding between standardized and unstandardized estimates, we should bear in mind that the numbers near the curved double-headed arrows represent covariances in the unstandardized solution and correlations in the standardized one. Remember that a covariance is the unstandardized form of a correlation: a correlation is computed by dividing the covariance by the SDs of both variables, thus removing the units of measurement (but rendering the correlation sensitive to range restriction in the SDs). In the standardized solution, the numbers just above the variables having an error term are the R² values, summarizing the percentage of variance in the dependent variable explainable by the predictor(s). In the unstandardized solution, we instead get estimates of the error variances.
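
The covariance-to-correlation relationship stated above is a one-liner (the numbers below are hypothetical):

```python
# A correlation is the covariance divided by the product of the two SDs,
# which strips the units of measurement from the covariance.
def correlation(cov_xy, sd_x, sd_y):
    return cov_xy / (sd_x * sd_y)

print(correlation(12.0, 4.0, 5.0))  # -> 0.6
```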

Above is a picture from AMOS Graphics (for the black sample). In the ASVAB measurement model, we see the loadings of the indicators on the first-order factors and the loadings of the first-order factors on the second-order factor, namely g. To obtain the loading of an indicator on the second-order factor (g), we multiply the loading of the indicator on its first-order factor by the loading of that first-order factor on the second-order factor (Browne, **2006**, p. 335). So, for instance, for General Science (GS) the loading on g is 0.84*0.94=0.79, and for Coding Speed (CS) it is 0.67*0.81=0.54. The numbers associated with the indicators, such as 0.71 for GS or 0.44 for CS, represent the variance explained by the factor, because 0.84^2=0.71. In the SEM framework, this value of 0.71 is also interpreted as the lower-bound reliability of the variable.
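
The two computations in that paragraph (second-order loading and explained variance) reduce to simple arithmetic, using the loadings reported above:

```python
# An indicator's loading on the second-order factor (g) is its first-order
# loading times the first-order factor's loading on g.
def loading_on_g(first_order_loading, factor_loading_on_g):
    return first_order_loading * factor_loading_on_g

# General Science (GS): 0.84 on its group factor, which loads 0.94 on g.
print(round(loading_on_g(0.84, 0.94), 2))  # -> 0.79
# Coding Speed (CS): 0.67 on its group factor, which loads 0.81 on g.
print(round(loading_on_g(0.67, 0.81), 2))  # -> 0.54
# Variance in GS explained by its factor (also its lower-bound reliability):
print(round(0.84 ** 2, 2))  # -> 0.71
```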

The direct paths from the independent variables to the dependent variable can be interpreted as the effect of each independent variable on the dependent variable when partialling out the influence of the other independent variable. Although it is possible to reverse the direction of the arrows, so that instead of SES->g we have g->SES, this would be methodologically unwise: children’s IQ cannot cause (previously) reported parental SES. And the coefficients stay the same anyway, because both patterns impose the same restrictions on the implied covariance matrix (Tomarken & Waller, **2003**, p. 580); the models are conceptually different but mathematically equivalent. Similarly, if we remove the error term (e21) from ASVAB_g and draw a covariance between ASVAB_g and SES, the paths ASVAB->Achiev and SES->Achiev stay the same. Remember that these structural direct paths can be interpreted as the B and Beta coefficients of multiple regression analyses (we only need to specify a covariance between the predictors).

In the analyses, I notice that correlating the residuals of mom_educ and dad_educ systematically (if modestly) increases the positive direct path SES->Achiev.

**a) ASVAB.**

Here I present the results from the original data using ML (go to the end of the post for the imputed data). With the imputed data, for blacks, the effect of g on Achiev is somewhat smaller (by ~0.030) whereas the effect of parental SES on Achiev is higher (by ~0.030); in total, the overall difference is not trivial. Fortunately, for hispanics and whites, there was virtually no difference between the pooled imputed data and the original data.

In the case of blacks, the direct path SES->Achiev amounts to 0.37. The paths SES->g and g->Achiev amount to 0.60 and 0.50, which yields a Beta coefficient of 0.60*0.50=0.30 for the indirect effect of SES on Achiev. The total effect is thus 0.37+0.30=0.67, compared to 0.50 for g. In the model with only SES and Achiev factors, the path is 0.64, thus the amount of mediation could be 0.64-0.37=0.27.

In the case of hispanics, the direct path SES->Achiev amounts to 0.24. The paths SES->g and g->Achiev amount to 0.63 and 0.52, which yields a Beta coefficient of 0.63*0.52=0.33 for the indirect effect of SES on Achiev. The total effect is thus 0.24+0.33=0.57, compared to 0.52 for g. In the model with only SES and Achiev factors, the path is 0.50, thus the amount of mediation could be 0.50-0.24=0.26.

In the case of whites, the direct path SES->Achiev amounts to 0.38. The paths SES->g and g->Achiev amount to 0.52 and 0.56, which yields a Beta coefficient of 0.52*0.56=0.29 for the indirect effect of SES on Achiev. The total effect is thus 0.29+0.38=0.67, compared to 0.56 for g. In the model with only SES and Achiev factors, the path is 0.67, thus the amount of mediation could be 0.67-0.38=0.29.
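
The arithmetic used in the three paragraphs above can be collected into one small helper: the indirect effect of SES on Achiev is (SES->g) * (g->Achiev), and the total effect is the direct path plus the indirect effect. The path values are taken from the text.

```python
# Decompose a simple mediation model: indirect = a * b, total = direct + indirect.
def mediation(direct, ses_to_g, g_to_achiev):
    indirect = round(ses_to_g * g_to_achiev, 2)
    total = round(direct + indirect, 2)
    return indirect, total

print(mediation(0.37, 0.60, 0.50))  # blacks    -> (0.3, 0.67)
print(mediation(0.24, 0.63, 0.52))  # hispanics -> (0.33, 0.57)
print(mediation(0.38, 0.52, 0.56))  # whites    -> (0.29, 0.67)
```

The same decomposition applies to the ACT/SAT analyses in the next section, with ACT-g or SAT-g in place of g.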

Next, I re-conduct the analysis, this time using a latent GPA built from 5 measures (english, foreign languages, math, social science, life and physical sciences) instead of the Achiev factor (GPA+grade). Among blacks and whites, the direct effect of SES on GPA is very small (0.11 and 0.16), and it has a negative sign among hispanics (-0.07).

**b) ACT/SAT.**

Here, I report the results from the ML method. I have not used imputations, for lack of time and because the percentage of missing values is extremely high: when SES and GPA/grade variables are included in an analysis of missing data patterns, missingness on the ACT variables amounts to 87% for blacks, 93% for hispanics, and 80% for whites. The respective numbers for the SAT variables (math and verbal) are 89%, 89%, and 79%.

Also, remember that the residuals for ACT-English and ACT-Read are here correlated, as well as the residuals for mother and father’s education.

In the case of blacks, the direct path SES->Achiev amounts to 0.33. The paths SES->ACT-g and ACT-g->Achiev amount to 0.56 and 0.66, which yields a Beta coefficient of 0.56*0.66=0.37 for the indirect effect of SES on Achiev. The total effect for SES is thus 0.33+0.37=0.70, compared to 0.66 for ACT-g. In the model with only SES and Achiev factors, the path is 0.64, thus the amount of mediation could be 0.64-0.33=0.31.

In the case of hispanics, the direct path SES->Achiev amounts to 0.16. The paths SES->ACT-g and ACT-g->Achiev amount to 0.64 and 0.60, which yields a Beta coefficient of 0.64*0.60=0.38 for the indirect effect of SES on Achiev. The total effect for SES is thus 0.16+0.38=0.54, compared to 0.60 for ACT-g. In the model with only SES and Achiev factors, the path is 0.50, thus the amount of mediation could be 0.50-0.16=0.34.

In the case of whites, the direct path SES->Achiev amounts to 0.24. The paths SES->ACT-g and ACT-g->Achiev amount to 0.58 and 0.76, which yields a Beta coefficient of 0.58*0.76=0.44 for the indirect effect of SES on Achiev. The total effect for SES is thus 0.24+0.44=0.68, compared to 0.76 for ACT-g. In the model with only SES and Achiev factors, the path is 0.67, thus the amount of mediation could be 0.67-0.24=0.43.

For SAT-g (using the verbal and math sections as observed variables), the paths SES->Achiev, SES->SAT-g, SAT-g->Achiev amount to 0.43, 0.61, 0.39 for blacks, 0.40, 0.33, 0.48 for hispanics, and 0.30, 0.58, 0.67 for whites. Clearly, these numbers seem impressively disparate.

Here again, I re-conduct the analysis using a latent GPA built from 5 measures (english, foreign languages, math, social science, life and physical sciences) instead of Achiev (GPA+grade). For the ACT, among blacks and whites, the direct effect of SES on GPA amounts to zero, and it has a negative sign among hispanics (-0.193). In the model with only SES and GPA, the SES->GPA path is 0.28 (blacks), 0.17 (hispanics), 0.41 (whites).

**3.3. Replication of Brodnick & Ree (1995).**

Brodnick & Ree (**1995**) attempted to show which structural model best summarizes the relationship between parental SES, SAT/ACT-g, and GPA (school grades). They demonstrated that adding a latent SES factor to the SAT/ACT-g and GPA latent factors does not improve model fit over a model with only the SAT/ACT-g and GPA latent factors. This finding has been interpreted as saying that, after including g, SES has no additional explanatory power. But this is probably because two of Brodnick & Ree’s indicators (parental age, family size) are poor measures of parental SES: their Table 2 shows that these variables do not correlate with any others, except family size with income, and only modestly (0.181). Hence the absence of a model fit increment. When I tried to replicate their model (using the original data set, not the imputed one, with the ML method), as shown in their figures, the finding was different: a model including latent SES fits better than a model without it. I used the same variables as cited above, except that for GPA I included 5 measures (english, foreign language, math, social sciences, life sciences) to construct a latent GPA factor. The reason why I failed to confirm their result is now obvious: all of my observed variables are good measures of SES, so the model including the latent SES shows an improvement in fit.

**4. Limitations.**

The first obvious limitation is that the ASVAB seems truly awful. Roberts et al. (**2000**) rightly point out that “The main reason the ASVAB was constructed without any obviously coherent factorial structure is quite clear. The initial purpose of this multiple-aptitude battery was as a classification instrument. Therefore, tests were selected on the basis of perceived similarities to military occupations rather than any psychological theory.” (pp. 86-87). The authors advanced the idea that because the ASVAB is scholastically/verbally biased, it measures purely scholastic competence. But the fact that the ASVAB g-factor score, as well as its subtests’ g-loadings, correlate with reaction time tests, a prototypical measure of fluid intelligence, excludes this interpretation (Jensen, **1998**, pp. 236-238; Larson et al., **1988**). In any case, the fact remains that the ASVAB needs to be revised, and not only because it is racially biased (e.g., when performing MGCFA).

We note a problem with GPA as well. The loading of GPA on the achievement factor is around 0.54 and 0.58 for blacks and hispanics but amounts to 0.76 for whites. When I use a latent GPA as my outcome variable, the R² for blacks and hispanics amounts to 0.19 and 0.16, but to 0.43 for whites. A large share of variance is left unexplained by the same set of variables for the minority groups, suggesting additional factors not accounted for by the predictors (though statistical artifacts could be suspected too, e.g., higher difficulty causing a pile-up of scores at the low end of the distribution and thus a reduction in score variance). For ACT-g, the respective numbers for blacks, hispanics and whites are 0.46, 0.42, and 0.72; for SAT-g, 0.31, 0.39, 0.63. Why the huge difference in predictive validities? Because the GPA score is a transcript record from school, minorities cannot be suspected of inflating their scores (see NLSY97 **Appendix-11**). Perhaps GPA has a different meaning for blacks/hispanics than for whites; the same grade may lack comparability across schools, because schools differ in quality. I would therefore have expected similar results for the Grade variable (highest level of education completed), but there was no such minority-majority difference: the R² is (somewhat) similar for blacks and whites but clearly lower (by 0.10-0.15) for hispanics. Now, it must be mentioned that R² is not always easy to interpret, and a conversion of R² into an odds ratio reveals that even a low R² can be truly meaningful (Sackett et al., **2008**, p. 216; Jensen, **1980**, pp. 306-307).

Another dissatisfaction concerns the impact of imputation on the parameter estimates. Compared to the ML results, the distortion is not trivial for blacks, with ASVAB(g)->Achiev constantly at about 0.46 or 0.47 (0.50 for ML) and SES->Achiev constantly at about 0.40 (0.37 for ML), whereas the difference approaches zero for whites and hispanics. Hence, I performed the imputation again (10 data sets) for each group, first with the linear regression method and then with the PMM method: 60 imputed data sets in total. At first glance, both methods yield the same results, and this time the parameter values look closer to those obtained with maximum likelihood. Interestingly, the paths SES->ASVAB and ASVAB->Achiev appear more or less stable; this was not the case for SES->Achiev. The reason is obvious: the ASVAB variables have no missing values. Furthermore, SES->Achiev seems more stable among whites than among blacks or hispanics, I would guess because the white sample had far fewer missing values. My first impression of imputation is not good: with about 10, 20 and 30% missing on some variables, choosing 5 imputed data sets may not be wise. I believe we need much more than this, perhaps 15 or 20 at minimum.

Concerning the mediational effect of parental SES on GPA, removing the SES->GPA path does not affect model fit, unlike removing the ASVAB/ACT/SAT->GPA path. This looks curious because, in the black and white samples for the ASVAB model, the path is 0.11 and 0.16, clearly different from zero, yet the impact on model fit is not at all clear. Either SES (independent of g) really has no meaningful impact on GPA, or fit indices are not sensitive enough to detect the misspecification. I would say the first hypothesis seems more likely.
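The standard way to test whether dropping a path hurts fit is a chi-square (likelihood-ratio) difference test between nested models. A minimal sketch, with hypothetical chi-square values for illustration:

```python
# Nested-model comparison: full model (SES->GPA free) vs. restricted model
# (SES->GPA fixed to zero). Chi-square and df values are hypothetical.
chisq_full, df_full = 210.4, 98
chisq_restricted, df_restricted = 213.1, 99

delta_chisq = round(chisq_restricted - chisq_full, 2)
delta_df = df_restricted - df_full
CRITICAL_1DF_05 = 3.841  # chi-square critical value for df = 1, alpha = .05

worsens_fit = delta_chisq > CRITICAL_1DF_05
print(delta_chisq, worsens_fit)  # 2.7 False -> the path can be dropped
```

A nonsignificant difference, as here, is the formal version of "removing the path does not impact model fit."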

When examining the total effect of SES versus the total effect of ASVAB (or ACT/SAT), it would seem that parental SES is more important than ASVAB in predicting achievement. But knowing that the prerequisites for causal inference have probably not been met here (e.g., controlling for previous SES and IQ when examining the impact of the mediators, by having the SES, IQ, and grade variables at previous waves as well), I would caution against overinterpreting these numbers. The only robust conclusion at present is that the direct path SES->Achiev is much smaller than the direct path ASVAB/ACT/SAT->Achiev, consistent with Jensen's earlier statement.
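The decomposition being compared here is simple to state: total effect = direct effect + indirect (mediated) effect. A minimal sketch with hypothetical standardized coefficients, in the ballpark of those reported above:

```python
# SES -> ASVAB -> Achiev mediation: the indirect effect is the product of
# the component paths, and the total effect adds the direct path.
a = 0.43        # SES -> ASVAB (hypothetical)
b = 0.50        # ASVAB -> Achiev (hypothetical)
c_prime = 0.37  # SES -> Achiev, direct (hypothetical)

indirect = a * b            # 0.215
total = c_prime + indirect  # 0.585
```

Comparing totals across predictors is exactly where the causal caveat bites: the product `a * b` is a causal mediation effect only if the model's ordering assumptions hold.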

Both univariate and multivariate non-normality have the potential to distort parameter estimates and to falsely reject models. They should be taken seriously. Concerning outliers, I have detected only two cases (in the black sample). These had the highest d² values, 67 and 62, which depart somewhat from the values immediately following them, i.e., 52, 51, 51, 50, 47, 46, 46, and so forth. With those two cases removed, I see no change in the parameter estimates or fit indices; their impact has likely been attenuated by the large N. Next are the univariate skew and kurtosis. None of the variables exceeded 2 for skew or 10 for kurtosis. The highest values were for Dad_educ and Mom_educ, with kurtosis of about 2.5 and 1.5, respectively, again in the black sample. The values of multivariate kurtosis ranged between 18 and 19, between 13 and 14, and between 21 and 22 in the black, Hispanic, and white samples, respectively. As can be seen, there is little variability in the kurtosis values due to imputation. In any case, these values are much lower than Bollen's (**1989**) threshold value of p(p+2), which equals 16*(16+2)=288, p being the number of observed variables. With respect to the bootstrap estimates, they reveal no differences from the ML estimates for any of the imputed datasets.
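The univariate screening rule used above (|skew| > 2 or kurtosis > 10, common cutoffs in the SEM literature) can be sketched with simple moment estimators; the data below are a toy example, not the NLSY97 variables:

```python
def skew_kurtosis(xs):
    """Sample skewness and excess kurtosis (simple moment estimators)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3  # skew, excess kurtosis

# screening rule: flag |skew| > 2 or kurtosis > 10
xs = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # toy data with a mild right tail
s, k = skew_kurtosis(xs)
flagged = abs(s) > 2 or k > 10  # False: within the cutoffs
```

Note that this computes excess kurtosis (normal = 0), which is the convention most SEM programs report.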

**5. Discussion.**

Truly, this post is just a pretext for showing how SEM works. So don't get me wrong: there is nothing special in this analysis or even in its conclusions. It's rather boring and wasn't worth the huge amount of time I spent on it. Nevertheless, some comments are worth making.

The fact that parental SES correlates with children's (later) success does not imply that the link works entirely through an environmental path (Trzaskowski et al., **2014**). Earlier studies show that children's IQ has a stronger relationship with their later (adult) SES than parental SES does (Herrnstein & Murray, **1994**, ch. 5 and 6; Jensen, **1998**, p. 384; Saunders, **1996**, **2002**; Strenze, **2007**, see Table 1 (column p) and footnote 9). This pattern is more consistent with a causal path from earlier IQ toward later SES than the reverse. To some extent, it is also in line with a path analysis study showing that earlier IQ causing later IQ and achievement is a more likely hypothesis than earlier achievement causing later IQ and achievement (Watkins et al., **2007**).

And, as we see above, ACT-g has much more explanatory power than ASVAB-g for the Achievement factor. One possible reason is that the ACT has a stronger scholastic component; another is that the ACT measures g better than the ASVAB, although I see no specific reason for that. Either way, the difference seems large. This would be consistent with Coyle & Pillow (**2008**, Figure 2), who concluded that the ACT and SAT still predict GPA even after removing the influence of ASVAB-g, with beta coefficients similar to that of g.

**Recommended readings for an introduction to SEM:**

Beaujean Alexander A. (2012). Latent Variable Models in Education.

Hooper Daire, Coughlan Joseph, Mullen Michael R. (2008). Structural Equation Modelling: Guidelines for Determining Model Fit.

Hu Changya (2010). Bootstrapping in Amos.

Kenny David A. (2011). Terminology and Basics of SEM.

Kenny David A. (2013). Measuring Model Fit.

Lei Pui-Wa and Wu Qiong (2007). Introduction to Structural Equation Modeling: Issues and Practical Considerations.

Tomarken Andrew J., Waller Niels G. (2005). Structural Equation Modeling: Strengths, Limitations, and Misconceptions.

**Recommended readings on technical issues related to SEM:**

Allison Paul D. (2002). Missing data.

Allison Paul D. (2009). Missing Data, Chapter 4, in The SAGE Handbook of Quantitative Methods in Psychology (Millsap, & Maydeu-Olivares, 2009).

Allison Paul D. (November 9, 2012). Why You Probably Need More Imputations Than You Think.

Andridge Rebecca R., Little Roderick J. A. (2010). A Review of Hot Deck Imputation for Survey Non-response.

Baron Reuben M., Kenny David A. (1986). The Moderator-Mediator Variable Distinction in Social Psychological Research: Conceptual, Strategic, and Statistical Considerations.

Bentler P.M., Chou Chih-Ping (1987). Practical Issues in Structural Modeling.

Browne Michael W., MacCallum Robert C., Kim Cheong-Tag, Andersen Barbara L., Glaser Ronald (2002). When Fit Indices and Residuals Are Incompatible.

Cheung Gordon W., Rensvold Roger B. (2001). The Effects of Model Parsimony and Sampling Error on the Fit of Structural Equation Models.

Cheung Gordon W., Lau Rebecca S. (2008). Testing Mediation and Suppression Effects of Latent Variables: Bootstrapping With Structural Equation Models.

Cole David A., Maxwell Scott E. (2003). Testing Mediational Models With Longitudinal Data: Questions and Tips in the Use of Structural Equation Modeling.

Cole David A., Ciesla Jeffrey A., Steiger James H. (2007). The Insidious Effects of Failing to Include Design-Driven Correlated Residuals in Latent-Variable Covariance Structure Analysis.

Costello Anna B. & Osborne Jason W. (2005). Best Practices in Exploratory Factor Analysis: Four Recommendations for Getting the Most From Your Analysis.

Fan Xitao, Sivo Stephen A. (2007). Sensitivity of Fit Indices to Model Misspecification and Model Types.

Gao Shengyi, Mokhtarian Patricia L., Johnston Robert A. (2008). Nonnormality of Data in Structural Equation Models.

Graham John W. (2009). Missing Data Analysis: Making It Work in the Real World.

Grewal Rajdeep, Cote Joseph A., Baumgartner Hans (2004). Multicollinearity and Measurement Error in Structural Equation Models: Implications for Theory Testing.

Hardt Jochen, Herke Max, Leonhart Rainer (2012). Auxiliary variables in multiple imputation in regression with missing X: A warning against including too many in small sample research.

Heene Moritz, Hilbert Sven, Draxler Clemens, Ziegler Matthias, Bühner Markus (2011). Masking Misfit in Confirmatory Factor Analysis by Increasing Unique Variances: A Cautionary Note on the Usefulness of Cutoff Values of Fit Indices.

Hopwood Christopher J. (2007). Moderation and Mediation in Structural Equation Modeling: Applications for Early Intervention Research.

Kenny David A. (2011). Miscellaneous Variables.

Kenny David A. (2011). Single Latent Variable Model.

Kline Rex B. (2013). Reverse arrow dynamics: Feedback loops and formative measurement.

Koller-Meinfelder Florian (2009). Analysis of Incomplete Survey Data – Multiple Imputation via Bayesian Bootstrap Predictive Mean Matching.

Lee Katherine J., Carlin John B. (2009). Multiple Imputation for Missing Data: Fully Conditional Specification Versus Multivariate Normal Imputation.

Little Todd D., Cunningham William A., Shahar Golan (2002). To Parcel or Not to Parcel: Exploring the Question, Weighing the Merits.

MacCallum Robert C., Roznowski Mary, Necowitz Lawrence B. (1992). Model Modifications in Covariance Structure Analysis: The Problem of Capitalization on Chance.

MacKenzie Scott B., Podsakoff Philip M., Jarvis Cheryl Burke (2005). The Problem of Measurement Model Misspecification in Behavioral and Organizational Research and Some Recommended Solutions.

Osborne Jason W., Overbay Amy (2004). The power of outliers (and why researchers should always check for them).

Reddy Srinivas K. (1992). Effects of Ignoring Correlated Measurement Error in Structural Equation Models.

Savalei Victoria (2012). The Relationship Between Root Mean Square Error of Approximation and Model Misspecification in Confirmatory Factor Analysis Models.

Sharma Subhash, Mukherjeeb Soumen, Kumarc Ajith, Dillon William R. (2005). A simulation study to investigate the use of cutoff values for assessing model fit in covariance structure models.

Tomarken Andrew J., Waller Niels G. (2003). Potential Problems With “Well Fitting” Models.

van Buuren Stef (2007). Multiple imputation of discrete and continuous data by fully conditional specification.

van Buuren Stef, Brand J.P.L., Groothuis-Oudshoorn C.G.M., Rubin D.B. (2006). Fully conditional specification in multivariate imputation.

**Required readings for AMOS users:**

Structural Equation Modeling using AMOS: An Introduction.

Structural Equation Models: Introduction – SEM with Observed Variables Exercises – AMOS.

Confirmatory Factor Analysis using Amos, LISREL, Mplus, SAS/STAT CALIS.

**Required books for AMOS users:**

Arbuckle James L. (2011). IBM® SPSS® Amos 20 User’s Guide.

Byrne Barbara M. (2010). Structural Equation Modeling With AMOS: Basic Concepts, Applications, and Programming, Second Edition.

Kline Rex B. (2011). Principles and Practice of Structural Equation Modeling, Third Edition.

**Required books for SPSS users:**

IBM SPSS Missing Values 20.

Graham John W. (2012). Multiple Imputation and Analysis with SPSS 17-20.

*The XLS spreadsheet is **here**. The syntax is displayed **here**.*

It seems that a lot of people, particularly on the Internet, confuse absolutely everything when discussing correlational studies in the social sciences. They offer a truncated, biased, incomplete view of reality as it currently stands.

A good example: suppose that in Country #1 there is a negative correlation between GDP and corruption. The typical counter-argument is that Country #2 has higher GDP but also more corruption. Variants of this argument are innumerable.

The obvious mistake is the serious neglect of confounding factors. What differs between Country #1 and Country #2 is not only GDP and corruption but thousands of other parameters that were not taken into account in the bivariate correlation, or even in a multiple regression.

Researchers do not generally fall into this trap; although they do not say so explicitly, they base the conclusion that variable #1 correlates with variable #2 on the all-else-equal assumption. That is, when everything else is held constant, there is a connection between the two variables being tested.

Of course, it is impossible to include every possible confounder, because humans are not omniscient; they are not gods. Researchers therefore include a set of variables they suspect are plausible candidates for mediating the relation between the two principal variables of interest. The problem emerges when a study reports results on individual differences within a given country and is countered by defective arguments claiming that other countries have more of "this" or more of "that", which is taken as contradicting the relationship established in the study.

This over-simplified view totally ignores the fact that within-group differences are not necessarily equivalent to between-group differences, for the reasons detailed above. The meaning of a parameter is not necessarily equivalent across (e.g., racial) groups. What makes Population #1 happy does not necessarily make Population #2 happy. And even when it does, there may still be large disparities in the correlations when comparing them across groups. It is also possible that a "score" on some variable X is reached through different causal pathways. In that case, the score on variable X is not comparable or generalizable to other groups, unless we can detect and correct the causes behind this "anomaly". The problem becomes clearer when this group-biased variable X is used in correlations with other variables: valid inferences are, undeniably, no longer possible.

In fact, there is no "all-else-equal" state in reality, because everything moves and changes together, perpetually, over time. A correlation observed in one period may no longer hold when studied in another. In other words, once a causal relation is established, because the elements composing the real world keep changing, some previously established evidence may no longer be valid at a later date.

The all-else-equal assumption is nothing more (and nothing less) than a mental tool. One simply assumes, on theoretical grounds, that one factor causes another under certain restrictions. So when a critique of a theory neglects this assumption, it completely misses the target.

Pui-Wa Lei and Qiong Wu, The Pennsylvania State University (Fall 2007)

Structural equation modeling (SEM) is a versatile statistical modeling tool. Its estimation techniques, modeling capacities, and breadth of applications are expanding rapidly. This module introduces some common terminologies. General steps of SEM are discussed along with important considerations in each step. Simple examples are provided to illustrate some of the ideas for beginners. In addition, several popular specialized SEM software programs are briefly discussed with regard to their features and availability. The intent of this module is to focus on foundational issues to inform readers of the potentials as well as the limitations of SEM. Interested readers are encouraged to consult additional references for advanced model types and more application examples.

Structural equation modeling (SEM) has gained popularity across many disciplines in the past two decades due perhaps to its generality and flexibility. As a statistical modeling tool, its development and expansion are rapid and ongoing. With advances in estimation techniques, basic models, such as measurement models, path models, and their integration into a general covariance structure SEM analysis framework have been expanded to include, but are by no means limited to, the modeling of mean structures, interaction or nonlinear relations, and multilevel problems. The purpose of this module is to introduce the foundations of SEM modeling with the basic covariance structure models to new SEM researchers. Readers are assumed to have basic statistical knowledge in multiple regression and analysis of variance (ANOVA). References and other resources on current developments of more sophisticated models are provided for interested readers.

**What is Structural Equation Modeling?**

Structural equation modeling is a general term that has been used to describe a large number of statistical models used to evaluate the validity of substantive theories with empirical data. Statistically, it represents an extension of general linear modeling (GLM) procedures, such as the ANOVA and multiple regression analysis. One of the primary advantages of SEM (vs. other applications of GLM) is that it can be used to study the relationships among latent constructs that are indicated by multiple measures. It is also applicable to both experimental and non-experimental data, as well as cross-sectional and longitudinal data. SEM takes a confirmatory (hypothesis testing) approach to the multivariate analysis of a structural theory, one that stipulates causal relations among multiple variables. The causal pattern of intervariable relations within the theory is specified a priori. The goal is to determine whether a hypothesized theoretical model is consistent with the data collected to reflect this theory. The consistency is evaluated through model-data fit, which indicates the extent to which the postulated network of relations among variables is plausible. SEM is a large sample technique (usually N > 200; e.g., Kline, 2005, pp. 111, 178) and the sample size required is somewhat dependent on model complexity, the estimation method used, and the distributional characteristics of observed variables (Kline, pp. 14–15). SEM has a number of synonyms and special cases in the literature including path analysis, causal modeling, and covariance structure analysis. In simple terms, SEM involves the evaluation of two models: a measurement model and a path model. They are described below.

**Path Model**

Path analysis is an extension of multiple regression in that it involves various multiple regression models or equations that are estimated simultaneously. This provides a more effective and direct way of modeling mediation, indirect effects, and other complex relationships among variables. Path analysis can be considered a special case of SEM in which structural relations among observed (vs. latent) variables are modeled. Structural relations are hypotheses about directional influences or causal relations of multiple variables (e.g., how independent variables affect dependent variables). Hence, path analysis (or the more generalized SEM) is sometimes referred to as causal modeling. Because analyzing interrelations among variables is a major part of SEM and these interrelations are hypothesized to generate specific observed covariance (or correlation) patterns among the variables, SEM is also sometimes called covariance structure analysis.

In SEM, a variable can serve both as a source variable (called an exogenous variable, which is analogous to an independent variable) and a result variable (called an endogenous variable, which is analogous to a dependent variable) in a chain of causal hypotheses. This kind of variable is often called a mediator. As an example, suppose that family environment has a direct impact on learning motivation which, in turn, is hypothesized to affect achievement. In this case motivation is a mediator between family environment and achievement; it is the source variable for achievement and the result variable for family environment. Furthermore, feedback loops among variables (e.g., achievement can in turn affect family environment in the example) are permissible in SEM, as are reciprocal effects (e.g., learning motivation and achievement affect each other). **[1]**

In path analyses, observed variables are treated as if they are measured without error, an assumption that is unlikely to hold in most social and behavioral sciences. When observed variables contain error, estimates of path coefficients may be biased in unpredictable ways, especially for complex models (e.g., Bollen, 1989, pp. 151–178). Estimates of reliability for the measured variables, if available, can be incorporated into the model to fix their error variances (e.g., squared standard error of measurement via classical test theory). Alternatively, if multiple observed variables that are supposed to measure the same latent constructs are available, then a measurement model can be used to separate the common variances of the observed variables from their error variances, thus correcting the coefficients in the model for unreliability. **[2]**
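Both strategies mentioned here reduce to simple formulas. A minimal sketch (the reliability and correlation values are hypothetical):

```python
import math

# (1) Disattenuation: correct a correlation for unreliability in both measures.
def disattenuate(r_xy, rel_x, rel_y):
    return r_xy / math.sqrt(rel_x * rel_y)

# (2) Single-indicator latent variable: fix the indicator's error variance
#     from a known reliability: error variance = (1 - reliability) * variance.
def fixed_error_variance(reliability, observed_variance):
    return (1 - reliability) * observed_variance

r_true = disattenuate(0.42, 0.80, 0.70)  # ~0.56: larger than the observed 0.42
err = fixed_error_variance(0.80, 25.0)   # 5.0: the variance treated as error
```

The second formula is what one would use in an SEM program to fix the error variance of a single observed indicator rather than estimate it.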

**Measurement Model**

The measurement of latent variables originated from psychometric theories. Unobserved latent variables cannot be measured directly but are indicated or inferred by responses to a number of observable variables (indicators). Latent constructs such as intelligence or reading ability are often gauged by responses to a battery of items that are designed to tap those constructs. Responses of a study participant to those items are supposed to reflect where the participant stands on the latent variable. Statistical techniques, such as factor analysis, exploratory or confirmatory, have been widely used to examine the number of latent constructs underlying the observed responses and to evaluate the adequacy of individual items or variables as indicators for the latent constructs they are supposed to measure.

The measurement model in SEM is evaluated through confirmatory factor analysis (CFA). CFA differs from exploratory factor analysis (EFA) in that factor structures are hypothesized a priori and verified empirically rather than derived from the data. EFA often allows all indicators to load on all factors and does not permit correlated residuals. Solutions for different numbers of factors are often examined in EFA and the most sensible solution is interpreted. In contrast, the number of factors in CFA is assumed to be known. In SEM, these factors correspond to the latent constructs represented in the model. CFA allows an indicator to load on multiple factors (if it is believed to measure multiple latent constructs). It also allows residuals or errors to correlate (if these indicators are believed to have common causes other than the latent factors included in the model). Once the measurement model has been specified, structural relations of the latent factors are then modeled essentially the same way as they are in path models. The combination of CFA models with structural path models on the latent constructs represents the general SEM framework in analyzing covariance structures.
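The covariance structure a one-factor CFA implies can be written out directly as Sigma = L·phi·L' + Theta. A minimal sketch with hypothetical standardized loadings:

```python
# Model-implied covariance matrix of a one-factor CFA:
# Sigma = L * phi * L' + Theta. Loadings are hypothetical and standardized.
loadings = [0.8, 0.7, 0.6]               # L: factor loadings
phi = 1.0                                # factor variance (fixed to 1 to set the scale)
errors = [1 - l ** 2 for l in loadings]  # Theta diagonal: residual variances

p = len(loadings)
sigma = [[loadings[i] * phi * loadings[j] + (errors[i] if i == j else 0.0)
          for j in range(p)] for i in range(p)]

# each off-diagonal entry is the product of the two loadings,
# e.g. sigma[0][1] = 0.8 * 0.7 = 0.56; diagonals equal 1.0
```

Model fit in CFA amounts to asking how close this implied matrix comes to the observed covariance (or correlation) matrix.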

**Other Models**

Current developments in SEM include the modeling of mean structures in addition to covariance structures, the modeling of changes over time (growth models) and latent classes or profiles, the modeling of data having nesting structures (e.g., students are nested within classes which, in turn, are nested with schools; multilevel models), as well as the modeling of nonlinear effects (e.g., interaction). Models can also be different for different groups or populations by analyzing multiple sample-specific models simultaneously (multiple sample analysis). Moreover, sampling weights can be incorporated for complex survey sampling designs. See Marcoulides and Schumacker (2001) and Marcoulides and Moustaki (2002) for more detailed discussions of the new developments in SEM.

**How Does SEM Work?**

In general, every SEM analysis goes through the steps of model specification, data collection, model estimation, model evaluation, and (possibly) model modification. Issues pertaining to each of these steps are discussed below.

**Model Specification**

A sound model is theory based. Theory is based on findings in the literature, knowledge in the field, or one’s educated guesses, from which causes and effects among variables within the theory are specified. Models are often easily conceptualized and communicated in graphical forms. In these graphical forms, a directional arrow (→) is universally used to indicate a hypothesized causal direction. The variables to which arrows are pointing are commonly termed endogenous variables (or dependent variables) and the variables having no arrows pointing to them are called exogenous variables (or independent variables). Unexplained covariances among variables are indicated by curved double-headed arrows (↔). Observed variables are commonly enclosed in rectangular boxes and latent constructs are enclosed in circular or elliptical shapes.

For example, suppose a group of researchers have developed a new measure to assess mathematics skills of preschool children and would like to find out (a) whether the skill scores measure a common construct called math ability and (b) whether reading readiness (RR) has an influence on math ability when age (measured in months) differences are controlled for. The skill scores available are: counting aloud (CA) — count aloud as high as possible beginning with the number 1; measurement (M) — identify fundamental measurement concepts (e.g., taller, shorter, higher, lower) using basic shapes; counting objects (CO) — count sets of objects and correctly identify the total number of objects in the set; number naming (NN) — read individual numbers (or shapes) in isolation and rapidly identify the specific number (shape) being viewed; and pattern recognition (PR) — identify patterns using short sequences of basic shapes (i.e., circle, square, and triangle). These skill scores (indicators) are hypothesized to indicate the strength of children’s latent math ability, with higher scores signaling stronger math ability. Figure 1 presents the conceptual model.

The model in Figure 1 suggests that the five skill scores on the right are supposedly results of latent math ability (enclosed by an oval) and that the two exogenous observed variables on the left (RR and age enclosed by rectangles) are predictors of math ability. The two predictors (connected by ↔) are allowed to be correlated but their relationship is not explained in the model. The latent “math ability” variable and the five observed skill scores (enclosed by rectangles) are endogenous in this example. The residual of the latent endogenous variable (residuals of structural equations are also called disturbances) and the residuals (or errors) of the skill variables are considered exogenous because their variances and interrelationships are unexplained in the model. The residuals are indicated by arrows without sources in Figure 1. The effects of RR and age on the five skill scores can also be perceived to be mediated by the latent variable (math ability). This model is an example of a multiple-indicator multiple-cause model (or MIMIC for short, a special case of general SEM model) in which the skill scores are the indicators and age as well as RR are the causes for the latent variable.

Due to the flexibility in model specification, a variety of models can be conceived. However, not all specified models can be identified and estimated. Just like solving equations in algebra where there cannot be more unknowns than knowns, a basic principle of identification is that a model cannot have a larger number of unknown parameters to be estimated than the number of unique pieces of information provided by the data (variances and covariances of observed variables for covariance structure models in which mean structures are not analyzed). **[3]** Because the scale of a latent variable is arbitrary, another basic principle of identification is that all latent variables must be scaled so that their values can be interpreted. These two principles are necessary for identification but they are not sufficient. The issue of model identification is complex. Fortunately, there are some established rules that can help researchers decide if a particular model of interest is identified or not (e.g., Davis, 1993; Reilly & O’Brien, 1996; Rigdon, 1995).

When a model is identified, every model parameter can be uniquely estimated. A model is said to be over-identified if it contains fewer parameters to be estimated than the number of variances and covariances, just-identified when it contains the same number of parameters as the number of variances and covariances, and under-identified if the number of variances and covariances is less than the number of parameters. Parameter estimates of an over-identified model are unique given a certain estimation criterion (e.g., maximum likelihood). All just-identified models fit the data perfectly and have a unique set of parameter estimates. However, a perfect model-data fit is not necessarily desirable in SEM. First, sample data contain random error and a perfect-fitting model may be fitting sampling errors. Second, because conceptually very different just-identified models produce the same perfect empirical fit, the models cannot be evaluated or compared by conventional means (model fit indices discussed below). When a model cannot be identified, either some model parameters cannot be estimated or numerous sets of parameter values can produce the same level of model fit (as in under-identified models). In any event, results of such models are not interpretable and the models require re-specification.
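The counting side of the first identification principle is easy to sketch: with p observed variables, the data supply p(p+1)/2 unique variances and covariances, and the model's degrees of freedom are that count minus the number of free parameters (the example model below is hypothetical):

```python
# Counting rule for identification: df = p*(p+1)/2 - q, where p is the
# number of observed variables and q the number of free parameters.
def model_df(p, q):
    known = p * (p + 1) // 2  # unique variances and covariances in the data
    df = known - q
    status = ("over-identified" if df > 0 else
              "just-identified" if df == 0 else "under-identified")
    return df, status

# e.g., 5 indicators of one factor: 4 free loadings (one fixed to 1),
# 5 error variances, 1 factor variance -> q = 10
print(model_df(5, 10))  # (5, 'over-identified')
```

Recall that passing this count is necessary but not sufficient for identification; the latent variables must also be scaled.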

**Data Characteristics**

Like conventional statistical techniques, score reliability and validity should be considered in selecting measurement instruments for the constructs of interest and sample size needs to be determined preferably based on power considerations. The sample size required to provide unbiased parameter estimates and accurate model fit information for SEM models depends on model characteristics, such as model size as well as score characteristics of measured variables, such as score scale and distribution. For example, larger models require larger samples to provide stable parameter estimates, and larger samples are required for categorical or non-normally distributed variables than for continuous or normally distributed variables. Therefore, data collection should come, if possible, after models of interest are specified so that sample size can be determined a priori. Information about variable distributions can be obtained based on a pilot study or one’s educated guess.

SEM is a large-sample technique. That is, model estimation and statistical inference or hypothesis testing regarding the specified model and individual parameters are appropriate only if the sample size is not too small for the chosen estimation method. A general rule of thumb is that the minimum sample size should be no less than 200 (preferably no less than 400, especially when observed variables are not multivariate normally distributed) or 5–20 times the number of parameters to be estimated, whichever is larger (e.g., Kline, 2005, pp. 111, 178). Larger models often contain a larger number of parameters and hence demand larger samples. Sample size for an SEM analysis can also be determined from a priori power considerations. There are different approaches to power estimation in SEM (e.g., MacCallum, Browne, & Sugawara, 1996, on the root mean square error of approximation [RMSEA] method; Satorra & Saris, 1985; Yung & Bentler, 1999, on bootstrapping; Muthén & Muthén, 2002, on Monte Carlo simulation), but an extended discussion of each is beyond the scope of this module.
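The rule of thumb above can be written as a small helper (a sketch only; the choice of 10 observations per parameter, as a midpoint of the cited 5–20 range, is our own):

```python
def min_sample_size(n_params, multivariate_normal=True):
    """Rule-of-thumb minimum N for SEM (after Kline, 2005): the larger of
    a flat floor (200, or 400 when observed variables are non-normal)
    and a per-parameter multiple; 10 per parameter is used here as a
    midpoint of the commonly cited 5-20 range."""
    floor = 200 if multivariate_normal else 400
    return max(floor, 10 * n_params)

print(min_sample_size(15))  # 200 (the floor dominates for small models)
print(min_sample_size(50))  # 500 (the per-parameter multiple dominates)
```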

**Model Estimation [4]**

A properly specified structural equation model often has some fixed parameters and some free parameters to be estimated from the data. As an illustration, Figure 2 shows the diagram of a conceptual model that predicts reading (READ) and mathematics (MATH) latent ability from observed scores from two intelligence scales, verbal comprehension (VC) and perceptual organization (PO). The latent READ variable is indicated by basic word reading (BW) and reading comprehension (RC) scores. The latent MATH variable is indicated by calculation (CL) and reasoning (RE) scores. The visible paths denoted by directional arrows (from VC and PO to READ and MATH, from READ to BW and RC, and from MATH to CL and RE) and curved arrows (between VC and PO, and between residuals of READ and MATH) in the diagram are free parameters of the model to be estimated, as are residual variances of endogenous variables (READ, MATH, BW, RC, CL, and RE) and variances of exogenous variables (VC and PO). All other possible paths that are not shown (e.g., direct paths from VC or PO to BW, RC, CL, or RE) are fixed to zero and will not be estimated. As mentioned above, the scale of a latent variable is arbitrary and has to be set. The scale of a latent variable can be standardized by fixing its variance to 1. Alternatively, a latent variable can take the scale of one of its indicator variables by fixing the factor loading (the value of the path from a latent variable to an indicator) of one indicator to 1. In this example, the loading of BW on READ and the loading of CL on MATH are fixed to 1 (i.e., they become fixed parameters). That is, when the parameter value of a visible path is fixed to a constant, the parameter is not estimated from the data.

Free parameters are estimated through iterative procedures to minimize a certain discrepancy or fit function between the observed covariance matrix (data) and the model-implied covariance matrix (model). Definitions of the discrepancy function depend on specific methods used to estimate the model parameters. A commonly used normal theory discrepancy function is derived from the maximum likelihood method. This estimation method assumes that the observed variables are multivariate normally distributed or there is no excessive kurtosis (i.e., same kurtosis as the normal distribution) of the variables (Bollen, 1989, p. 417). **[5]**
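For the two-variable case, the maximum likelihood discrepancy function, F = ln|Σ(θ)| + tr(SΣ(θ)⁻¹) − ln|S| − p (the standard normal-theory form; see Bollen, 1989), can be evaluated explicitly. The following is a didactic sketch with hand-coded 2×2 matrix algebra, not a general implementation:

```python
import math

def fml_2x2(S, Sigma):
    """ML fit function F = ln|Sigma| + tr(S Sigma^-1) - ln|S| - p for
    p = 2 observed variables. S (sample) and Sigma (model-implied) are
    symmetric 2x2 matrices given as ((a, b), (b, c))."""
    p = 2
    det = lambda m: m[0][0] * m[1][1] - m[0][1] * m[1][0]
    dS, dSig = det(S), det(Sigma)
    # explicit 2x2 inverse of Sigma
    inv = ((Sigma[1][1] / dSig, -Sigma[0][1] / dSig),
           (-Sigma[1][0] / dSig, Sigma[0][0] / dSig))
    # trace of the matrix product S * inv(Sigma)
    tr = (S[0][0] * inv[0][0] + S[0][1] * inv[1][0]
          + S[1][0] * inv[0][1] + S[1][1] * inv[1][1])
    return math.log(dSig) + tr - math.log(dS) - p

S = ((1.0, 0.5), (0.5, 1.0))
print(round(fml_2x2(S, S), 6))  # 0.0 - a perfectly reproducing model
```

Iterative estimation amounts to searching over the free parameters for the Σ(θ) that minimizes this discrepancy; a model that reproduces S exactly attains F = 0.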

The estimation of a model may fail to converge, or the solution provided may be improper. In the former case, SEM software programs generally stop the estimation process and issue an error message or warning. In the latter, parameter estimates are provided but are not interpretable because some estimates are out of range (e.g., a correlation greater than 1, a negative variance). These problems may arise if the model is ill specified (e.g., not identified), the data are problematic (e.g., the sample size is too small or variables are highly correlated), or both. Multicollinearity occurs when some variables are linearly dependent or strongly correlated (e.g., bivariate correlation > .85). It causes estimation problems in SEM similar to those in multiple regression, and the methods for detecting and solving multicollinearity problems established for multiple regression can also be applied in SEM.

**Model Evaluation**

Once model parameters have been estimated, one would like to make a dichotomous decision, either to retain or reject the hypothesized model. This is essentially a statistical hypothesis-testing problem, with the null hypothesis being that the model under consideration fits the data. The overall model goodness of fit is reflected by the magnitude of discrepancy between the sample covariance matrix and the covariance matrix implied by the model with the parameter estimates (also referred to as the minimum of the fit function or Fmin). Most measures of overall model goodness of fit are functionally related to Fmin. The model test statistic (N – 1)Fmin, where N is the sample size, has a chi-square distribution (i.e., it is a chi-square test) when the model is correctly specified and can be used to test the null hypothesis that the model fits the data. Unfortunately, this test statistic has been found to be extremely sensitive to sample size. That is, the model may fit the data reasonably well but the chi-square test may reject the model because of large sample size.

In reaction to this sample size sensitivity problem, a variety of alternative goodness-of-fit indices have been developed to supplement the chi-square statistic. All of these alternative indices attempt to adjust for the effect of sample size, and many also take into account model degrees of freedom, which is a proxy for model size. Two classes of alternative fit indices, incremental and absolute, have been identified (e.g., Bollen, 1989, p. 269; Hu & Bentler, 1999). Incremental fit indices measure the increase in fit relative to a baseline model (often one in which all observed variables are uncorrelated). Examples include the normed fit index (NFI; Bentler & Bonett, 1980), the Tucker-Lewis index (TLI; Tucker & Lewis, 1973), the relative noncentrality index (RNI; McDonald & Marsh, 1990), and the comparative fit index (CFI; Bentler, 1989, 1990). Higher values of incremental fit indices indicate larger improvement in fit over the baseline model. Values in the .90s (or, more recently, ≥ .95) are generally accepted as indicating good fit.

In contrast, absolute fit indices measure the extent to which the specified model of interest reproduces the sample covariance matrix. Examples of absolute fit indices include Jöreskog and Sörbom’s (1986) goodness-of-fit index (GFI) and adjusted GFI (AGFI), standardized root mean square residual (SRMR; Bentler, 1995), and the RMSEA (Steiger & Lind, 1980). Higher values of GFI and AGFI as well as lower values of SRMR and RMSEA indicate better model-data fit.

SEM software programs routinely report a handful of goodness-of-fit indices. Some of these indices work better than others under certain conditions. It is generally recommended that multiple indices be considered simultaneously when overall model fit is evaluated. For instance, Hu and Bentler (1999) proposed a 2-index strategy, that is, reporting SRMR along with one of the fit indices (e.g., RNI, CFI, or RMSEA). The authors also suggested the following criteria for an indication of good model-data fit using those indices: RNI (or CFI) ≥ .95, SRMR ≤ .08, and RMSEA ≤ .06. Despite the sample size sensitivity problem with the chi-square test, reporting the model chi-square value with its degrees of freedom in addition to the other fit indices is recommended.
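The statistics discussed above follow directly from the minimized fit function and the model and baseline chi-squares. A minimal sketch of the standard formulas (the sample size N = 200 and the baseline chi-square used in the example are illustrative assumptions, not values reported in the article):

```python
import math

def model_chi_square(fmin, n):
    """Model test statistic (N - 1) * Fmin."""
    return (n - 1) * fmin

def rmsea(chi2, df, n):
    """Root mean square error of approximation:
    sqrt(max(chi2 - df, 0) / (df * (N - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative fit index: 1 minus the ratio of the model's
    noncentrality, max(chi2 - df, 0), to the baseline model's."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, d_m)
    return 1.0 if d_b == 0 else 1.0 - d_m / d_b

# With the math ability example's chi-square(13) = 21.21 and an assumed
# N of 200, RMSEA lands near the recommended .06 cutoff:
print(round(rmsea(21.21, 13, 200), 3))  # 0.056
```

Because RMSEA and CFI both subtract the degrees of freedom before scaling, they penalize model size and are far less sensitive to N than the raw chi-square test.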

Because some solutions may be improper, it is prudent for researchers to examine individual parameter estimates as well as their estimated standard errors. Unreasonable magnitude (e.g., correlation > 1) or direction (e.g., negative variance) of parameter estimates or large standard error estimates (relative to others that are on the same scale) are some indications of possible improper solutions.

If a model fits the data well and the estimation solution is deemed proper, individual parameter estimates can be interpreted and examined for statistical significance (whether they are significantly different from zero). The test of individual parameter estimates for statistical significance is based on the ratio of the parameter estimate to its standard error estimate (often called z-value or t-value). As a rough reference, absolute value of this ratio greater than 1.96 may be considered statistically significant at the .05 level. Although the test is proper for unstandardized parameter estimates, standardized estimates are often reported for ease of interpretation. In growth models and multiple-sample analyses in which different variances over time or across samples may be of theoretical interest, unstandardized estimates are preferred.
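The significance screen described here is simple arithmetic; as a sketch (function names are our own):

```python
def z_value(estimate, se):
    """Ratio of an unstandardized parameter estimate to its estimated
    standard error (often labeled z or t in SEM output)."""
    return estimate / se

def roughly_significant(estimate, se, critical=1.96):
    """True if |estimate / se| exceeds the two-tailed .05 critical value."""
    return abs(z_value(estimate, se)) > critical

print(roughly_significant(0.46, 0.10))  # True  (z = 4.6)
print(roughly_significant(0.05, 0.10))  # False (z = 0.5)
```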

As an example, Table 1 presents the simple descriptive statistics of the variables for the math ability example (Figure 1), and Table 2 provides the parameter estimates (standardized and unstandardized) and their standard error estimates. This model fit the sample data reasonably well as indicated by the selected overall goodness-of-fit statistics: χ²_{13} = 21.21, p = .069, RMSEA = .056 (<.06), CFI = .99 (>.95), SRMR = .032 (<.08). The model solution is considered proper because there are no out-of-range parameter estimates and the standard error estimates are of similar magnitude (see Table 2). All parameter estimates are considered large (not likely zero) because the ratios of the unstandardized parameter estimates to their standard errors (i.e., z-values or t-values) are greater than |2| (Kline, 2005, p. 41). Standardized factor loadings in measurement models should fall between 0 and 1, with higher values indicating that the observed variables are better indicators of the latent variable. All standardized loadings in this example are in the neighborhood of .7, showing that they are satisfactory indicators of the latent construct of math ability. Coefficients for the structural paths are interpreted in the same way as regression coefficients. The standardized coefficient of .46 for the path from age to math ability suggests that as children grow older by one standard deviation of age in months (about 6.7 months), their math ability is expected to increase by .46 standard deviation, holding RR constant. The standardized value of .40 for the path from RR to math ability reveals that for every standard deviation increase in RR, math ability is expected to increase by .40 standard deviation, holding age constant. The standardized residual variance of .50 for the latent math variable indicates that approximately 50% of the variance in math is unexplained by age and RR.
Similarly, standardized residual or error variances of the math indicator variables are taken as the percentages of their variances unexplained by the latent variable.
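The interpretations above involve two routine conversions, sketched here with the article's standardized values and otherwise hypothetical standard deviations (sd_math = 12 is an assumption for illustration; only sd_age ≈ 6.7 months comes from the text):

```python
def raw_coefficient(beta_std, sd_outcome, sd_predictor):
    """Convert a standardized path coefficient to raw units:
    b = beta_std * (sd_outcome / sd_predictor)."""
    return beta_std * sd_outcome / sd_predictor

def variance_explained(std_residual_variance):
    """Proportion of an endogenous variable's variance accounted for by
    its predictors: 1 minus the standardized residual variance."""
    return 1.0 - std_residual_variance

# With the standardized residual variance of .50 reported for latent math:
print(variance_explained(0.50))  # 0.5
# The standardized age -> math path of .46 in raw units, assuming
# (hypothetically) sd_math = 12 and sd_age = 6.7 months:
print(round(raw_coefficient(0.46, 12, 6.7), 2))  # 0.82
```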

**Model Modification, Alternative Models, and Equivalent Models**

When the hypothesized model is rejected based on goodness-of-fit statistics, SEM researchers are often interested in finding an alternative model that fits the data. Post hoc modifications (or model trimming) are often aided by modification indices, sometimes in conjunction with expected parameter change statistics. The modification index estimates the magnitude of the decrease in model chi-square (for 1 degree of freedom), whereas the expected parameter change approximates the size of the change in the parameter estimate, when a certain fixed parameter is freely estimated. A large modification index (>3.84) suggests that a large improvement in model fit, as measured by chi-square, can be expected if that fixed parameter is freed. The decision to free a fixed parameter is less likely to be affected by chance if it is based on both a large modification index and a large expected parameter change value.
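The modification index is judged against the 1-df chi-square critical value, and the improvement it predicts is the actual chi-square difference between the restricted and freed models. A sketch using the numbers from the READ/MATH illustration:

```python
CRITICAL_1DF = 3.84  # chi-square critical value at .05 with 1 df

def worth_freeing(modification_index):
    """A modification index above 3.84 predicts a significant (1-df)
    improvement in model chi-square if the fixed parameter is freed."""
    return modification_index > CRITICAL_1DF

def actual_chi_square_change(chi2_restricted, chi2_freed):
    """Realized drop in model chi-square after freeing one parameter; the
    modification index is an a priori estimate of this quantity."""
    return chi2_restricted - chi2_freed

# READ/MATH illustration: MI = 33.03 predicted the drop; the model
# chi-square actually fell from 45.30 to 8.63:
print(worth_freeing(33.03))                              # True
print(round(actual_chi_square_change(45.30, 8.63), 2))   # 36.67
```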

As an illustration, Table 3 shows the simple descriptive statistics of the variables for the model of Figure 2, and Table 4 provides the parameter estimates (standardized and unstandardized) and their standard error estimates. Had one restricted the residuals of the latent READ and MATH variables to be uncorrelated, the model would not fit the sample data well as suggested by some of the overall model fit indices: χ²_{6} = 45.30, p < .01, RMSEA = .17 (>.10), SRMR = .078 (acceptable because it is < .08). The solution was also improper because there was a negative error variance estimate. The modification index for the covariance between the residuals of READ and MATH was 33.03 with unstandardized expected parameter change of 29.44 (standardized expected change = .20). There were other large modification indices. However, freeing the residual covariance between READ and MATH was deemed most justifiable because the relationship between these two latent variables was not likely fully explained by the two intelligence subtests (VC and PO). The modified model appeared to fit the data quite well (χ²_{5} = 8.63, p = .12, RMSEA = .057, SRMR = .017). The actual chi-square change from 45.30 to 8.63 (i.e., 36.67) was slightly different from the estimated change (33.03), as was the actual parameter change (31.05 vs. 29.44; standardized value = .21 vs. .20). The differences between the actual and estimated changes are slight in this illustration because only one parameter was changed. Because parameter estimates are not independent of each other, the actual and expected changes may be very different if multiple parameters are changed simultaneously, or the order of change may matter if multiple parameters are changed one at a time. In other words, different final models can potentially result when the same initial model is modified by different analysts.

As a result, researchers are warned against making a large number of changes and against making changes that are not supported by strong substantive theories (e.g., Byrne, 1998, p. 126). Changes made based on modification indices may not lead to the “true” model in a large variety of realistic situations (MacCallum, 1986; MacCallum, Roznowski, & Necowitz, 1992). The likelihood of success of post hoc modification depends on several conditions: It is higher if the initial model is close to the “true” model, the search continues even when a statistically plausible model is obtained, the search is restricted to paths that are theoretically justifiable, and the sample size is large (MacCallum, 1986). Unfortunately, whether the initially hypothesized model is close to the “true” model is never known in practice. Therefore, one can never be certain that the modified model is closer to the “true” model.

Moreover, post hoc modification changes the confirmatory approach of SEM. Instead of confirming or disconfirming a theoretical model, modification searches can easily turn modeling into an exploratory expedition. The model that results from such searches often capitalizes on chance idiosyncrasies of sample data and may not generalize to other samples (e.g., Browne & Cudeck, 1989; Tomarken & Waller, 2003). Hence, not only is it important to explicitly account for the specifications made post hoc (e.g., Tomarken & Waller, 2003), but it is also crucial to cross-validate the final model with independent samples (e.g., Browne & Cudeck, 1989).

Rather than data-driven post hoc modification, it is often more defensible to consider multiple alternative models a priori. That is, multiple models (e.g., based on competing theories or different sides of an argument) should be specified prior to model fitting, and the best fitting model is selected from among the alternatives. Jöreskog (1993) discussed these modeling strategies more formally, referring to the practice of post hoc modification as model generating, to the a priori consideration of different models as alternative models, and to the simple retention or rejection of a single hypothesized model as strictly confirmatory.

Because just-identified models fit the data perfectly regardless of their particular specifications, different just-identified models (sub-models or entire models) specified for the same set of variables are considered equivalent. Equivalent models may have very different implications yet produce identical model-data fit. For instance, predicting verbal ability from quantitative ability may be equivalent to predicting quantitative ability from verbal ability, or to reciprocal effects of equal strength between verbal and quantitative ability. In other words, the direction of causal hypotheses cannot be ruled out (or determined) on empirical grounds with cross-sectional data; it must rest on theoretical foundations, experimental control, or, if longitudinal data are available, time precedence. See MacCallum, Wegener, Uchino, and Fabrigar (1993) and Williams, Bozdogan, and Aiman-Smith (1996) for more detailed discussions of the problems and implications of equivalent models. Researchers are encouraged to consider models that may be empirically equivalent to their selected final model(s) before making any substantial claims. See Lee and Hershberger (1990) for ideas on generating equivalent models.

**Causal Relations**

Although SEM allows the testing of causal hypotheses, a well fitting SEM model does not and cannot prove causal relations without satisfying the necessary conditions for causal inference, partly because of the problems of equivalent models discussed above. The conditions necessary to establish causal relations include time precedence and robust relationship in the presence or absence of other variables (see Kenny, 1979, and Pearl, 2000, for more detailed discussions of causality). A selected well-fitting model in SEM is like a retained null hypothesis in conventional hypothesis testing. It remains plausible among perhaps many other models that are not tested but may produce the same or better level of fit. SEM users are cautioned not to make unwarranted causal claims. Replications of findings with independent samples are essential especially if the models are obtained based on post hoc modifications. Moreover, if the models are intended to be used in predicting future behaviors, their utility should be evaluated in that context.

**Software Programs**

Most SEM analyses are conducted using one of the specialized SEM software programs. However, there are many options, and the choice is not always easy. Below is a list of the commonly used programs for SEM. Special features of each program are briefly discussed. It is important to note that this list of programs and their associated features is by no means comprehensive. This is a rapidly changing area and new features are regularly added to the programs. Readers are encouraged to consult the web sites of software publishers for more detailed information and current developments.

**LISREL**

LISREL (linear structural relationships) is one of the earliest SEM programs and perhaps the most frequently referenced program in SEM articles. Its version 8 (Jöreskog & Sörbom, 1996a, 1996b) has three components: PRELIS, SIMPLIS, and LISREL. PRELIS (pre-LISREL) is used in the data preparation stage when raw data are available. Its main functions include checking distributional assumptions, such as univariate and multivariate normality, imputing data for missing observations, and calculating summary statistics, such as Pearson covariances for continuous variables, polychoric or polyserial correlations for categorical variables, means, or asymptotic covariance matrix of variances and covariances (required for asymptotically distribution-free estimator or Satorra and Bentler’s scaled chi-square and robust standard errors; see footnote 5). PRELIS can be used as a stand-alone program or in conjunction with other programs. Summary statistics or raw data can be read by SIMPLIS or LISREL for the estimation of SEM models. The LISREL syntax requires the understanding of matrix notation while the SIMPLIS syntax is equation-based and uses variable names defined by users. Both LISREL and SIMPLIS syntax can be built through interactive LISREL by entering information for the model construction wizards. Alternatively, syntax can be built by drawing the models on the Path Diagram screen. LISREL 8.7 allows the analysis of multilevel models for hierarchical data in addition to the core models. A free student version of the program, which has the same features as the full version but limits the number of observed variables to 12, is available from the web site of Scientific Software International, Inc. (http://www.ssicentral.com). This web site also offers a list of illustrative examples of LISREL’s basic and new features.

**EQS**

Version 6 (Bentler, 2002; Bentler & Wu, 2002) of EQS (Equations) provides many general statistical functions including descriptive statistics, t-test, ANOVA, multiple regression, nonparametric statistical analysis, and EFA. Various data exploration plots, such as scatter plot, histogram, and matrix plot are readily available in EQS for users to gain intuitive insights into modeling problems. Similar to LISREL, EQS allows different ways of writing syntax for model specification. The program can generate syntax through the available templates under the “Build_EQS” menu, which prompts the user to enter information regarding the model and data for analysis, or through the Diagrammer, which allows the user to draw the model. Unlike LISREL, however, data screening (information about missing pattern and distribution of observed variables) and model estimation are performed in one run in EQS when raw data are available. Model-based imputation that relies on a predictive distribution of the missing data is also available in EQS. Moreover, EQS generates a number of alternative model chi-square statistics for non-normal or categorical data when raw data are available. The program can also estimate multilevel models for hierarchical data. Visit http://www.mvsoft.com for a comprehensive list of EQS’s basic functions and notable features.

**Mplus**

Version 3 (Muthén & Muthén, 1998–2004) of the Mplus program includes a Base program and three add-on modules. The Mplus Base program can analyze almost all single-level models that can be estimated by other SEM programs. Unlike LISREL or EQS, Mplus version 3 is mostly syntax-driven and does not produce model diagrams. Users can interact with the Mplus Base program through a language generator wizard, which prompts users to enter data information and select the estimation and output options. Mplus then converts the information into its program-specific syntax. However, users have to supply the model specification in the Mplus language themselves. Mplus Base also offers a robust option for non-normal data and a special full-information maximum likelihood estimation method for missing data (see footnote 4). With the add-on modules, Mplus can analyze multilevel models and models with latent categorical variables, such as latent class and latent profile analysis. The modeling of latent categorical variables in Mplus is so far unrivaled by other programs. The official web site of Mplus (http://www.statmodel.com) offers a comprehensive list of resources including basic features of the program, illustrative examples, online training courses, and a discussion forum for users.

**Amos**

Amos (analysis of moment structure) version 5 (Arbuckle, 2003) is distributed with SPSS (SPSS, Inc., 2006). It has two components: Amos Graphics and Amos Basic. Similar to the LISREL Path Diagram and SIMPLIS syntax, respectively, Amos Graphics permits the specification of models by diagram drawing whereas Amos Basic allows the specification from equation statements. A notable feature of Amos is its capability for producing bootstrapped standard error estimates and confidence intervals for parameter estimates. An alternative full-information maximum likelihood estimation method for missing data is also available in Amos. The program is available at http://www.smallwaters.com or http://www.spss.com/amos/.

**Mx**

Mx (Matrix) version 6 (Neale, Boker, Xie, & Maes, 2003) is a free program downloadable from http://www.vcu.edu/mx/. The Mx Graph version is for Microsoft Windows users. Users can provide model and data information through the Mx programming language. Alternatively, models can be drawn in the drawing editor of the Mx Graph version and submitted for analysis. Mx Graph can calculate confidence intervals and statistical power for parameter estimates. Like Amos and Mplus, a special form of full-information maximum likelihood estimation is available for missing data in Mx.

**Others**

In addition to SPSS, several other general statistical software packages offer built-in routines or procedures that are designed for SEM analyses. They include the CALIS (covariance analysis and linear structural equations) procedure of SAS (SAS Institute Inc., 2000; http://www.sas.com/), the RAMONA (reticular action model or near approximation) module of SYSTAT (Systat Software, Inc., 2002; http://www.systat.com/), and SEPATH (structural equation modeling and path analysis) of Statistica (StatSoft, Inc., 2003; http://www.statsoft.com/products/advanced.html).

**Summary**

This module has provided a cursory tour of SEM. Despite its brevity, it has highlighted the most relevant and important considerations in applying SEM. Most specialized SEM software programs have become very user-friendly, which can be either a blessing or a curse: many SEM novices believe that SEM analysis is nothing more than drawing a diagram and pressing a button. The goal of this module is to alert readers to the complexity of SEM. The journey can be exciting because of SEM's versatility, yet frustrating because the first ride is not necessarily smooth for everyone. Some may run into data problems, such as missing data, non-normality of observed variables, or multicollinearity; estimation problems that could be due to data problems or to identification problems in model specification; or interpretation problems due to unreasonable estimates. When problems arise, SEM users need to know how to troubleshoot systematically and ultimately solve them. Although individual problems vary, there are common sources and potential solutions informed by the literature. For a rather comprehensive list of references by topic, visit http://www.upa.pdx.edu/IOA/newsom/semrefs.htm. Serious SEM users should stay abreast of current developments, as SEM is still growing in its estimation techniques and expanding in its applications.

**Notes**

**[1]** When a model involves feedback or reciprocal relations or correlated residuals, it is said to be nonrecursive; otherwise the model is recursive. The distinction between recursive and nonrecursive models is important for model identification and estimation.

**[2]** The term error variance is often used interchangeably with unique variance (that which is not common variance). In measurement theory, unique variance consists of both “true unique variance” and “measurement error variance,” and only measurement error variance is considered the source of unreliability. Because the two components of unique variance are not separately estimated in measurement models, they are simply called “error” variance.

**[3]** This principle of identification in SEM is also known as the t-rule (Bollen, 1989, pp. 93, 242). Given p observed variables in any covariance-structure model, the number of variances and covariances is p(p+1)/2. The parameters to be estimated include factor loadings of measurement models, path coefficients of structural relations, and variances and covariances of exogenous variables, including those of residuals. In the math ability example, the number of observed variances and covariances is 7(8)/2 = 28 and the number of parameters to be estimated is 15 (5 loadings + 2 path coefficients + 3 variances–covariances among predictors + 6 residual variances – 1 to set the scale of the latent factor). Because 28 is greater than 15, the model satisfies the t-rule.

**[4]** It is not uncommon to have missing observations in any research study. Provided data are missing completely at random, common ways of handling missing data, such as imputation, pairwise deletion, or listwise deletion, can be applied. However, pairwise deletion may create estimation problems for SEM because a covariance matrix computed from different numbers of cases may be singular or some estimates may be out of bounds. Recent versions of some SEM software programs offer a special maximum likelihood estimation method (referred to as full-information maximum likelihood), which uses all available data for estimation and requires no imputation. This option is logically appealing because there is no need to make additional assumptions for imputation and no observations are lost. It has also been found to work better than listwise deletion in simulation studies (Kline, 2005, p. 56).

**[5]** When this distributional assumption is violated, parameter estimates may still be unbiased (if the proper covariance or correlation matrix is analyzed, that is, Pearson for continuous variables, or polychoric or polyserial correlations when categorical variables are involved), but their estimated standard errors will likely be underestimated and the model chi-square statistic will be inflated. In other words, when the distributional assumption is violated, statistical inference may be incorrect. Estimation methods that make no distributional assumptions (e.g., the asymptotically distribution-free estimator, or weighted least squares based on the full asymptotic variance–covariance matrix of the estimated variances and covariances) are available, but they often require unrealistically large sample sizes to work satisfactorily (N > 1,000). When the sample size is not that large, a viable alternative is to request robust estimation from some SEM software programs (e.g., LISREL 8, EQS 6, Mplus 3), which adjusts the chi-square statistic and standard error estimates according to the severity of non-normality (Satorra & Bentler, 1994). Statistical inference based on the adjusted statistics has been found to work quite satisfactorily provided the sample size is not too small.

**Introduction**

Processing speed, as Rindermann et al. (**2011**) describe it, is assumed to have a biological basis. This may explain why ECTs are less amenable to learning and personality factors (Rindermann & Neubauer, **2001**; Jensen, **2006**, pp. 175-178), to Flynn effects (Nettelbeck & Wilson, **2004**; Woodley et al., **2013**), and to education gains (Ritchie & Bates, **2013**), and also why, owing to their knowledge-free nature, ECTs show no improvement from repeated retesting while conventional IQ tests are much more (positively) affected by retesting even when the g component is **not affected**. This could also explain why speed mediates the relationship between IQ and death whereas smoking, education, and social class make only a small contribution in comparison (Deary & Der, **2005**). Still on the biological question, Penke et al. (**2012**) noted that the correlation between indicators of white matter tract integrity and g was mediated by processing speed. This sheds some light on the relationship between speed and g. Curiously, however, Johnson & Deary (**2011**) suggest that speed can correlate with g through specific cognitive abilities, to the extent that speed may not be related to specific abilities through g, which seems coherent with models treating g as an emergent construct, such as mutualism (van der Maas, **2006**). This result is ambiguous because it is difficult to determine a clear winner between the models; but if true, it may raise some concerns about the view of g as a latent construct (or factor).

Given this, it would nonetheless be interesting to investigate all possible mediations due to processing speed and, better still, to investigate whether the causal chain from speed of processing to IQ to achievement is produced by genetic effects. This question was addressed directly in Luo et al.'s (**2003a**, **2003b**) SEM-based studies. There was evidence that processing speed (represented by the Cognitive Abilities Test (CAT-g) chronometric measures) mediates g (WISC-g) in predicting scholastic performance (on the MAT). Some tests of the CAT measured components unrelated to mental speed. When these so-called ‘percent-correct’ (non-chronometric) variables were removed from the CAT-g, making it a purely chronometric measure, the strength of its mediation showed only a slight decline. They also found that processing speed relates to scholastic achievement mainly through genetic pathways.

Rohde & Thompson (**2007**) used multiple regression to assess the independent contribution of processing speed (as independent variable) to achievement measures such as GPA, WRAT-III, SAT combined, SAT-math, and SAT-verbal (as dependent variables), controlling for the Raven and Mill Hill Vocabulary Scales as measures of cognitive ability. In predicting these achievement measures, processing speed accounted for only a small increase in R²; its contribution was strong only for SAT-math (R²=0.132). Other research (Vock et al., **2011**; Rindermann & Neubauer, **2004**) arrived at a similar conclusion: speed partially mediated IQ in predicting achievement, leaving little room for an independent link. Still, while speed mediates intelligence, it also has nontrivial predictive validity above the effect of crystallized IQ in predicting scholastic achievement, as noted by Luo et al. (**2006**). Finally, Dodonova & Dodonov (**2013**) found a different result: the hypothesis that IQ and speed each make a unique contribution to school achievement seemed to explain their data better.

With regard to the contribution of processing speed to the black-white IQ difference, only the Pesta & Poznanski (**2008**) study comes to mind. They administered RT and IT tests, yielding four variables: mean RT and IT, as well as RTSD and ITSD, the standard deviations of the RT and IT scores, also called intra-individual variability. They factor analyzed these four variables to produce an ECT factor score. This ECT factor partially mediated the BW difference in the Wonderlic Personnel Test, accounting for 49% of the gap. It did not, however, mediate the BW difference in GPA.

Methods.

Of use here is the multiple regression method, with age and gender always controlled (entered in Model 1, or Step 1 in SPSS). Speed is included in Model 1, with g added in Model 2; in this way, we can see how much the independent, unique contribution of speed diminishes. Subsequently, g is included in Model 1 and speed in Model 2, so we can see how much the independent contribution of g changes from Model 1 to Model 2. The adjusted R² (because it is less biased) is also displayed; it shows the increment in variance explained by Model 2 over Model 1.
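The hierarchical (stepwise block) logic above can be sketched with ordinary least squares on synthetic data. This is an illustrative sketch only: the variable names and effect sizes are my assumptions, not the NLSY values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Invented data for illustration: covariates, a g factor, and a speed
# composite that correlates with g, plus an outcome loading on both.
age = rng.normal(17, 1.5, n)
gender = rng.integers(0, 2, n).astype(float)
g = rng.normal(0, 1, n)
speed = 0.6 * g + rng.normal(0, 0.8, n)
gpa = 0.5 * g + 0.15 * speed + rng.normal(0, 1, n)

def adj_r2(y, *predictors):
    """Fit OLS via least squares and return the adjusted R-squared."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - resid.var() / y.var()
    k = X.shape[1] - 1  # number of predictors, excluding the intercept
    return 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)

# Model 1: covariates + speed; Model 2 adds g. The increment in
# adjusted R-squared shows g's unique contribution above speed.
m1 = adj_r2(gpa, age, gender, speed)
m2 = adj_r2(gpa, age, gender, speed, g)
print(f"adjusted R2 increment for g: {m2 - m1:.3f}")
```

Swapping the entry order of `speed` and `g` gives the complementary increment for speed above g, which is how the tables below were produced.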

Speed is measured here by standardizing and then averaging Numerical Operations and Coding Speed, the two speeded tests of the ASVAB. The remaining subtests are factor analyzed to yield a g factor. As always, outliers have been removed, i.e., cases with a z-score equal to or less than −3.
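The construction of the speed composite and the outlier rule can be sketched as follows; the raw scores here are invented stand-ins for the two ASVAB subtests, not the actual NLSY scales.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical raw subtest scores standing in for Numerical Operations
# and Coding Speed (arbitrary means/SDs, for illustration only).
num_ops = rng.normal(50, 10, 1000)
coding = rng.normal(30, 6, 1000)

def zscore(x):
    return (x - x.mean()) / x.std()

# Standardize each speeded subtest, then average into a composite.
speed = (zscore(num_ops) + zscore(coding)) / 2

# Drop outliers: cases whose composite z-score is -3 or below.
keep = zscore(speed) > -3
speed_clean = speed[keep]
print(len(speed) - keep.sum(), "outliers removed")
```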

The other variables of use are **PIAT-math** (the ‘updated’ standard score variable), **SAT-verbal**, **SAT-math**, **overall_GPA**, and the **ASVAB**.

Results.

Data and syntax below:

**On the partial mediating role of processing speed between black-white differences, IQ and GPA (EXCEL)**

**On the partial mediating role of processing speed between black-white differences, IQ and GPA (NLSY79 syntax)**

**On the partial mediating role of processing speed between black-white differences, IQ and GPA (NLSY97 syntax)**

In the NLSY97, the BW (beta) coefficient was 0.491 in Model 1, decreasing to 0.369 when the speed factor is added in Model 2. Speed thus accounts for only 1−(0.369/0.491) = 0.248 of the initial gap. The respective number for GPA as dependent variable was 0.310. In the NLSY79, on the other hand, the speed factor accounted for a much larger share of the BW gap in g: 1−(0.281/0.495) = 0.432. This number is fairly close to what Pesta & Poznanski found. In Model (Step) 2, the increment in adjusted R² from the inclusion of speed is about 27% in the NLSY79 and 25% in the NLSY97. For overall GPA (in the NLSY97) the increment in R² amounts to only 12%.
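The proportion-of-gap arithmetic used throughout these paragraphs can be made explicit; the helper name is mine, and the inputs are the standardized coefficients reported above.

```python
def gap_accounted(beta_with_mediator, beta_without):
    """Proportion of a standardized group gap accounted for by a mediator:
    1 minus the ratio of the coefficient after adding the mediator to the
    coefficient before adding it."""
    return 1 - beta_with_mediator / beta_without

# NLSY97: BW coefficient drops from 0.491 to 0.369 once speed is added.
print(round(gap_accounted(0.369, 0.491), 3))
# NLSY79: the drop from 0.495 to 0.281 implies a larger share.
print(round(gap_accounted(0.281, 0.495), 3))
```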

Not shown in the spreadsheet: in the NLSY97, the result for the hispanic-white (HW) gap in g is similar, with 1−(0.239/0.350) = 0.317 of the gap accounted for by speed. The number for the black-hispanic (BH) gap is smaller: 1−(0.157/0.172) = 0.087. For the NLSY79, the HW gap diminished by 1−(0.200/0.339) = 0.410, and the BH gap by 1−(0.060/0.147) = 0.592.

Concerning the independent contribution of speed in predicting GPA when controlling for g, this unique effect is greatly reduced in the white, black, and hispanic samples alike: speed has only modest predictive validity above g. The increment in adjusted R² is generally about 1-2% with g controlled, for all groups. The increment in the same R² is about 2%, 7%, and 4% for blacks, hispanics, and whites, respectively, when controlling for SAT-verbal. The respective numbers for SAT-math are 1%, 3%, and 1%, and for PIAT-math, 3%, 3%, and 4%. When SAT (verbal or math) is entered in Model 2 instead, the increment in R² shows stronger incremental validity; the R² increment for PIAT-math, however, is similar to that obtained for speed.

Limitations.

Unlike path analysis, multiple regression cannot decompose the correlations into direct and indirect effects to give a clear picture of how these relationships actually work by examining all the possible indirect paths. But this is the least of the problems.

One might question whether psychometric processing speed is comparable to chronometric processing speed as assessed by RT and IT. Coyle (**2011**) used these subtests as a measure of processing speed and found that the speed factor mediates the effect of age on g; that is, a model with no mediation fitted much worse than a model positing speed as a mediator between age and g. This is curious because NO and CS consistently have among the lowest g-loadings of the ASVAB subtests, together with Auto & Shop Info (Jensen, **1985**, table 5; Hu, **Sept.21.2013**). Besides, Jensen (**1998**, pp. 224-225, 236; **2006**, pp. 157, 178) tells us the following about the ASVAB:

Some psychometricians have mistakenly believed that RT measures the same speed factor that is measured by highly speeded psychometric tests, such as clerical checking, number series comparisons, and simple arithmetic. In fact, such tests have lower correlations with RT than do nonspeeded power tests. The two most speeded subtests out of the ten subtests of the Armed Services Vocational Aptitude Battery (ASVAB), for example, have repeatedly shown the lowest correlations with RT, yet these tests are typically identified with the speed factor that appears in factor analyses of various speeded and nonspeeded psychometric tests.

The fact is that psychometric speed – better called test-taking speed – is something entirely different from the speed of information processing measured by RT or IT. RT and IT have their highest correlations with pure power tests. The explanation for this seeming paradox is that the speed of information processing is a large part of g, whereas test-taking speed is not – it is more a personality factor than a cognitive factor. One of my studies found that the time taken by university students to complete the Raven’s matrices, when instructed to take all the time they need and to attempt every item, was not significantly correlated with their Raven scores (number right), nor was test-taking time significantly correlated with RT, but it was significantly correlated (r = -.45) with Extraversion as measured by the Eysenck Personality Inventory, which was not significantly correlated with RT. The personality trait of “conscientiousness” is probably also related to test-taking speed, but this has not yet been investigated. In all such correlations involving time, the variable of age must be controlled, as both test-taking speed and RT gradually change for the “worse” with increasing age beyond early adulthood. [31] There is, of course, a wide range of individual differences in the rates of this change with aging, which has the effect of increasing the correlations between all speeded tests in elderly people.

If true, the above finding may not be valid. Unfortunately, I don’t have much information on this issue that would settle the question, apart from the small study by Larson (**1988**, Table 4), in which the Numerical Operations and Coding Speed subtests did not correlate with IT or RT and correlated only poorly with some other speed measures, in contrast to the other ASVAB subtests. According to Danthiir et al. (**2005**, **2012**), however, psychometric speed tests seem to be strongly related (near unity) to a general mental speed factor derived from several ECTs.

On the other hand, if we think these speeded tests are quite comparable to what RT and IT actually measure, there is a good chance that the strength of the mediation accounted for by speed is under-estimated here: involving more tests in the construction of a (latent) factor would decrease error variance. Grandy et al. (**2013**) made this specific point to explain why studies on the correlation between intelligence and individual alpha peak frequency (IAF) yielded so many confusing and conflicting results. A better example is illustrated by Betjemann et al. (**2010**). Contra Leeuwen et al. (**2009**), they found a positive relationship between processing speed and brain volume; the difference is that the former computed a speed factor from 4 measures while the latter used only 2.

(One last thing: it must be noted that RT is devoid of information or cultural content, unlike most paper-and-pencil IQ tests, and the ASVAB is no exception. Even if speed partially mediates the BW difference, it is unlikely to be relevant to Spearman’s hypothesis.)

Conclusion.

There is a need to distinguish between psychometric and chronometric speed. Coyle (**2011**) is a good example of this whole problem, because if we accept that distinction, one would wonder why Kail (**2007**), Nettelbeck (**2010**), and some earlier studies (e.g., Fry & Hale, **1996**; Jensen, **2006**, pp. 91-94, 104) found the same results as those reported by Coyle, with the difference that they also used chronometric speed tests.

Generally speaking, studies of the questions surrounding processing speed have sometimes yielded heterogeneous results, depending on the specific topic: probably because the latent speed factor is poorly and inconsistently constructed across studies, or because of different procedures (such as using individual measures as such versus latent factors built from them), but perhaps more generally because they use different measures which may have different properties.

Kevin M. Beaver, John Paul Wright (2011)

Abstract

Research has consistently revealed that average IQ scores vary significantly across macro-level units, such as states and nations. The reason for this variation in IQ, however, has remained at the center of much controversy. One of the more provocative explanations is that IQ across macro-level units is the result of genetic differences, but empirical studies have yet to examine this possibility directly. The current study partially addresses this gap in the literature by examining whether average IQ scores across thirty-six schools are associated with differences in the allelic distributions of dopaminergic polymorphisms across schools. Analysis of data drawn from subjects (ages 12–19 years) participating in the National Longitudinal Study of Adolescent Health provides support in favor of this perspective, where variation in school-level IQ scores was predicted by school-level genetic variation. This association remained statistically significant even after controlling for the effects of race.

1. Introduction

Substantial variation exists across macro-level units for virtually every measurable characteristic. For example, research has revealed that indicators of wealth, measures of health, and crime rates vary significantly across neighborhoods, states, and nations (Beaver & Wright, 2011; Kanazawa, 2008; McDaniel, 2006; Pesta, McDaniel, & Bertsch, 2010). This same line of research has also documented that variation and inequality tend to be the most pronounced at the level of the nation. Stated simply, some nations are rich and others are poor; some nations are healthy and others are not; some nations have high rates of crime and others have low rates of crime (Braithwaite, 1989). The question that has plagued researchers, however, is what accounts for such disparities. Most of the explanations that have been advanced to explain nation-level differences have focused on culture, socialization, access to resources, and other socio-environmental factors (e.g., Diamond, 1997; Messner & Rosenfeld, 1994).

Perhaps the most controversial explanation for inequality across nations was advanced in Lynn and Vanhanen’s (2002) book, IQ and a Wealth of Nations. In this book Lynn and Vanhanen empirically examined the association between the average IQ of the nation and measures of wealth. The result of their analyses revealed a statistically significant association, where nations with higher average IQ scores tended to have more wealth than nations with lower IQ scores. More recently, they expanded their analyses and examined whether nation-level IQ scores were related to other measures of inequalities, such as educational level, life expectancy, and literacy rates (Lynn & Vanhanen, 2006). Their results once again indicated a statistically significant association between IQ and an assortment of measures of inequality.

With evidence mounting in favor of the position that nation-level IQ scores are related to inequality across nations, the next logical question to ask is what accounts for variation in IQ across nations? Lynn and Vanhanen (2002, 2006) (see also Hart, 2007; Rushton, 1997) advanced a very provocative and controversial claim that variation in nation-level IQ scores is produced by genetic variation across nations. Much of the evidence that they cite and discuss in relation to this claim, however, centers on heritability estimates that were generated using data at the individual level. For example, Lynn and Vanhanen (2006) describe the results of twin studies showing that the heritability of IQ is about .75, meaning that about 75% of the variance in IQ is due to genetic factors. Although the evidence indicating that variation in individual-level IQ scores is due largely to genetic factors is overwhelming, the connection between individual-level IQ scores and nation-level IQ scores is not entirely clear. Heritability estimates are point estimates that are designed to explain variance generated from individual scores and thus whether these results can be extrapolated to higher levels of aggregation remains to be determined.

The goal of the current study is to provide a partial test of Lynn and Vanhanen’s (2002, 2006) thesis that variation in IQ scores at the nation level is the result of genetic differences. Our analysis focuses on examining whether polymorphisms in dopaminergic genes are related to IQ scores. We employed dopaminergic genes because prior research has provided some theoretical and empirical evidence linking the dopaminergic system, including dopaminergic polymorphisms, to cognitive abilities and IQ (Beaver, DeLisi, Vaughn, & Wright, 2010; Berman & Noble, 1995; Previc, 1999).

Due to data limitations we were unable to obtain data that included IQ scores at the nation level and genetic data at the nation level. We were, however, able to locate data that included IQ scores and DNA markers that could be aggregated to the school level. In this way, we were able to test whether variation in IQ scores at the school level was associated with variation in DNA markers that were aggregated to the school level. While using data aggregated to the school level cannot be considered a definitive test of Lynn and Vanhanen’s hypotheses at the nation level, the results based on schools can be considered an initial test of their statements for two main reasons. First, schools, like nations, show tremendous variation in terms of health, wealth, crime, and even IQ (Herrnstein & Murray, 1994; Saab & Klinger, 2010; Weissberg, 2010). Second, Lynn and Vanhanen’s (2002) arguments linking IQ to various outcomes have been shown to exist at levels of aggregation other than the nation, including the state level and the county level (Beaver & Wright, 2011; Kanazawa, 2008; McDaniel, 2006; Pesta et al., 2010). It is quite likely, then, that Lynn and Vanhanen’s explanation may apply to all types of aggregate units of analysis, not just nations. We use this possibility as a springboard to provide the first partial test of Lynn and Vanhanen’s provocative thesis that IQ varies across nations because of variation in genetic factors.

2. Method

2.1. Sample

Data for this study come from waves 1 and 3 of the National Longitudinal Study of Adolescent Health (Add Health). The Add Health is a four-wave study of a nationally representative sample of American youths who were enrolled in seventh through twelfth grades during the 1994–1995 school year (Udry, 2003). Multi-stage stratified sampling techniques were employed to select 132 middle and high schools included in the study. Students attending these schools were administered a self-report survey during a specified school day. More than 90,000 youths were included in the wave 1 in-school component of the Add Health study. A subsample of youths was then selected to be reinterviewed at their homes to gain more detailed information. Altogether, 20,745 adolescents participated in the wave 1 in-home component of the study (Harris, Florey, Tabor, Bearman, Jones, & Udry, 2003).

One of the distinguishing features of the Add Health data is that at wave 3 a subsample of respondents was genotyped. To be eligible for participation in the DNA subsample, respondents had to have a sibling who was also included in the study. If they were eligible, and if they agreed to participate, then they submitted samples of their buccal cells to be genotyped. Genotyping was conducted in coordination with the Institute of Behavioral Genetics in Boulder, Colorado and Add Health. In total, more than 2500 participants were included in the DNA subsample of the study (Harris, Halpern, Smolen, & Haberstick, 2006).

The final analytic sample consisted of only schools where there were at least 19 students who were included in the sample. In that way, the school-level estimates, which were based on aggregated individual-level scores, were less subject to variability associated with small sample sizes. After removing schools with fewer than 19 students, we were left with a final analytical sample of 1265 youths nested within 36 schools. Given that our analysis is based on a small sample size (N=36 schools), the power to detect small-to-moderate effect sizes is compromised. Any statistically significant effects that are detected will thus be moderate-to-large in magnitude.

2.2. Measures

2.2.1. School-level IQ scores

At wave 1, Add Health participants completed the Picture Vocabulary Test (PVT). The PVT is an abbreviated version of the full-length Peabody Picture Vocabulary Test (PPVT), a test used to assess verbal abilities and receptive vocabulary. The PVT measure has been used previously as a measure of IQ among researchers analyzing the Add Health data (Rowe, Jacobson, & Van den Oord, 1999). School-level IQ was estimated by aggregating and averaging individual PVT scores at the school level. A similar technique has been used previously to estimate county-level IQ (Beaver & Wright, 2011). The final score represents the average IQ score for respondents attending that school. The average school-level IQ was 99.08 with a standard deviation of 7.54.

2.2.2. School-level dopamine scores

To estimate school-level dopamine scores, we aggregated and averaged (at the school level) genotypic scores for three dopaminergic polymorphisms: one in the dopamine transporter gene (DAT1), one in the dopamine D2 receptor gene (DRD2/ANKK1), and one in the dopamine D4 receptor gene (DRD4). Detailed information about the genotyping of these polymorphisms is available elsewhere (Beaver, Vaughn, Wright, DeLisi, & Howard, 2010; Hopfer, Timberlake, Haberstick, Lessem, Ehringer, Smolen et al., 2005). We used prior research examining the link between dopaminergic genes and cognitive abilities to determine which alleles should be coded as the risk alleles (Beaver, Vaughn et al., 2010). Briefly, respondents were genotyped for a 40 base pair variable number of tandem repeats (VNTR) in the 3′ untranslated region of DAT1 (SLC6A3). For this polymorphism, the 10R allele was coded as the risk allele, while the 9R allele was coded as the non-risk allele. Following the lead of prior researchers (Hopfer et al., 2005), alleles other than the 9R and 10R were removed from the analysis. The second dopaminergic polymorphism included in the current study was the DRD2/ANKK1 TaqIA polymorphism. For DRD2/ANKK1, the A1 allele was scored as the risk allele and the A2 allele was scored as the non-risk allele. Last, DRD4 has a 48 base pair VNTR located at 11p15.5 on exon III. Two groups of alleles were created: one that included the 2R, 3R, 4R, 5R, and 6R alleles and one that included the 7R, 8R, 9R, and 10R alleles. The group of alleles that contained alleles of 7R or greater were coded as the risk alleles while the other group of alleles were coded as the non-risk alleles. All of the polymorphisms were coded codominantly, where the value indexed the number of risk alleles that each respondent possessed (0 risk alleles, 1 risk allele, or 2 risk alleles).

Following prior research indicating that the combination of genes (as opposed to each gene in isolation) tends to have the strongest and most consistent effects on human phenotypes (Belsky & Beaver, 2011; Li et al., 2010), we created an additive dopamine index (Beaver, Vaughn et al., 2010). To create this index, the scores for each of the three polymorphisms were summed at the individual level with values ranging between zero (0) and six. We then aggregated and averaged the individual-level dopamine scores for each school. The average school-level dopamine score was 2.54, with a standard deviation of 0.28.
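The construction of the additive index can be sketched directly: each polymorphism contributes 0, 1, or 2 risk alleles, the three counts are summed per person (range 0-6), and the individual sums are averaged within each school. The genotype data below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical records: (school id, DAT1, DRD2/ANKK1, DRD4 risk-allele counts).
people = [
    ("A", 2, 1, 0),
    ("A", 1, 1, 1),
    ("B", 0, 2, 0),
    ("B", 2, 2, 1),
]

totals = defaultdict(list)
for school, dat1, drd2, drd4 in people:
    index = dat1 + drd2 + drd4  # additive dopamine score, 0..6
    totals[school].append(index)

# Aggregate: mean individual index per school.
school_dopamine = {s: sum(v) / len(v) for s, v in totals.items()}
print(school_dopamine)
```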

2.2.3. Percentage African American

We included a measure of percentage African American in the analyses as a control variable. To create this measure, we aggregated and averaged scores on an individual-level self-reported race question, where 0=white and 1=African American. The resulting value indexed the percentage of African Americans who were attending the school.

2.3. Analytical strategy

The analysis for this study was conducted in two main steps. First, our analysis was focused on the interrelationships between IQ and dopaminergic polymorphisms at the individual-level. Specifically, we estimated the means and standard deviations for IQ by each genotype and we also examined the bivariate correlations between the dopaminergic genes and IQ scores. In addition, we examined whether IQ scores and dopaminergic scores varied across schools by calculating F-tests. The second step in the analysis was to estimate whether school-level dopamine scores predicted school-level IQ scores before and after controlling for percentage African American. To do so, we conducted ordinary least squares (OLS) regression models.
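The second step (school-level OLS before and after the race control) can be sketched with a hand-rolled regression on synthetic school-level data; the effect sizes and the data-generating model are my assumptions, chosen only so that both predictors relate negatively to IQ.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 36  # number of schools, matching the study's sample

# Invented school-level data: dopamine index, percentage African
# American, and an IQ score built to depend negatively on both.
dopamine = rng.normal(2.54, 0.28, k)
pct_black = rng.uniform(0, 1, k)
iq = 99 - 8 * (dopamine - 2.54) - 5 * pct_black + rng.normal(0, 2, k)

def ols_betas(y, X):
    """Standardized OLS coefficients: z-score everything, then least squares."""
    Z = (X - X.mean(0)) / X.std(0)
    zy = (y - y.mean()) / y.std()
    Zc = np.column_stack([np.ones(len(zy)), Z])
    b, *_ = np.linalg.lstsq(Zc, zy, rcond=None)
    return b[1:]  # drop the intercept

# Model 1: dopamine alone; Model 2 adds the race control.
beta_bivariate = ols_betas(iq, dopamine[:, None])[0]
beta_partial = ols_betas(iq, np.column_stack([dopamine, pct_black]))[0]
print(beta_bivariate, beta_partial)
```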

3. Results

The analysis begins by first estimating the association between each of the dopaminergic polymorphisms and individual-level IQ scores. Table 1 presents the means, standard deviations, sample sizes, and correlations for each of the genotypes. The results indicate that DAT1 and DRD2 maintain statistically significant and negative associations with IQ scores, while the effect of DRD4 on IQ is non-significant. To further explore the association between dopaminergic polymorphisms and IQ, we employed the additive dopamine scale as a predictor of IQ scores. The results of this analysis indicated a statistically significant and negative association between IQ and dopamine scores, where higher scores on the dopamine index correspond to lower IQ scores (r=−.15, p<.05, two-tailed test).

We continue our analysis of the individual-level data by examining whether IQ scores and dopamine scores vary significantly across the 36 schools. Our aggregate-level analyses hinge on significant variation across schools in both IQ and dopamine scores, otherwise it would be akin to trying to explain a constant with a constant, a variable with a constant, or a constant with a variable. The results of the F-tests revealed that IQ scores varied significantly across schools (F=11.227, p<.05), as did dopamine scores (F=2.239, p<.05). Fig. 1 reveals additional support that IQ scores and dopamine scores vary significantly across schools. The distributions in this figure reveal the scores for IQ and dopamine, respectively, across schools and clearly indicate a significant amount of dispersion for both variables.
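The F-tests reported here are one-way ANOVAs of individual scores across schools: the between-school mean square divided by the within-school mean square. A minimal sketch on invented data (four hypothetical schools rather than the study's 36):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data: 40 student IQ scores per school, with school means
# that genuinely differ (95, 99, 102, 106) around a within-school SD of 10.
schools = [rng.normal(mu, 10, 40) for mu in (95, 99, 102, 106)]

def one_way_F(groups):
    """One-way ANOVA F statistic: MS(between) / MS(within)."""
    all_scores = np.concatenate(groups)
    grand = all_scores.mean()
    k, n = len(groups), len(all_scores)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

print(f"F = {one_way_F(schools):.2f}")
```

An F well above 1 indicates that the school means differ by more than within-school sampling noise would predict, which is the precondition for the aggregate analysis.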

The next set of analyses examines the association between school-level dopamine scores and school-level IQ. Model 1 in Table 2 shows the results of the bivariate analyses revealing a strong and statistically significant negative association between dopamine scores and IQ scores (as measured with a standardized regression coefficient [i.e., Beta]). Given that allelic distributions for certain genes and IQ scores both vary across race/ethnicity, it is possible that the results would be rendered spurious by the confounding effects of race. As a result, in Model 2 we introduce the percentage of African American variable. As can be seen, even after including race in the analysis, the partial correlation between school-level dopamine scores and school-level IQ scores remained large and statistically significant.

Last, to examine convergence in the results generated at the individual level with those generated at the school level, we plotted predicted IQ scores across scores on the dopamine scale index. The dopamine scale indexes were z-transformed so that the individual-level analysis could be compared with the school-level analysis. Fig. 2 portrays these plots and shows a high degree of convergence in the slopes and by implication the predicted values, where IQ scores decrease as the total number of risk alleles increases.

4. Discussion

Research has consistently revealed that IQ and other measures of cognitive abilities vary significantly across macro-level units of analysis, such as states, nations, and even schools. Although various explanations have been set forth to explain variation in IQ at the macro-level, the most controversial explanation is that genetic variation across macro-level units explains variation in IQ. To this point, however, empirical research had not directly examined this potential link. The current study partially addressed this gap in the literature by examining whether variation in IQ at the school level was associated with dopaminergic scores aggregated to the school-level. Analysis of data drawn from the Add Health revealed support in favor of this position, where schools that had higher dopamine scores were the same schools that had, on average, lower IQ scores.

Our results also examined the association between dopaminergic polymorphisms and IQ at the individual level. Consistent with prior research (e.g., Beaver, DeLisi et al., 2010; Berman & Noble, 1995), the associations between dopaminergic genes and individual-level IQ scores were either small and statistically significant or non-significant. Recall, however, that the association between school-level dopamine scores and school-level IQ scores was relatively large in magnitude, which necessarily begs the question of why the effects differed so markedly. While not exhaustive we offer two potential explanations. First, given the small sample size that was employed in the school-level analysis, our statistical power to detect small-to-moderate effect sizes was severely compromised and detecting large effect sizes could be due, in part, to methodological and statistical artifacts. We addressed this possibility by comparing the predicted values of IQ scores at the individual- and school-levels of analysis. The results of these models converged suggesting that the significant effects at the school-level are not solely due to a methodological or statistical artifact.

Second, it is well known that findings detected at one level of analysis cannot be extrapolated to other levels of aggregation (Piantadosi, Byar, & Green, 1988; Samuelson, 1955). This phenomenon is particularly salient in the social sciences where research often spans multiple units of analysis, but the effects can differ considerably among units of analysis (Kramer, 1983). Criminological research, for instance, consistently reveals a strong and robust association between poverty and crime rates among macrosocial units (e.g., states or neighborhoods), while the association between poverty and criminal involvement at the individual-level is weak and oftentimes non-significant. It is quite possible that this pattern also applies to genetic research, where the usual small effects of single genes detected at the individual level become much larger at higher levels of aggregation. Future research will need to explore this possibility in much greater detail.

To our knowledge, this is the first study to aggregate DNA markers to a unit of analysis higher than the individual. Moreover, this is the first study to our knowledge that has revealed that variation in aggregate IQ scores is associated with variation in aggregate DNA markers. These results are in line with Lynn and Vanhanen’s (2002, 2006) (see also Hart, 2007; Rushton, 1997) thesis that the average IQ of nations is the result of genetic differences across those nations. Of course, the current study used schools, not nations, as the unit of analysis, meaning that the results reported here may not generalize to other levels of aggregation, including the nation level. There is good reason to believe, however, that the association between DNA and IQ would be even stronger at the nation level in comparison with the school level. There is much more variation in both genetic markers and IQ scores cross-nationally than there is across schools. Schools in the current study were all drawn from the same country (i.e., the United States) creating more genetic homogeneity among schools than there is among nations. Given that nations can vary quite drastically in terms of the allelic distributions of certain genes (Cavalli-Sforza, Menozzi, & Piazza, 1994), it stands to reason that this increased genetic variation would be able to explain more of the variance in IQ scores. Future research is needed to address this issue more fully and examine whether the link between DNA markers and IQ scores would be detected at other levels of aggregation.

The results of the current study provide some of the first evidence indicating that IQ scores across macro-level units are the result of genetic factors. As with all research, though, the current study is host to at least three limitations that need to be rectified in follow-up studies. First, only three dopaminergic genes were used to create the dopamine scale. Although the dopaminergic system has previously been linked to IQ (Beaver, DeLisi et al., 2010; Berman & Noble, 1995; Previc, 1999), future research would benefit by examining a broader range of genes from the dopaminergic system and other systems of genes that may be linked to IQ. Second, the data that were available only allowed for IQ and DNA to be aggregated to the school level. It would be interesting to examine what types of associations are visible at other levels of aggregation, including the neighborhood level, the state level, and the nation level. Third, the measure of IQ was based on scores garnered from the PVT, a test designed to assess verbal skills. Whether the results would be observed using different measures of IQ is an empirical question awaiting future research. Until these limitations are addressed, the results of the current study should be interpreted with caution. If future researchers are able to replicate these findings, then the results would begin to provide additional support that cross-national inequalities may be produced, in part, by genetic variation.


Dr Kahneman and a growing number of his colleagues fear that a lot of this priming research is poorly founded. Over the past few years various researchers have made systematic attempts to replicate some of the more widely cited priming experiments. Many of these replications have failed. In April, for instance, a paper in PLoS ONE, a journal, reported that nine separate experiments had not managed to reproduce the results of a famous study from 1998 purporting to show that thinking about a professor before taking an intelligence test leads to a higher score than imagining a football hooligan. …

It is tempting to see the priming fracas as an isolated case in an area of science — psychology — easily marginalised as soft and wayward. But irreproducibility is much more widespread. A few years ago scientists at Amgen, an American drug company, tried to replicate 53 studies that they considered landmarks in the basic science of cancer, often co-operating closely with the original researchers to ensure that their experimental technique matched the one used first time round. According to a piece they wrote last year in Nature, a leading scientific journal, they were able to reproduce the original results in just six. Months earlier Florian Prinz and his colleagues at Bayer HealthCare, a German pharmaceutical giant, reported in Nature Reviews Drug Discovery, a sister journal, that they had successfully reproduced the published results in just a quarter of 67 seminal studies.

But the first sentence of the following paragraph sounds unlikely to me:

Academic scientists readily acknowledge that they often get things wrong. But they also hold fast to the idea that these errors get corrected over time as other scientists try to take the work further. Evidence that many more dodgy results are published than are subsequently corrected or withdrawn calls that much-vaunted capacity for self-correction into question. There are errors in a lot more of the scientific papers being published, written about and acted on than anyone would normally suppose, or like to think.

Because to acknowledge this publicly is to lose one's credibility as a professor, as a scientist. More likely than not, scientists will admit to being wrong, but not publicly. They will wait for the storm to pass, i.e., for others to forget.

Various factors contribute to the problem. Statistical mistakes are widespread. The peer reviewers who evaluate papers before journals commit to publishing them are much worse at spotting mistakes than they or others appreciate. Professional pressure, competition and ambition push scientists to publish more quickly than would be wise. A career structure which lays great stress on publishing copious papers exacerbates all these problems. “There is no cost to getting things wrong,” says Brian Nosek, a psychologist at the University of Virginia who has taken an interest in his discipline’s persistent errors. “The cost is not getting them published.”

Statistical mistakes or misuse are a problem when they increase the variability of results between studies, quite apart from the fact that studies often use different procedures as well. This leads most people not familiar with statistics (i.e., laypeople) to the false conclusion that a certain field of research does not yield promising results, when in fact the inconsistency was due to poor methodology.

Unlikeliness is a measure of how surprising the result might be. By and large, scientists want surprising results, and so they test hypotheses that are normally pretty unlikely and often very unlikely. Dr Ioannidis argues that in his field, epidemiology, you might expect one in ten hypotheses to be true. In exploratory disciplines like genomics, which rely on combing through vast troves of data about genes and proteins for interesting relationships, you might expect just one in a thousand to prove correct.

With this in mind, consider 1,000 hypotheses being tested of which just 100 are true (see chart). Studies with a power of 0.8 will find 80 of them, missing 20 because of false negatives. Of the 900 hypotheses that are wrong, 5% — that is, 45 of them — will look right because of type I errors. Add the false positives to the 80 true positives and you have 125 positive results, fully a third of which are specious. If you dropped the statistical power from 0.8 to 0.4, which would seem realistic for many fields, you would still have 45 false positives but only 40 true positives. More than half your positive results would be wrong.

The negative results are much more trustworthy; for the case where the power is 0.8 there are 875 negative results of which only 20 are false, giving an accuracy of over 97%. But researchers and the journals in which they publish are not very interested in negative results. They prefer to accentuate the positive, and thus the error-prone. Negative results account for just 10-30% of published scientific literature, depending on the discipline. This bias may be growing. A study of 4,600 papers from across the sciences conducted by Daniele Fanelli of the University of Edinburgh found that the proportion of negative results dropped from 30% to 14% between 1990 and 2007. Lesley Yellowlees, president of Britain’s Royal Society of Chemistry, has published more than 100 papers. She remembers only one that reported a negative result.
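The arithmetic in the two quoted paragraphs can be checked directly. A minimal sketch in Python, using only the numbers given in the quote (1,000 hypotheses, 100 true, a 5% type I error rate, power of 0.8 or 0.4):

```python
# Reproduce the false-positive arithmetic from the quoted passage.
def positive_predictive_value(n_hyp, n_true, power, alpha):
    true_pos = n_true * power                 # true hypotheses correctly confirmed
    false_neg = n_true - true_pos             # true hypotheses missed
    false_pos = (n_hyp - n_true) * alpha      # type I errors among false hypotheses
    true_neg = (n_hyp - n_true) - false_pos
    positives = true_pos + false_pos
    negatives = true_neg + false_neg
    return {
        "positives": positives,
        "share_of_positives_wrong": false_pos / positives,
        "negatives": negatives,
        "negative_accuracy": true_neg / negatives,
    }

print(positive_predictive_value(1000, 100, power=0.8, alpha=0.05))
# 125 positives, of which 45/125 = 36% are specious; 875 negatives, ~97.7% accurate
print(positive_predictive_value(1000, 100, power=0.4, alpha=0.05))
# 85 positives, of which 45/85 (more than half) are specious
```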

Statisticians have ways to deal with such problems. But most scientists are not statisticians. Victoria Stodden, a statistician at Columbia, speaks for many in her trade when she says that scientists’ grasp of statistics has not kept pace with the development of complex mathematical techniques for crunching data. Some scientists use inappropriate techniques because those are the ones they feel comfortable with; others latch on to new ones without understanding their subtleties. Some just rely on the methods built into their software, even if they don’t understand them.

Statistical error is something I have heard about before. I myself could have made mistakes when I started to play with data, e.g., by forgetting to check the normality of the distribution or the presence of outliers in my variables. But thinking back, I noticed that in most papers I have read so far there was no mention, in the method section for instance, of distributional normality, whether for the variables or for the residuals (in the case of regression analyses), to say nothing of outliers. This makes me believe that these factors may simply have been neglected by the authors. That is a serious problem. Also annoying is the common use of inappropriate methods (see Erceg-Hurn & Mirosevich, **2008**), partly because scientists may not be aware of newly proposed, better methods.

The last sentence in the paragraph cited above makes me think of something. Software packages such as Stata or SPSS work through a specific syntax for running correlations, regressions, ANOVAs, factor analyses, and so forth. I remember well that at the beginning I had a lot of problems with SPSS syntax. I don't know about scientists, but imagine they have typed the wrong syntax. Worse, when researchers analyze survey data, say, NLSY, ECLS, or Add Health, one wonders why they do not tell us which variables they used, or which sampling weight (if they used one at all). Simply naming the variable labels would not require much effort. But somehow I can understand: if they picked the wrong one and showed it, that would be the end. And this is not unlikely, because some surveys have multiple related variables scattered everywhere, some of which (sadly) must be collapsed into a single one. To choose the correct one, we need to read the codebooks (note the "s"), which may total more than 1,000 pages. This is time-consuming and really exhausting. Sometimes the data set is a real mess, and the variables do not always exclude missing answers (i.e., values), so we need to correct this manually with the appropriate syntax, or the analysis is ruined. It is regrettable that the method section is generally so obscure about this. It is impossible to guess how the authors handled the data. The less information researchers give, the less likely someone else will discover the origins of the flaws. This is especially bothersome when one team tries to replicate another and neither mentions the variables used or the procedure.
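To illustrate the missing-value problem just described, here is a minimal sketch. The sentinel codes (8 and 9) and the data are invented for illustration, not taken from any actual codebook:

```python
# Hypothetical illustration: survey items often code "don't know" and
# "no answer" as sentinel values, and leaving them in biases any analysis.
MISSING_CODES = {8, 9}  # assumed sentinel codes, invented for this example

def clean(values, missing_codes=MISSING_CODES):
    """Replace sentinel missing codes with None so they drop out of analyses."""
    return [None if v in missing_codes else v for v in values]

raw = [1, 2, 9, 3, 8, 2]            # 9 and 8 are really non-responses
naive_mean = sum(raw) / len(raw)    # inflated by the sentinel codes
valid = [v for v in clean(raw) if v is not None]
true_mean = sum(valid) / len(valid) # 2.0, computed on valid answers only
```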

I have tried a few times to request syntax from authors or to ask which variables they used, but so far my success rate is 0%, and I do not expect it to increase. Asking for data is generally even worse. This is understandable: the more data we share, the more likely we are to be countered. As for myself, although I am not a scientist and hence not comparable at all, I always share my Excel spreadsheets and syntax. I must be stupid. The sad thing is that I will continue.

This fits with another line of evidence suggesting that a lot of scientific research is poorly thought through, or executed, or both. The peer-reviewers at a journal like Nature provide editors with opinions on a paper’s novelty and significance as well as its shortcomings. But some new journals — PLoS One, published by the not-for-profit Public Library of Science, was the pioneer — make a point of being less picky. These “minimal-threshold” journals, which are online-only, seek to publish as much science as possible, rather than to pick out the best. They thus ask their peer reviewers only if a paper is methodologically sound. Remarkably, almost half the submissions to PLoS One are rejected for failing to clear that seemingly low bar.

PLoS ONE is a journal from which I pick a lot of studies, as I come across many PLoS ONE papers. But what I believed to be a high-quality journal turns out to be far from it. More problematic still is the apparent growth in suspicions of fraud.

The number of retractions has grown tenfold over the past decade. But they still make up no more than 0.2% of the 1.4m papers published annually in scholarly journals. Papers with fundamental flaws often live on. Some may develop a bad reputation among those in the know, who will warn colleagues. But to outsiders they will appear part of the scientific canon.

The following paragraph, however, is more annoying and surprised me a little, since I consider that a peer-reviewed journal has a duty to check articles carefully and to provide a severe critique. I had imagined a jungle inhabited by ferocious beasts. But here is the reality:

The idea that there are a lot of uncorrected flaws in published studies may seem hard to square with the fact that almost all of them will have been through peer-review. This sort of scrutiny by disinterested experts — acting out of a sense of professional obligation, rather than for pay — is often said to make the scientific literature particularly reliable. In practice it is poor at detecting many types of error.

John Bohannon, a biologist at Harvard, recently submitted a pseudonymous paper on the effects of a chemical derived from lichen on cancer cells to 304 journals describing themselves as using peer review. An unusual move; but it was an unusual paper, concocted wholesale and stuffed with clangers in study design, analysis and interpretation of results. Receiving this dog’s dinner from a fictitious researcher at a made up university, 157 of the journals accepted it for publication.

Dr Bohannon’s sting was directed at the lower tier of academic journals. But in a classic 1998 study Fiona Godlee, editor of the prestigious British Medical Journal, sent an article containing eight deliberate mistakes in study design, analysis and interpretation to more than 200 of the BMJ’s regular reviewers. Not one picked out all the mistakes. On average, they reported fewer than two; some did not spot any.

Another experiment at the BMJ showed that reviewers did no better when more clearly instructed on the problems they might encounter. They also seem to get worse with experience. Charles McCulloch and Michael Callaham, of the University of California, San Francisco, looked at how 1,500 referees were rated by editors at leading journals over a 14-year period and found that 92% showed a slow but steady drop in their scores.

As well as not spotting things they ought to spot, there is a lot that peer reviewers do not even try to check. They do not typically re-analyse the data presented from scratch, contenting themselves with a sense that the authors’ analysis is properly conceived. And they cannot be expected to spot deliberate falsifications if they are carried out with a modicum of subtlety.

Fraud is very likely second to incompetence in generating erroneous results, though it is hard to tell for certain. Dr Fanelli has looked at 21 different surveys of academics (mostly in the biomedical sciences but also in civil engineering, chemistry and economics) carried out between 1987 and 2008. Only 2% of respondents admitted falsifying or fabricating data, but 28% of respondents claimed to know of colleagues who engaged in questionable research practices.

Peer review’s multiple failings would matter less if science’s self-correction mechanism — replication — was in working order. Sometimes replications make a difference and even hit the headlines — as in the case of Thomas Herndon, a graduate student at the University of Massachusetts. He tried to replicate results on growth and austerity by two economists, Carmen Reinhart and Kenneth Rogoff, and found that their paper contained various errors, including one in the use of a spreadsheet.

I used to believe that reviewers were much pickier than this. Again, this seems to depart seriously from reality. On reflection, though, it may not be surprising, as I hear here and there that scientists are always busy (teaching courses, conferences). Given this, most reviewers likely do not examine papers with much scrutiny. And yet this is no excuse for not noticing that the use of a no-contact control group, in experiments studying the effect of working-memory training on general intelligence, upwardly biases the effect size, known in this field as the Hawthorne (placebo) effect. Every scientist must know this.

I have even spotted papers that misreport their own numbers. For instance, the text says there is a relationship between x and y and gives a number, but the table reports otherwise; or the authors simply make a claim opposite to what their own tables show. This is rather confusing, and most annoying when several people worked together and no one detected the error. It makes me wonder whether some numbers included in the analyses could themselves have been misreported. I noted this in an article **recently**. A slight modification of the numbers in one column vector can greatly affect the magnitude of a correlation, at least when n is small. And obviously, the conclusions drawn from such analyses will be biased by misreported numbers.
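The point about small samples can be demonstrated with invented numbers: changing a single value in one vector is enough to move a Pearson correlation from near-perfect to weak.

```python
# Illustration with invented data: with small n, a single misreported
# value in one vector can change the Pearson correlation dramatically.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y_correct = [2, 4, 5, 7, 9]
y_typo = [2, 4, 5, 7, 2]   # one misreported value (9 -> 2)

print(round(pearson(x, y_correct), 3))  # ~0.995
print(round(pearson(x, y_typo), 3))     # ~0.224
```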

With regard to fraud, I believe, at least on the topic of intelligence, genetics, and race, that studies supporting hereditarianism (or what some call racialism) are costly, and even more so when those studies are fraudulent. A study purporting to undermine the hereditarian argument through misconduct and falsification, by comparison, has, in my opinion, much less to fear from public opinion, including that of peers.

Anyway, the way in which this article (The Economist's, not mine) is useful is in reminding naive readers not to rely too much on a single study. I see this kind of thing a million times, usually on blogs and forums. Those people do not bother to provide a list of studies, or at least a review. Meta-analysis, because it is based on a theory of data (Hunter & Schmidt, **2004**, p. 30), is a good tool in that it helps us better understand the variability between studies, and hence how an effect size can be maximized (e.g., through the detection of moderators). But this requires looking beyond the level of a single study.

The variables used for the GSS are:

Dependent variable:

Happy_Dichotomy. 0 = Not too happy, 1 = Happy.

Independent variables (predictors):

SEX. 1 = MALE, 2 = FEMALE.

WORDSUM. Vocabulary test (a proxy for IQ; correlation = 0.71, or 0.83 with g). It is not a measure of general intelligence, however, and the test contains only 10 items/questions, which explains its low stability coefficient (0.73). See "Reliability and Stability Estimates for the GSS Core Items from the Three-wave Panels, 2006–2010" (Michael Hout & Orestes P. Hastings, **2012**).

SEI. Respondent socioeconomic index. (range: from ~17 to ~97)

good_health. 1 = Poor, 2 = Fair, 3 = Good, 4 = Excellent.

MARITAL_STATUS. 1 = Never married, 2 = Married.

POLVIEWS. 1 = Extremely liberal, 4 = Moderate, 7 = Extremely conservative. We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal – point 1 – to extremely conservative – point 7. Where would you place yourself on this scale?

ATTEND. 0 = Never, 2 = Once a year, 4 = Once a month, 7 = Every week, 8 = More than once a week. How often do you attend religious services?

AGE. Respondent’s age.

COHORT. Birth cohort of respondent. (Note: higher values of this variable indicate more recent cohorts.)

BW. 1 = White, 2 = Black.

A reminder concerning regression analyses. If a regression coefficient (Beta or B) is positive (or negative), it means that as the values of the independent variable increase, the values of the dependent variable increase (or decrease), i.e., a positive (or negative) relation. For example, if Health is positive, moving from Poor health (value = 1) to Fair health (value = 2) is positively associated with an increase in the values (0 = Not too happy, 1 = Happy) of the Happy variable.

Now, Kanazawa's table was as follows:

My table can be compared:

Although Kanazawa uses ordinal regression while I use **logistic regression**, the figures are similar. The white sample presented here (N=5050) is large, and the figures resemble those obtained for the black sample (N=719), except that Wordsum has no impact. Exp(B) refers to the change in the odds ratio attributable to the independent variable. An odds ratio greater than 1 denotes a positive impact, an odds ratio less than 1 denotes a negative impact, and an odds ratio equal to 1 denotes a null (zero) effect. The Wald statistic is a kind of chi-square statistic.
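The relation between a logistic coefficient B and Exp(B) can be sketched directly; the coefficients below are invented for illustration:

```python
# Exp(B) in logistic regression output is simply exp(B): the factor by
# which the odds of the outcome are multiplied per 1-unit increase in
# the predictor. Coefficients here are invented.
import math

def odds_ratio(b):
    return math.exp(b)

print(odds_ratio(0.5))   # ~1.65: odds multiplied by 1.65 (positive impact)
print(odds_ratio(-0.5))  # ~0.61: odds ratio below 1 (negative impact)
print(odds_ratio(0.0))   # 1.0: null effect
```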

On the question of why there are multiple variables for Health and Children: I configured SPSS to treat these two variables as categorical. The procedure is shown below:

Setting the indicator to "first" means that the lowest value of the selected variable(s) serves as the reference category, the point of comparison for the other values of the variable. For example, Good_health (3) has an Exp(B) of 10.694, which represents the effect (on happiness) of being classified as in "excellent health" (health = 4) versus "poor health" (health = 1). Good_health (2) has an Exp(B) of 4.783, which represents the effect (on happiness) of being classified as in "good health" (health = 3) versus "poor health" (health = 1). And so on.

All variables except sex and marital status were standardized (expressed as z-scores, i.e., in standard deviation units) to make comparison easier. Without this transformation, the B or Exp(B) coefficient would reflect a 1-point increase in the raw values of the variables entered. For example, if the values of a "family income" variable range from 1,000 to 200,000 (constant) dollars, the coefficients express the effect of a 1-dollar increase on the dependent variable (here, Happiness). The Degree variable (not entered here), which has 5 values for the categories Less than High School, High School, Junior College, Bachelor, and Graduate, would have a vastly larger coefficient than family income simply because it has fewer values (or categories): a 1-point increase means moving from Less than High School to High School, which is obviously a far larger change than the effect of one additional dollar.
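Standardization itself is simple to sketch; the income values below are invented:

```python
# z-score standardization: after this, a regression coefficient describes
# the effect of a 1-standard-deviation increase rather than a 1-raw-unit
# increase, making predictors on different scales comparable.
def standardize(values):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

income = [1000, 50000, 100000, 200000]   # raw dollars: 1 unit = 1 dollar
z_income = standardize(income)           # now 1 unit = 1 standard deviation
print([round(z, 2) for z in z_income])
```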

This is why Kanazawa's coefficients for the "income" variable are close to zero, perhaps because his variable is configured with 12 categories (or values). In any case, the Zsei variable has a positive coefficient, exactly what I found when analyzing the NLSY97 **previously**, where I showed that the effect of being in good health is nevertheless larger than that of income. Here again, being in good health appears to be crucially important to being happy. The positive coefficient for ATTEND means that being religious is associated with greater happiness. The negative B coefficients for age and cohort mean that older people and more recent cohorts report being less happy.

When I restrict the sample to individuals of high economic status (N=1978), the coefficient for cohort becomes positive, whereas it remains negative when I restrict the sample to individuals of low economic status. This means that among the rich, recent cohorts are becoming happier, while the opposite holds among the poor. My best interpretation is that rising inequality is behind all this. The result for the black sample is so anomalous that I cannot comment on it.

Concerning the impact of economic status (SES), Kanazawa's theory was that money has a weaker positive impact among women than among men. This is borne out, since among whites the Zsei coefficient is higher for men. For blacks the contrast is striking: the positive effect of SES among black men (N=281) is very large but approaches zero among black women (N=438).

We also see that having children (and more children), compared with having none, is negatively associated with considering oneself happy. For Marital Status, the coefficient is positive, meaning that being married increases happiness. Now, I also included an interaction variable, obtained by multiplying Marital by Children. How should this interaction be interpreted? The interaction expresses what happens when:

children=0 marital=0

becomes:

children=1 marital=1

In other words, when the values of both variables increase simultaneously. Since the interaction variable has a positive sign, "being married and having children" (simultaneously) increases happiness compared with "being unmarried and having no children" (simultaneously). This means that marital status increases happiness despite the presence of children, whose independent effect on happiness is negative. If both variables had positive signs, then a positive interaction coefficient would mean there is a multiplicative effect, an increasing-returns effect. For example, if children and marital respectively add +3 and +6 points of happiness, then "having children and being married" would produce not +9 points of happiness, but more.
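This interaction logic can be sketched with invented coefficients chosen only to mirror the signs described above (negative for children, positive for marital and for the interaction):

```python
# Invented coefficients, for illustration of the interaction logic only.
B_CHILDREN = -0.4    # independent (negative) effect of having children
B_MARITAL = 0.6      # independent (positive) effect of being married
B_INTERACTION = 0.3  # positive interaction: married-with-children bonus

def linear_predictor(children, marital):
    return (B_CHILDREN * children
            + B_MARITAL * marital
            + B_INTERACTION * children * marital)

print(linear_predictor(0, 0))  # baseline
print(linear_predictor(1, 0))  # children alone: negative
print(linear_predictor(0, 1))  # married alone: positive
print(linear_predictor(1, 1))  # both: more than the sum of the separate effects
```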

The logistic regression analysis also produces other figures, such as the Hosmer and Lemeshow goodness-of-fit test. Its value is 0.907, far above the usual significance cut-off of 0.05 (a cut-off which, let us recall, is purely arbitrary). Above this value, the model's fit is considered good and appropriate, which is visibly the case here. That said, the H&L test is extremely sensitive to sample size: the larger the number of subjects, the lower the H&L value. I therefore do not think much attention should be paid to this test.

That aside, there are also the Cox & Snell R Square and the Nagelkerke R Square. Since the former can never reach 1.0, it is underestimated, and it is better to rely on the Nagelkerke. Their values are 0.048 and 0.124, respectively. This pseudo-R² expresses the proportion of unexplained variance that is reduced by adding the variables entered into the model; the higher it is, the better. Another piece of information about model quality is the classification table, whose total percentage correct is 93.4%. This table cross-classifies the observed values against the values predicted by the model. The model's predictive accuracy is therefore very high, and the accompanying graph looks like this:
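The relation between the two pseudo-R² measures can be sketched as follows; the log-likelihoods are invented, and the formulas are the standard Cox & Snell and Nagelkerke definitions:

```python
# Nagelkerke rescales Cox & Snell by its maximum attainable value,
# 1 - exp(2*ll_null/n), so that the measure can reach 1.0.
# The log-likelihoods below are invented for illustration.
import math

def cox_snell_r2(ll_null, ll_model, n):
    return 1 - math.exp(2 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_null, ll_model, n):
    max_r2 = 1 - math.exp(2 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / max_r2

# invented example: null and fitted log-likelihoods for n = 1000
print(round(cox_snell_r2(-600.0, -550.0, 1000), 3))   # 0.095
print(round(nagelkerke_r2(-600.0, -550.0, 1000), 3))  # 0.136
```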

The values 0 and 1 on the x (horizontal) axis represent the values of the dichotomized dependent variable (Happy). Subjects with a value of zero appear on the left-hand side and those with a value of 1 on the right-hand side. When the points tend to cluster in the middle, those subjects have a 50/50 chance of being predicted correctly by the model. The more accurate a model's predictions on the actual data, the more the 0 and 1 values cluster toward their respective ends. That is the case here: misclassifications are rare, which explains the high percentage correct (93.4%).

As for the explanation of all these results, Kanazawa generally offers this interpretation:

Parents today must raise their children in a radically different environment from the EEA [Environment of Evolutionary Adaptedness]. They must drive them to and from daycare centres and soccer practices, they must put them through compulsory school and pay for their higher education, they must feed, clothe and shelter them in their adolescence and early adulthood (when they would have been economically independent in the EEA soon after puberty), they must purchase computers, cars and other expensive gadgets for them, etc. The list is endless. I suspect that having to raise children in an evolutionarily novel environment might suspend the operation of evolved psychological mechanisms (and the preferences, desires and emotions they engender) and allow other mechanisms to kick in and influence their happiness. Economic and sociological theories are indispensable in explaining these other mechanisms that might overtake and supersede evolved psychological mechanisms in the current environment.

I generally agree with the idea that it is harder to raise children nowadays. One obvious reason is the growing allocation of time to work at the expense of free time (i.e., leisure). That, however, does not explain the negative impact of having children.

When I restrict the sample to individuals from well-off classes, the negative effect of having children is slightly larger than when I restrict it to individuals from less well-off classes. Also, whether I restrict the sample to married people only or to unmarried people only, the negative effect of having children is far larger among the unmarried.

These U.S. data should not be generalized to other countries, since a factor that makes people in one region or country more or less happy will not necessarily affect the happiness of people in another country to the same degree. Cross-country comparisons of happiness levels are not necessarily valid, since potential confounding factors exist, such as policy implementation, culture generally, possibly genetic factors, or differences in technological advancement that operate between countries but not within them.

That said, the GSS is a cross-sectional study and says absolutely nothing about causality. A longitudinal study, by contrast, would have the advantage of being able to detect a possible negative effect of having children on parents' happiness in the years following the child's birth. Myrskylä & Margolis (**2012**) have this kind of data for England and Germany. Apparently, married, older, and more educated individuals experience higher and more durable happiness. In their analysis, the authors separate men and women, highly and less educated subjects, married and unmarried, and age brackets. This is a classic but effective way of detecting possible "moderators". An interesting detail is that happiness rises appreciably in the years just before the birth of the first child. The authors speculate that the parents are expecting a child and/or have just found a partner, hence the rise in reported happiness. Also, having a second child does not appreciably increase happiness, while having a third leads to a substantial drop in happiness.

The question of possible reverse causality was addressed:

Second, we considered dynamic panel data models which take into account the possibility of reverse causation. The positive association between childbearing and happiness could be driven by the fact that happiness increases the probability of having a child, rather than the process of having a child increasing happiness. We first used a simple and intuitive check to test whether past happiness confounds the association between childbearing and current happiness by including lagged happiness (up to 3 years) as controls in the models. The results changed only marginally. Third, we used the standard Arellano-Bond dynamic panel models that add both lagged dependent variables to the model and instruments the key regression variables with their own lags to account for reverse causality (Arellano and Bond 1991). We considered various specifications of lag structures and results were nearly identical to those obtained without the dynamic structure.

Concerning their graphs, I note that Figure 1 shows happiness trajectories trending negative after the birth of the first child, except for the fixed-effects models, which have the particularity of controlling for subjects' fixed characteristics (e.g., confounding factors such as personality):

The longitudinal fixed-effects approach has several important advantages over cross-sectional research. First, the approach is based on observing individual happiness trajectories over time, allowing us to analyze anticipation, short-term, and long-term changes in happiness with respect to a birth. Second, the approach allows controlling for individual-specific, time-invariant unobserved characteristics, such as personality or genetic endowments, and eliminates the problem of selection into parenthood on happiness. Third, it allows observing the pattern of changes in life satisfaction while controlling for other changing factors, such as age, time, employment or marital status.

The following graphs use the fixed-effects models. Nevertheless, in Figure 3B, among older German women (35+), no improvement in happiness is detected relative to the initial level (3-5 years before the birth of the first child). For German men, happiness remains high. In Figure 5, German women show no long-term improvement in their happiness. There are no graphs for the English sample, but the authors state that the curves are much the same. Consequently, British and German women are no happier after the birth of the first child, whereas men are. I suspect the reason is that the woman must sacrifice more for her child, for example by putting her job and career second. She therefore feels less free (so to speak).

But the crucial question is what kind of variable the researchers used to measure happiness. We read:

Our key outcome is the subjective well-being of parents. In the German sample, respondents were asked annually, “How satisfied are you with your life, all things considered?” Responses range from zero (completely dissatisfied) to ten (completely satisfied). In the British sample, parental well-being is measured with two questions. The first measures general happiness and is based on the question “Have you recently been feeling reasonably happy, all things considered?” with responses ranging from one (much less happy than usual) to four (more happy than usual). The other question is “How dissatisfied or satisfied are you with your life overall,” with answers ranging from one (not satisfied at all) to seven (completely satisfied).

This is the same question as the one asked in the GSS. And that is precisely where the problem lies. How we word and present a question matters, and can have a considerable impact on how respondents interpret it. Daniel Kahneman (**2006**) explains what he calls the "focusing illusion":

When people consider the impact of any single factor on their well-being — not only income — they are prone to exaggerate its importance; we refer to this tendency as the focusing illusion. Income has even less effect on people’s moment-to-moment hedonic experiences than on the judgment they make when asked to report their satisfaction with their life or overall happiness. These findings suggest that the standard survey questions by which subjective wellbeing is measured (mainly by asking respondents for a global judgment about their satisfaction or happiness with their life as a whole) may induce a form of focusing illusion, by drawing people’s attention to their relative standing in the distribution of material well-being. More importantly, the focusing illusion may be a source of error in significant decisions that people make. (4) …

Evidence for the focusing illusion comes from diverse lines of research. For example, Strack and colleagues (5) reported an experiment in which students were asked: (i) “How happy are you with your life in general?” and (ii) “How many dates did you have last month?” The correlation between the answers to these questions was -.012 (not statistically different from 0) when they were asked in the specified order, but the correlation rose to 0.66 when the order was reversed with another sample of students. The dating question evidently caused that aspect of life to become salient and its importance to be exaggerated when the respondents encountered the more general question about their happiness. Similar focusing effects were observed when attention was first called to respondents’ marriage (6) or health (7). One conclusion from this research is that people do not know how happy or satisfied they are with their life in the way they know their height or telephone number. The answers to global life satisfaction questions are constructed only when asked (8), and are therefore more susceptible to the focusing of attention on different aspects of life. …

Individuals who have recently experienced a significant life change — e.g., becoming disabled, winning a lottery, or getting married — surely think of their new circumstances many times each day, but the allocation of attention eventually changes, so that they spend most of their time attending to and drawing pleasure or displeasure from experiences such as having breakfast or watching television. (10) However, they are likely to be reminded of their status when prompted to answer a global judgment question such as, “How satisfied are you with your life these days?” …

Regarding the correlation between happiness and income, "everyday happiness" is not correlated with income, even though ratings of anxiety and tension do correlate with income level. Kahneman's explanation is as follows:

Finally, we would propose another explanation: as income rises, people’s time use does not appear to shift toward activities that are associated with improved affect. Subjective well-being is connected to how people spend their time. … People with greater income tend to devote relatively more of their time to work, compulsory non-work activities (such as shopping and childcare) and active leisure (such as exercise), and less of their time to passive leisure activities (such as watching TV). On balance, the activities that high-income individuals spend relatively more of their time engaged in are associated with no greater happiness, on average, but with slightly higher tension and stress.

The correlation between "general happiness" and income is therefore entirely illusory, and this variable is not a precise and adequate measure of happiness. When asked this question, people probably think back over the rare, important events, what they regard as the turning points of their lives, those essential things every life is supposed to accomplish: marriage, children, a promotion at work, a university degree. In recalling them, people tend to inflate the importance of these events. But if each of these things is associated with daily stress, the most plausible explanation is that each of them reduces happiness. People tend to overlook all the small details of daily experience when asked questions that remind them of their current social status.

The "focusing illusion" can explain why individuals report feeling happier shortly before the birth of the first child and even in the years that follow. The weakness of this longitudinal study is therefore its reliance on a presumably inadequate measure of happiness. When analyses use measures of "everyday" rather than "general" happiness, parenthood decreases happiness (Powdthavee, **2009**). Admittedly, that is a cross-sectional rather than a longitudinal study. Yet the result illustrates well why parents focus on rare (positive) events rather than everyday (negative) ones, unconsciously assuming that the former largely compensate for the stress experienced day to day:

Why do we have such a rosy view about parenthood? One possible explanation for this, according to Daniel Gilbert (2006), is that the belief that ‘children bring happiness’ transmits itself much more successfully from generation to generation than the belief that ‘children bring misery’. The phenomenon, which Gilbert says is a ‘super-replicator’, can be explained further by the fact that people who believe that there is no joy in parenthood – and who thus stop having them – are unlikely to be able to pass on their belief much further beyond their own generation. It is a little bit like Darwin’s theory of the survival of the fittest. Only the belief that has the best chance of transmission – even if it is a faulty one – will be passed on.

This explanation makes sense. The same phenomenon may also explain why income is positively associated with general happiness. Insofar as high social status increases reproductive success, we may in some sense be evolutionarily "conditioned" to think that status and success are always good things, when the truth is most likely the opposite.

**SPSS syntax for GSS analysis:**

RECODE happy (1 thru 2=1) (3=0) INTO Happy_Dichotomy.
EXECUTE.

RECODE race (1=2) (2=1) (ELSE=SYSMIS) INTO BW.
EXECUTE.

RECODE health (1=4) (2=3) (3=2) (4=1) INTO good_health.
EXECUTE.

RECODE childs (0=0) (1=1) (2=2) (3=3) (4 thru highest=4) INTO NUMBER_CHILDREN.
EXECUTE.

RECODE MARITAL (1=2) (5=1) (ELSE=SYSMIS) INTO MARITAL_STATUS.
EXECUTE.

COMPUTE CHILDREN_AND_MARITAL=NUMBER_CHILDREN*MARITAL_STATUS.
EXECUTE.

COMPUTE wtssall_oversamp=wtssall*oversamp.
EXECUTE.

COMPUTE SQRTrealinc=SQRT(realinc).
VARIABLE LABELS SQRTrealinc 'square root of R income in constant dollars'.
EXECUTE.

DESCRIPTIVES VARIABLES=age year good_health COHORT WORDSUM SEI realinc SQRTrealinc POLVIEWS health ATTEND
  /SAVE
  /STATISTICS=MEAN STDDEV MIN MAX.

FREQUENCIES VARIABLES=Zsei Zrealinc SQRTrealinc ZSQRTrealinc Zgood_health Zcohort Zage Zyear Zwordsum Zpolviews Zattend
  /FORMAT=NOTABLE
  /HISTOGRAM NORMAL
  /ORDER=ANALYSIS.

WEIGHT BY wtssall_oversamp.
USE ALL.
COMPUTE filter_$=(BW=1).
VARIABLE LABELS filter_$ 'BW=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

LOGISTIC REGRESSION VARIABLES Happy_Dichotomy
  /METHOD=ENTER sex Zage Zcohort Zwordsum Zsei Zpolviews Zattend good_health NUMBER_CHILDREN MARITAL_STATUS CHILDREN_AND_MARITAL
  /CONTRAST (NUMBER_CHILDREN)=Indicator(1)
  /CONTRAST (good_health)=Indicator(1)
  /CLASSPLOT
  /PRINT=GOODFIT CORR ITER(1) CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

WEIGHT BY wtssall.
USE ALL.
COMPUTE filter_$=(BW=2).
VARIABLE LABELS filter_$ 'BW=2 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

LOGISTIC REGRESSION VARIABLES Happy_Dichotomy
  /METHOD=ENTER sex Zage Zcohort Zwordsum Zsei Zpolviews Zattend good_health NUMBER_CHILDREN MARITAL_STATUS CHILDREN_AND_MARITAL
  /CONTRAST (NUMBER_CHILDREN)=Indicator(1)
  /CONTRAST (good_health)=Indicator(1)
  /CLASSPLOT
  /PRINT=GOODFIT CORR ITER(1) CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

USE ALL.
COMPUTE filter_$=(BW=2 and Zsei<=0.5).
VARIABLE LABELS filter_$ 'BW=2 and Zsei<=0.5 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

LOGISTIC REGRESSION VARIABLES Happy_Dichotomy
  /METHOD=ENTER sex Zage Zcohort Zwordsum Zsei Zpolviews Zattend good_health NUMBER_CHILDREN MARITAL_STATUS CHILDREN_AND_MARITAL
  /CONTRAST (NUMBER_CHILDREN)=Indicator(1)
  /CONTRAST (good_health)=Indicator(1)
  /CLASSPLOT
  /PRINT=GOODFIT CORR ITER(1) CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

USE ALL.
COMPUTE filter_$=(BW=2 and Zsei>=0.5).
VARIABLE LABELS filter_$ 'BW=2 and Zsei>=0.5 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

LOGISTIC REGRESSION VARIABLES Happy_Dichotomy
  /METHOD=ENTER sex Zage Zcohort Zwordsum Zsei Zpolviews Zattend good_health NUMBER_CHILDREN MARITAL_STATUS CHILDREN_AND_MARITAL
  /CONTRAST (NUMBER_CHILDREN)=Indicator(1)
  /CONTRAST (good_health)=Indicator(1)
  /CLASSPLOT
  /PRINT=GOODFIT CORR ITER(1) CI(95)
  /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

WEIGHT OFF.
FILTER OFF.
USE ALL.
EXECUTE.
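For readers without SPSS, the key recodes can be mirrored in Python with pandas (a minimal sketch; the toy rows below merely stand in for real GSS records):

```python
import pandas as pd

# Toy stand-in for a GSS extract (real data would be loaded from file).
df = pd.DataFrame({
    "happy":   [1, 2, 3, 1],    # 1=very happy, 2=pretty happy, 3=not too happy
    "health":  [1, 2, 3, 4],    # 1=excellent ... 4=poor
    "childs":  [0, 2, 5, 8],
    "marital": [1, 5, 1, 3],    # 1=married, 5=never married
})

# RECODE happy (1 thru 2=1) (3=0): collapse into a happy/unhappy dichotomy.
df["Happy_Dichotomy"] = df["happy"].map({1: 1, 2: 1, 3: 0})

# RECODE health: reverse-code so that higher = better health.
df["good_health"] = 5 - df["health"]

# RECODE childs (4 thru highest=4): top-code number of children at 4.
df["NUMBER_CHILDREN"] = df["childs"].clip(upper=4)

# RECODE MARITAL (1=2) (5=1) (ELSE=SYSMIS): married=2, never married=1, else missing.
df["MARITAL_STATUS"] = df["marital"].map({1: 2, 5: 1})

# COMPUTE the children-by-marital interaction term, as in the SPSS syntax.
df["CHILDREN_AND_MARITAL"] = df["NUMBER_CHILDREN"] * df["MARITAL_STATUS"]
```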

My interpretation of this seemingly widespread argument is that it is the easiest one, and consequently the most commonly used. But, something its users surely do not realize, it is also the silliest. The first rule, of course, is not to over-generalize: anecdote is not data. Maybe I am an outlier, and perhaps it was not actually the most-used argument. Whatever the case, I keep seeing it everywhere, again and again.

Anyway, the evidence from transracial adoption data is highly controversial, with some studies finding results consistent with the hereditarian model (Scarr & Weinberg, **1976**, **1983**; Scarr et al., **1993**; see also the comments on the Minnesota Transracial Adoption Study, Levin, **1994**; Lynn, **1994**; Waldman et al., **1994**) while others did not (Willerman et al., **1974**; Tizard et al., 1972, **1974**; Eyferth, 1961; Moore, **1986**). These studies have been discussed at length elsewhere (Chuck, Feb. 20, **2011**). Whatever final conclusion one draws, or would like to draw depending on ideological inclinations, the samples are very small and most of the relevant information on the adoptees and the adoptive/biological parents is not available. None of the aforementioned studies provide full longitudinal information on adoptees and adoptive families. And yet, ignorant hereditarians cite this research as established proof of a racial genetic hierarchy. On the other side, however, environmentalists have usually been trapped in the same fallacy. They cite transracial adoption data in support of their views without any regard for 1) longitudinal data or 2) the biological parents' characteristics. If adoption gains are hollow with respect to g, as was the case for **educational intervention programs**, we should expect the gains to vanish over time. Besides, if shared environmental (c2) effects decrease over time, we may also expect vanishing gains. Hence the importance of follow-up data.

Concerning the above-cited research, one obviously missing piece of parental data is generally parental background/IQ. For instance, Willerman et al. (**1974**) report that mixed-race (BW) children have higher IQs when the mother is white. They conclude, naturally, that white mothers provide better environments for their children. But Rushton & Jensen (**2005**, p. 262) noted that the white mothers averaged one more year of schooling than the black mothers. Nisbett (**2005**) dismissed the argument on the grounds that one year of schooling is not a big deal. Still, in the NLSY97 and NLSY79, **I found** that blacks and whites differ by 1 SD in IQ while the difference in parental education was about 1 year, no more. This suggests that the "years of education" variable is unlikely to explain much of the gap, as might be expected given that education and IQ are not perfectly correlated. This underscores the importance of having both the adoptive and the biological parents' IQs.

Discussed in Nisbett (**1995**, **2005**) is the strange Tizard et al. (1972) study, reviewed in Tizard (**1974**). It produced a result that no one has succeeded in replicating. In short, it found a genetic advantage for blacks: whites scored the lowest, blacks the highest, and the mixed-race children fell in between. This is the exact opposite of what the literature shows. In the absence of replication, it would be imprudent to take this result at face value. Of course, hereditarians deliberately avoid this study. On the other side, environmentalists were silly enough to over-generalize from a small study that no one has managed to reproduce, and to think that the Tizard study put the nail in the coffin of the hereditarian hypothesis.

Two other controversial studies are usually cited by environmentalists: Eyferth (1961) and Willerman (**1974**). We can easily distrust Eyferth. That study shows nearly no difference between the BW and WW children: the biracial children scored as well as the white children. When looking at the gender groups, however, something unexpected appears (Jensen, **1998**, p. 482). There was an extremely large male-female gap in the white group: the white girls scored 8 points below the white boys. No such difference was found in the biracial group. But generally there is no IQ difference between males and females in childhood (Rushton & Jensen, **2010**, pp. 24-25), which calls the Eyferth sample into question (Mackenzie, **1984**, p. 1229). In adulthood there might be a male advantage, but it is not clear whether that advantage is g-loaded (Jensen, **1998**, pp. 536-540; Flores-Mendoza et al., **2013**, Table 1). In reality, the BW-WW difference is null only because the white girls scored extremely low; when comparing the boys, the BW-WW difference is consistent with the hereditarian hypothesis (HH). Willerman displays a similarly curious pattern. That study found that 4-year-old mixed-race (BW) children of white-mother/black-father couples have higher IQs than mixed-race (BW) children of black-mother/white-father couples. A look at their Table V, however, shows that the 9-point difference is driven by the extremely low scores of the black-mothered males. Besides, the sex difference between BW males and BW females is huge (6 points for the white-mothered and ~20 points for the black-mothered children). Even the Eyferth study shows no sex gap among the biracials. Furthermore, the BW-W IQ gap lacks coherence: there is virtually no difference between BW females of married black mothers and BW females of married white mothers, while the gap between BW males of married black mothers and BW males of married white mothers is about 17 points. The authors have nothing to say about this. Concerning the Moore (**1986**) study, which reported a null IQ difference between black and mixed-race children raised by white families, Chuck (**Dec. 13, 2012**) argued that those numbers do not depart significantly from hereditarian predictions.

Sometimes, environmentalists pair the transracial adoption studies with studies that failed to establish a relationship between African ancestry and IQ. While they claim this older research supports their views, the studies were methodologically flawed (Lee, **2010**; Reed, **1997**; Jensen, **1998**, pp. 478-481). It seems that these two direct tests provided no evidence for either the environmental or the genetic hypothesis, mostly because of their methodological limitations.

On the other hand, the study most cited by hereditarians is surely the Scarr & Weinberg (**1976**) longitudinal study. As Locurto (**1990**) pointed out, it too is missing a lot of important data, e.g., the biological parents' IQ, rendering interpretation rather difficult. Having the biological parents' SES data does not make it easy to estimate what the IQ of the adopted children would have been had they not been adopted.

Overall, one can even argue that whatever the IQs of interracial children would otherwise be, they are probably depressed by psychological disturbance related to self-identity. This is what Nisbett (2009) suggested, but Lee (**2010**) replied that in the Moore (**1986**) study both black and mixed-race children showed elevated IQs when adopted into comfortable white homes. One should not over-generalize from that minor study. But even if Lee is right, there is still another issue: adoptive parents may invest more in (BW) children with one black and one white parent and less in (BB) children with two black parents. Scarr & Weinberg tested this hypothesis and found no evidence of such a (parental) expectancy effect: the IQ of the 12 children wrongly believed by their adoptive parents to have two black parents was similar to the IQ of the 56 children correctly classified by their adoptive parents as having one black and one white parent. As with all the other studies, the first obvious problem is the small sample size.

Even so, this does not necessarily rule out the colorism effect, which holds that regardless of racial group, gender, or age, darker-skinned people face more discrimination, not only at school and on the job but also within the family. The colorism effect should be universal, affecting everyone; one may even argue that parents favor lighter-skinned children. First, skin color variation among siblings (i.e., within families) is probably small, and certainly much smaller than skin color differences between families. One possible and neat test of colorism is therefore to control for family effects. Because colorism is supposed to be universal, it would not predict the absence of a skin color-IQ relationship at the within-family level: differences in skin color between full siblings should correlate with IQ. When such an **analysis** is done in the NLSY97, that correlation is very weak, and much weaker than the skin color-IQ association between sibling groups from different families. These results better support the genetic prediction, according to which skin color is merely an index of parental ancestry or admixture: the admixture effect is statistically controlled when analyzing pairs of full siblings, hence the near absence of a within-family skin color-IQ correlation. This, of course, is something the transracial adoption studies could not reveal. More generally, if darker-skinned people were discriminated against on the basis of physical appearance, there should be a correlation between skin color and wages or education even when demographic and socio-economic factors are held constant. However, the regression coefficient for skin color as a predictor is usually close to zero in the **Add Health** and **GSS** samples.
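The sibling-difference logic described above can be sketched with simulated data (illustrative only, not the NLSY97 analysis): a family-level admixture factor produces a between-family color-IQ correlation, while sibling differences cancel the family term.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000  # sibling pairs

# A family-level factor (ancestry/admixture) drives both skin color and the
# genetic IQ component, so color and IQ correlate BETWEEN families...
family = rng.normal(size=n)
color1 = family + 0.5 * rng.normal(size=n)
color2 = family + 0.5 * rng.normal(size=n)
iq1 = family + rng.normal(size=n)
iq2 = family + rng.normal(size=n)

# Between-family association: correlate the family averages.
r_between = np.corrcoef((color1 + color2) / 2, (iq1 + iq2) / 2)[0, 1]

# Within-family association: correlate sibling DIFFERENCES, which remove
# the shared family factor entirely.
r_within = np.corrcoef(color1 - color2, iq1 - iq2)[0, 1]
```

Under this genetic setup the within-family correlation is near zero even though the between-family correlation is large; universal colorism, by contrast, would predict a nonzero within-family correlation.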

On other possible moderators of adoption gains, the van IJzendoorn (**2005**, pp. 308-309) meta-analysis is informative. It reports that age of adoption (before versus after 12 months) and type of adoption (domestic versus international) have no significant impact on IQ. It did find, however, a large effect size for environmentally deprived adopted children. But there were only four such studies, not six as the authors claimed (Colombo et al., 1992, N=27; Dennis, 1973, N=136; Schiff et al., 1978, N=52; Tizard & Hodges, 1978, N=39), all with small sample sizes (total N=254), and among them the Schiff study is probably fraudulent (Locurto, **1990**). In that study, a small group of children had been adopted by high-SES families while their siblings were raised by the biological parents, with the adopted children showing a 16-point IQ advantage. The problem is that those siblings were in reality mostly half-siblings rather than full siblings, which would lower the IQ resemblance between the two groups; all the more so because some were born illegitimate and were raised not by their biological mothers but by nurses or grandparents, so that in the end very few of these children were raised by both biological parents. As would be expected, the non-adopted "siblings" lacked family stability, unlike the adopted children. All of these factors, according to environmentalists, are liable to lower children's IQ, so the control group was not adequate. Finally, if the fittest babies were more likely to have been selectively adopted, the IQ advantage attributed to adoption gains is probably over-estimated.

Anyway, even in the US black population, the proportion of abused children, or of those dying from malnutrition, must be incredibly small. These large gains should not be over-generalized, even to blacks.

But this all assumes that the IQ gains are g-loaded. If not, the entire premise of this research collapses. When Jensen (**1997**) analyzed the Capron & Duyme adoption data (**1989**, **1996**), no evidence of g-gains was **found**. Although the sample was very small, citing this study is a much better move than citing (useless) transracial adoption data. More recently, Jongeneel-Grimen & te Nijenhuis (2007, unpublished) meta-analyzed a collection of 4 adoption IQ-gain datasets, totalling 691 subjects, finding a true correlation of -0.95 (211% of variance explained by artifactual errors), which jumped to -1.05 after a final correction for deviation from perfect construct validity; while such a correlation is outside the admissible range, this can happen when applying artifact corrections (te Nijenhuis et al., **2007**). Generally, as Jensen (**1998**, pp. 476-477) summarized, there is no good evidence that environmental factors have a large impact on IQ in adolescence/adulthood, except in extreme, non-generalizable cases:
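The artifact corrections invoked in this literature include the classical correction for attenuation; a minimal sketch (with made-up numbers, not the meta-analysis inputs) of how corrected correlations can exceed 1 in magnitude:

```python
import math

def correct_for_attenuation(r_obs, rxx, ryy):
    """Classical disattenuation: rho = r_obs / sqrt(rxx * ryy)."""
    return r_obs / math.sqrt(rxx * ryy)

# Hypothetical inputs for illustration: an observed correlation of -0.75
# combined with modest reliabilities of .70 and .72...
rho = correct_for_attenuation(-0.75, 0.70, 0.72)
# ...already yields a corrected value beyond -1, which is how estimates
# such as the reported -1.05 can fall outside the admissible range.
```

Each additional correction (range restriction, imperfect construct validity) divides by another factor below 1, so sampling error in any input can push the final estimate past the theoretical bound.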

There is simply no good evidence that social environmental factors have a large effect on IQ, particularly in adolescence and beyond, except in cases of extreme environmental deprivation. In the Texas Adoption Study, [54] for example, adoptees whose biological mothers had IQs of ninety-five or below were compared with adoptees whose biological mothers had IQs of 120 or above. Although these children were given up by their mothers in infancy and all were adopted into good homes, the two groups differed by 15.7 IQ points at age 7 years and by 19 IQ points at age 17. These mean differences, which are about one-half of the mean difference between the low-IQ and high-IQ biological mothers of these children, are close to what one would predict from a simple genetic model according to which the standardized regression of offspring on biological parents is .50.

In still another study, Turkheimer [55] used a quite clever adoption design in which each of the adoptee probands was compared against two nonadopted children, one who was reared in the same social class as the adopted proband’s biological mother, the other who was reared in the same social class as the proband’s adoptive mother. (In all cases, the proband’s biological mother was of lower SES than the adoptive mother.) This design would answer the question of whether a child born to a mother of lower SES background and adopted into a family of higher SES background would have an IQ that is closer to children who were born and reared in a lower SES background than to children born and reared in a higher SES background. The result: the proband adoptees’ mean IQ was nearly the same as the mean IQ of the nonadopted children of mothers of lower SES background but differed significantly (by more than 0.5σ) from the mean IQ of the nonadopted children of mothers of higher SES background. In other words, the adopted probands, although reared by adoptive mothers of higher SES than that of the probands’ biological mothers, turned out about the same with respect to IQ as if they had been reared by their biological mothers, who were of lower SES. Again, it appears that the family social environment has a surprisingly weak influence on IQ. This broad factor therefore would seem to carry little explanatory weight for the IQ differences between the WW, BW, and BB groups in the transracial adoption study.

Even if the information about the (biological and adoptive) parents and the adopted children were so plentiful that we need not worry about missing anything crucial for assessing parents' and children's characteristics longitudinally, the problem of the hypothetical dual hypothesis, which states that within-group (WG) differences and between-group (BG) differences have different, independent causes, would still have to be dealt with.

This is because it could be argued that what constitutes a good environment for whites is not necessarily a good environment for blacks, that is, that blacks and whites are affected by the same environment in different ways. But when confronted with empirical data, no such racism effect emerges. Perhaps the most sophisticated method for assessing the effect of racism on minorities is structural equation modeling (SEM). Such analyses have been performed by Rowe and his colleagues (**1994**, **1995**; Rowe & Cleveland, **1996**). Rowe (**2005**) discusses this research:

The first research question was whether the covariance structures (i.e., the correlations among variables) were quantitatively the same in Blacks and Whites. For example, if Blacks had special causes of variation in mathematics that did not exist in Whites, then the total variance of math scores would be greater in Blacks than in Whites. Or if shared environmental effects were twice as strong in Blacks than in Whites, tests scores would correlate more highly within sibling pairs in Blacks than in Whites. Equal covariance matrices in the two populations, however, would imply a similarity of influences on academic achievement.

… The correlations among the three tests were nearly identical in the four groups (2 races x 2 sibling types), with the two verbal tests correlating approximately .80, and the math test correlating with each verbal test approximately .60. Sibling correlations were also on the same order of magnitude in equivalent groups. Hence, a striking similarity of the two races was observed: They were nearly identical in the association of the variables (Rowe, Vazsonyi, & Flannery, 1994; see also Jensen, 1998, pp. 350–530). As expected under a genetic hypothesis, correlations were greater for full siblings than for half-siblings. For instance, the sibling correlations on reading comprehension were .36 and .42 in White and Black full siblings, respectively, compared with .09 and .22 in White and Black half-siblings, respectively. The correlation pattern, however, did not always support a genetic hypothesis; but the Black sample was small, and thus its correlations had large standard errors. Because the method of maximum likelihood was employed, the structural equations’ fit used all the statistical information in the covariance matrices. In the best-fitting model, both the genetic and shared environmental latent variables were retained.

Once the equivalence of correlation matrices between Blacks and Whites has been established, a second step is fitting the racial means. In the model, the latent genetic and shared environmental factors were permitted to have a racial mean difference. The product of factor loadings of a test and this mean difference should reproduce the observed PIAT mean. Because the PIAT racial differences must be proportional to factor loadings for the model to be correct, where mean differences belong in a model of within-group variation can be tested statistically. A good fit increases one’s confidence in the explanation of mean differences.

On the PIAT subtests, racial mean differences ranged from 0.3 to 0.5 standard deviation units. This relatively small racial difference may reflect the sampling bias noted earlier (i.e., that the siblings were the offspring of young mothers). It is possible to calculate from the factor loadings and a factor’s mean difference the percentage of a test’s mean difference due to shared environment and to genes. In the best-fit SEM, the genetic factor accounted for 66%–74% of the racial mean difference in reading comprehension and reading recognition and 36% of the racial mean difference in mathematics, which was the test most strongly loaded on the shared environment factor. The shared environmental latent factor accounted for the remainder of the mean differences.
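The first step Rowe describes, checking whether correlations are equal across groups, can be sketched with a Fisher z comparison (the correlations come from the quoted passage; the sample sizes here are hypothetical):

```python
import math

def fisher_z_test(r1, n1, r2, n2):
    """Two-sample z test for equality of two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se

# Reading-comprehension sibling correlations quoted above: .36 (White full
# siblings) vs .42 (Black full siblings); the n's are assumed for illustration.
z = fisher_z_test(0.36, 400, 0.42, 150)
# |z| well below 1.96: such correlations would not differ significantly.
```

The full SEM analysis Rowe reports tests all such constraints simultaneously on the covariance matrices via maximum likelihood, but the pairwise logic is the same.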

In sum, if minority and majority children attained a similar level of achievement/intellectual ability through different developmental processes, there would be statistically significant between-race differences in the correlations of, say, IQ with achievement or with familial environmental measures. But no such differences between correlations have been found at the between-race level (blacks, Hispanics, whites, Asians). The factors and relationships that mattered at the within-race level also mattered at the between-race level, meaning that any environmental factors responsible for race differences must originate from environmental factors operating at the individual level. This completely rules out the "racism argument". With regard to psychometrics, strictly speaking, Dolan (**2000**, **2001**) showed that measurement bias was not present when comparing lower black IQ with higher white IQ in the USA, and Lubke (**2003**) explains that evidence of measurement equivalence implies that the factors responsible for individual IQ differences and those responsible for racial IQ differences are of the same nature. Transracial adoption data cannot rule out this argument, however. Like correlational analyses, it provides no insight into causal pathways. Such data is therefore totally useless here.

As seen above, many questions are not answered directly by transracial adoption data. The only interest of transracial adoption and mixed-race data lies in showing whether they contradict the hereditarian hypothesis. Establishing a non-significant IQ difference between black adoptees and white adoptees raised by white parents would indeed contradict the hereditarian model; but a significant IQ difference would in no way provide support for a causal genetic model. It is not a bad move for environmentalists to cite those studies when they do not support the hereditarian hypothesis, but to think that a causal pathway has been established whenever the results are consistent with that hypothesis is a foolish idea. That is a bad move.

Kenneth G. Brown, Huy Le and Frank L. Schmidt

University of Iowa

International Journal of Selection and Assessment, Volume 14, Number 2, June 2006

There has been controversy over the years about whether specific mental abilities increment validity for predicting performance above and beyond the validity of general mental ability (GMA). Despite its appeal, specific aptitude theory has received only sporadic empirical support. Using more exact statistical and measurement methods and a larger data set than previous studies, this study provides further evidence that specific aptitude theory is not tenable with regard to training performance. Across 10 jobs, differential weighting of specific aptitudes and of specific aptitude tests was found not to improve the prediction of training performance over the validity of GMA. Implications of this finding for training research and practice are discussed.

Training is essential in today’s work organizations to help employees keep pace with rapid changes in the social, legal, and technical environments (Callanan & Greenhaus, 1999; Salas & Cannon-Bowers, 2001). From the organization’s perspective, training is an investment in employees, so understanding which employees benefit most from training is critically important. Research on this question has focused on many different trainee characteristics (Colquitt, LePine, & Noe, 2000; Noe, 1986), but the largest effects have been for general mental ability (GMA). GMA is often called intelligence and it is the common factor underlying performance on all mental ability tests (Jensen, 1998). Over the past 10 years there has been substantial theoretical and empirical progress in the study of GMA, and it is considered by many to be the best validated individual difference construct in psychology (Lubinski, 2000; Schmidt, 2002).

There has been some controversy over the years about whether specific mental abilities, measured by the tests that are used as indicators of GMA, are useful for predicting performance above and beyond the general factor (Ree, Earles, & Teachout, 1994). Many authors have proposed that differential weighting (such as via regression) of specific ability tests should yield better prediction of job and training performance than measures of GMA. This hypothesis is referred to as specific aptitude theory or differential aptitude theory, and it has been around for quite some time (Hull, 1928; Thurstone, 1938). Examples of it in practice and research are easy to provide. For example, a trainer who believes that results from a spatial ability test would predict performance in a computer-aided design course better than GMA, or that results from a vocabulary test would predict performance in a communication course better than GMA, subscribes to specific aptitude theory (see Schmidt, 2002, for selection-related examples). Researchers who subscribe to this theory use specific aptitude measures, such as quantitative, verbal, or spatial ability tests to predict performance criteria. They may also use differentially weighted combinations of specific aptitude tests that are weighted to match the expected ability demands of the job being studied (e.g., Hedge, Carter, Borman, Monzon, & Foley, 1992). As one published example of this practice, Mumford, Weeks, Harding, and Fleishman (1988) used tailored mental ability test composites (specific combinations of ability subtests weighted to match job requirements), rather than a GMA score, to predict training grades.

Although specific aptitude theory continues to be used in research and practice, there has been little empirical support for it. Prior large sample research suggests that, for both training and job performance, weighted combinations of specific cognitive aptitudes explain little if any variance beyond GMA (Hunter, 1986; McHenry, Hough, Toquam, Hanson, & Ashworth, 1990; Ree & Earles, 1991; Ree et al., 1994; Schmidt, 2002). Moreover, a meta-analytic comparison between GMA and specific aptitude tests revealed that validities for GMA are always higher than for specific aptitudes (Salgado, Anderson, Moscoso, Bertua, & de Fruyt, 2003a).

However, the prior research testing specific aptitude theory has limitations that the present study circumvents. Some studies using training performance as a dependent variable are limited in that they average across job families when estimating the relative validities of GMA and specific aptitudes (e.g., Hunter, 1986). This procedure may make it less likely to find effects for specific aptitudes, as the ability demands of training programs may differ across jobs that are grouped together in large job families. Other studies on training performance are limited because they do not correct fully for measurement error (e.g., Ree & Earles, 1991). Failing to correct for measurement error leads to biased multiple correlations and regression weights, and, potentially, erroneous conclusions about construct-level relationships (Hunter & Schmidt, 2004; Schmidt, Hunter, & Caplan, 1981).

There is research on specific aptitude theory using job performance as the criterion, but these studies also have limitations. First, these studies also do not fully correct for measurement error (e.g., McHenry et al., 1990; Ree et al., 1994). In addition, they use small samples within job family (Ree et al., 1994) or present results only for large job families (McHenry et al., 1990). Reliance on small samples increases the likelihood that results are distorted by sampling error. Finally, the generalizability of findings from job performance to training performance should not be assumed.

The purpose of this study is to examine specific aptitude theory with regard to training performance. This study improves on prior studies by using a larger data set and improved statistical methods. The data set includes 10 large sample training schools in the Navy, with an average sample size of 2608 [in contrast to average sample sizes of 148 (Ree et al., 1994) and 952 (Ree & Earles, 1991)]. The data analyses correct for range restriction and measurement error, and both regression and structural equation modeling (SEM) are used to ensure that results are not limited to one analytic approach. Correcting for measurement error is an important advance in this study, as prior research has not consistently performed such corrections. In this study, the ‘‘true score’’ or construct-level relationships between mental abilities and training performance are estimated along with the relationships at the observed score level. As described later, these two types of analyses answer different questions.

The data used in this study have an additional property that enhances their information value. The jobs under study differ in complexity level, allowing us to test specific aptitude theory across jobs of different complexity levels. Prior research has demonstrated that job complexity moderates the relationship between GMA and job performance (Hunter & Hunter, 1984; Salgado et al., 2003b; Schmidt & Hunter, 1998), with higher complexity jobs exhibiting higher validities. However, moderation of validity by complexity level is often weak for performance in training programs (e.g., Hunter & Hunter, 1984), possibly because of the pooling of data into large job families. Therefore, we explore the moderating effect of complexity with particular emphasis on whether specific aptitude theory holds in training certain jobs, but not others. It could be argued that specific aptitude theory is more likely to hold in training programs for low complexity jobs, where the effect for GMA is lower.

**Specific Aptitude Theory**

There are three levels of ability that can be estimated from mental ability tests: specific aptitudes, general aptitudes, and GMA. Specific aptitudes are assessed by individual tests, such as paragraph comprehension, mathematics knowledge, or mechanical comprehension. Such tests are often correlated and can be combined to measure general aptitudes, such as verbal or quantitative ability. At the broadest level, GMA represents the shared variance among all of these tests. Lubinski (2000) has noted that conceptual definitions of GMA vary, but generally converge on abilities to engage in abstract reasoning, solve complex problems, and acquire new knowledge. There is considerable agreement that mental abilities are organized hierarchically with GMA serving as a latent factor causing the positive correlations among various mental ability tests. This approach to conceptualizing and operationalizing GMA has resulted in a wealth of validity evidence supporting the conclusion that GMA predicts many life and work-related outcomes (Jensen, 1998; Lubinski, 2000; Ones, Viswesvaran, & Dilchert, 2004; Schmidt & Hunter, 2004).

Specific aptitude theory suggests that regression weighted combinations of specific and/or general aptitudes will be better predictors of work-related outcomes than GMA. For example, in occupations that include numerous math-related tasks such as accounting or financial planning, it would be hypothesized that the regression weight on quantitative aptitude would be larger than the weight for other aptitudes. Moreover, it would be hypothesized that the multiple R produced by the specific aptitudes tests would be larger than the zero-order validity of a GMA measure, which would include only the shared variance among all the specific aptitude tests used as its indicators.

**Prior Research**

Most recently published evidence disconfirms specific aptitude theory. Four studies are noteworthy because they employ large samples and suggest that specific aptitudes provide little incremental prediction over GMA. Two of these studies examine training performance (Hunter, 1986; Ree & Earles, 1991), and the other two examine job performance (McHenry et al., 1990; Ree et al., 1994). Each is discussed below followed by an explanation of its limitations.

Training Performance. Hunter (1986) summarized data from 82,437 military trainees to show that the average predictive validity for GMA (.63) is equal to or higher than the average predictive validity (average adjusted multiple R) of specific ability test composites (.58–.63). The primary limitation of the Hunter (1986) study is that validities are not reported for individual jobs but for large groups of jobs. It could be that different specific aptitudes are important in different jobs, and that averaging across jobs masks differences in specific aptitude validities. As a result, the importance of specific aptitudes may have been underestimated.

Ree and Earles (1991) examined 78,041 Air Force enlistees who completed both basic and specific job training programs. Across 82 job training programs, the authors demonstrated that the factors in the Armed Services Vocational Aptitude Battery (ASVAB) remaining after controlling for the first principal component, which represents GMA, produced little incremental validity over the GMA factor. A limitation of this work is that analyses did not correct for measurement error in either the independent or dependent variables. The limitation of this approach will be discussed in more detail later in this section.

Job Performance. Two studies examined specific aptitude theory but with job performance as the dependent variable. As part of Project A, McHenry et al. (1990) analyzed nine jobs (average N = 449) and found that across five job performance factors the validity of GMA was always greater than the validity of spatial ability or perceptual-psychomotor ability. Using Air Force data across seven jobs (average N = 148), Ree et al. (1994) found that specific abilities incremented the prediction of job performance over GMA by only a small amount (.02 on average). Neither of these studies performed corrections for measurement error.

In summary, the evidence presented to date casts doubt on specific aptitude theory. However, limitations in these studies indicate the need for further research. A stronger test of the theory with regard to training performance would examine validities for training success using large samples for individual jobs, and would fully correct for measurement error. In addition, this research would examine separately the role of specific aptitudes, general aptitudes, and GMA in predicting training success.

Role of Measurement Error. As noted by Schmidt et al. (1981), theory-driven research should examine validities at the true score or construct level. The true-score level refers to the relationship among the constructs free from measurement error and other statistical biases. Examining validities calculated on imperfect measures often produces an inaccurate picture of the relative importance of the abilities themselves. This occurs because partialling out imperfect measures does not fully partial out the effects of underlying constructs (Schmidt et al., 1981). To illustrate, suppose that Ability A is a cause of training performance but Ability B is not. Suppose further that the tests assessing these abilities have reliabilities of .80 and are positively correlated (as occurs with all mental ability tests). Because Ability B is correlated with Ability A, Ability B will show a substantial validity for training performance. Moreover, because Ability A is not measured with perfect reliability, partialling it from Ability B in a regression analysis would not partial out all of the variance attributable to Ability A. Thus, the measure of Ability B will receive a substantial regression weight when in fact the construct-level regression weight is zero. That is, Ability B will predict training performance even though it is not a true underlying cause of training performance. In this case, Ability B will appear to increment validity over Ability A only because of the presence of measurement error (see also Schmidt & Hunter, 1996).
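The Ability A/Ability B argument in this paragraph can be verified algebraically with the standard two-predictor regression formulas. The sketch below uses illustrative construct-level correlations (.60 between the abilities, .50 between Ability A and performance; these values are not from the study) and, for simplicity, assumes a perfectly reliable criterion:

```python
import math

def beta_weights(r_ay, r_by, r_ab):
    """Standardized regression weights for two correlated predictors."""
    denom = 1.0 - r_ab ** 2
    beta_a = (r_ay - r_by * r_ab) / denom
    beta_b = (r_by - r_ay * r_ab) / denom
    return beta_a, beta_b

# Construct (true-score) level: Ability A causes performance, B does not,
# so B correlates with performance only through its correlation with A.
r_ab_true, r_ay_true = 0.60, 0.50
r_by_true = r_ab_true * r_ay_true

_, beta_b_true = beta_weights(r_ay_true, r_by_true, r_ab_true)
print(round(beta_b_true, 3))   # 0.0 -- B has no unique effect at the construct level

# Observed level: both ability tests have reliability .80.
rel = 0.80
r_ab_obs = r_ab_true * rel              # attenuated by sqrt(.8) * sqrt(.8)
r_ay_obs = r_ay_true * math.sqrt(rel)
r_by_obs = r_by_true * math.sqrt(rel)

_, beta_b_obs = beta_weights(r_ay_obs, r_by_obs, r_ab_obs)
print(round(beta_b_obs, 3))    # positive -- spurious incremental weight for B
```

Partialling the fallible measure of A does not remove all of A's construct variance from B, so the observed measure of B receives a nonzero weight even though its construct-level weight is exactly zero.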

To obtain accurate population estimates, the relationships among predictors and criterion must be corrected for measurement error before computing validities. None of the prior research on specific aptitude theory has examined prediction with true scores. That is, research to date has not examined the relationships among specific aptitudes, GMA, and performance after correcting for measurement error in both the criterion and predictors.

**Hypotheses**

In contrast to specific aptitude theory, GMA theory predicts that the primary cognitive variance that predicts learning outcomes, such as training performance, will be contained in the general factor underlying mental ability tests scores (Jensen, 1998; Schmidt & Hunter, 2004). General intelligence has been shown to predict learning in countless studies, and it is viewed by many to be the primary individual difference determinant of learning outcomes (Gottfredson, 2002; Lubinski, 2000).

As presented by Schmidt and Hunter (2004), the GMA model of training performance implies a model in which there are no effects for specific and/or general aptitudes on training performance above that accounted for by GMA. This model as captured by subtests of the ASVAB is depicted in Figure 1. Verbal (VERBAL), quantitative (QUANT), and technical (TECHN) are general aptitudes captured by various tests in the ASVAB. In comparison, Figure 2 presents a model that does not contain the GMA factor, and the three general aptitudes directly influence training performance. Specific aptitude theory predicts that the model in Figure 2 will result in better prediction of training performance and better model fit than that produced by the model in Figure 1. On the other hand, GMA theory predicts that the use of specific mental ability tests or general aptitudes will produce no gain in prediction over and above that produced by the GMA factor.

Another issue that is examined with these data is whether the magnitude of the effect of GMA varies across training programs for different jobs. Evidence for such differences is mixed. Hunter and Hunter (1984) found relatively small differences in predictive validities for training performance across job families, but the jobs in this study had limited variability in complexity. The jobs studied were of medium or higher complexity. In contrast, Salgado et al. (2003b) found that, after correcting for multiple statistical artifacts, training validities increased from low to high levels of job complexity (r’s of .36, .53, .72 with increasing complexity). This latter finding is consistent with the general finding that GMA predicts performance better for more complex jobs (Gottfredson, 2002; Schmidt & Hunter, 2004). Consistent with these findings, we predict that GMA validities will be higher for training programs of more complex jobs. Moreover, we expect that, if specific aptitude theory is supported at all, it will receive more support in jobs of lower complexity where the effects of GMA are smaller.

**Method**

**Sample**

Data for this study were drawn primarily from three sources. First, predictive validities for the ASVAB test battery were obtained from 26,097 trainees enrolled in 10 of the largest Navy technical (‘‘A’’ Class) schools in 1988. Schools and their associated jobs are described in Table 1. Specific demographic information could not be obtained on these particular trainees but it is known that they were nearly all males between the ages of 18 and 30, with the majority being Caucasian. Second, correlations among subtests of the ASVAB were obtained from the 1987 applicant population (N = 143,856). This eliminated the need to correct the subtest inter-correlations for range restriction because the correlation matrix is the population matrix of interest (as explained later, the validity coefficients did require correction for range restriction). Third, we calculated test reliabilities from the alternate form reliabilities of ASVAB subtests from the 1983 norming study with 5,517 service applicants (Technical Supplement to the Counselor’s Manual for the ASVAB Form-14, 1985). We used the reliabilities for males in grades 11 and 12, as the majority of trainees in this study were male. The reliabilities were adjusted to correspond to the test score standard deviations in the 1987 applicant population (see Magnusson, 1966, pp. 75–76; Nunnally & Bernstein, 1994, p. 261, Eq. 7–6). The test reliabilities ranged from .91 (mathematics knowledge) to .78 (electronics information). Because the alternative form reliabilities were obtained by correlating tests taken on the same day, these reliabilities are slightly inflated. They do not fully control for transient measurement error (Schmidt, Le, & Ilies, 2003). Consequently, these reliability estimates result in a slight undercorrection for measurement error.

**Measures**

Training success was assessed with final school grade (FSG) received by trainees in their school. FSG is typically created as the average of several multiple choice test scores administered throughout training (e.g., Ree & Earles, 1991). We could find no established estimate for reliability of FSG, but as it is based on multiple tests within each course, it is likely to be highly reliable. In the reported analyses we assumed a reliability of .90. Analyses were also conducted presuming no measurement error (reliability = 1.0) and lower reliability (reliability = .80). The results (available upon request) did not vary substantially from those reported here.

Specific aptitudes were measured as scores on individual ASVAB tests. Subjects took one of the parallel forms of the ASVAB administered in 1988 – Form 11, 12, 13, or 14. Schmidt and Hunter (2004) presented a measurement model for GMA based on six subtests of the ASVAB: Word Knowledge (WK), General Science (GS), Arithmetic Reasoning (AR), Mathematics Knowledge (MK), Mechanical Comprehension (MC), and Electronics Information (EI). In this study, the Paragraph Comprehension (PC) test was substituted for the GS test as an indicator of Verbal aptitude because it has a lower cross loading with the Technical general aptitude factor described below. The other ASVAB subtests were not included either because they are speeded tests that have low loadings on general aptitude and GMA factors (Coding Speed and Numerical Operations; Hunter, 1986; McHenry et al., 1990) or because of cross-loadings across general aptitude factors (General Science and Auto/Shop Knowledge; Kass, Mitchell, Grafton, & Wing, 1982). Notably, in 2002, the Coding Speed and Numerical Operations subtests were dropped from the ASVAB. More complete descriptions of these tests are available elsewhere (e.g., Kass et al., 1982; Murphy, 1984).

The differential weighting asserted by specific aptitude theory was operationalized via regression and path analysis. In the regression analyses, general aptitudes were assessed as composites of their associated specific aptitudes. General aptitude factors were estimated based on the following equally weighted indicators: Quantitative (Q: AR and MK), Technical (T: MC and EI), and Verbal (V: WK and PC). For use in the true score regression analysis, reliabilities of these composites were estimated using the composite reliability formula in Hunter and Schmidt (2004, p. 438, Eq. 10.14). The reliabilities were .85 (V), .86 (Q), and .80 (T). In the SEM (or path) analyses, general aptitudes were operationalized as the latent factor causing their two associated indicator tests.

The ASVAB does not have an overall score, nor is one created by the military in the use of this particular test. For the bivariate analysis at the observed score level, we created an overall GMA composite that is the equally weighted sum of the three general aptitude scores defined earlier. For example, the Quantitative aptitude score was defined as AR+MK. For each job, the observed correlation between this GMA composite and the criterion of training success was computed using the formula for the correlation of composites given in Nunnally and Bernstein (1994). Reliability of this composite was estimated to be .85 using the composite reliability formula (Hunter & Schmidt, 2004, p. 438), and this reliability was used to make the correction for measurement error required to estimate the construct level GMA correlation with the training success criterion. In SEM analyses, GMA was operationalized as a second order factor causing the three general aptitudes.
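The two composite formulas referenced in this paragraph can be sketched as follows. This is a minimal illustration: the unit-weighted composite-criterion correlation follows Nunnally and Bernstein (1994), and the composite reliability follows the Mosier-type formula in Hunter and Schmidt (2004); the numerical inputs are hypothetical, not values from the article:

```python
import numpy as np

def composite_criterion_r(r_xy, R):
    """Correlation of a unit-weighted sum of predictors with a criterion:
    sum of the individual validities divided by the square root of the
    sum of all entries of the predictor intercorrelation matrix."""
    return np.sum(r_xy) / np.sqrt(np.sum(R))

def composite_reliability(rels, R):
    """Reliability of a unit-weighted composite: the unit diagonal of R is
    replaced by the component reliabilities in the numerator."""
    k = len(rels)
    off_diag = np.sum(R) - k            # sum of off-diagonal correlations
    return (np.sum(rels) + off_diag) / (k + off_diag)

# Illustrative values: three general aptitude composites.
R = np.array([[1.00, 0.65, 0.55],
              [0.65, 1.00, 0.60],
              [0.55, 0.60, 1.00]])
r_xy = np.array([0.45, 0.50, 0.40])     # validity of each composite
rels = np.array([0.85, 0.86, 0.80])     # reliability of each composite

print(round(composite_criterion_r(r_xy, R), 3))
print(round(composite_reliability(rels, R), 3))
```

The composite reliability is then used in the classic disattenuation correction to estimate the construct-level GMA validity.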

Training complexity was assessed with two measures obtained from different sources. Length of training (in days) was obtained via archival descriptions of the training programs on a Navy recruiting website. Length of training varied from 30 to 89 days. Unfortunately, data could not be obtained on two jobs that had been phased out by the Navy since 1988. Length of training should capture the relative complexity of the training program, as longer training programs would be necessary to cover the knowledge requirements of more complex jobs. The second source of complexity data was Hedge, Carter, Borman, Monzon, and Foley (1992). The authors had 23 experts rate the ability requirements of Navy technical schools, including those in this study. Experts rated the quantitative, verbal, and technical ability requirements on a 3-point scale (0 = ability not required, 1 = ability somewhat important for success, and 2 = ability very important for success), with considerable agreement (intraclass correlation of .95). The sum of these ratings was used as the measure of complexity for each school, as greater mental ability requirements would be estimated for more complicated training programs. The ability requirement measure of complexity varied from two to five, and despite its limited range, it correlated highly with length of training (r = .60).

**Analysis**

Prior research on specific aptitude theory has tended to use a single analytical technique – either regression based on observed scores, or regression based on partially corrected scores (corrected only for range restriction and measurement error in the dependent variable). In this study we present regressions for partially and fully corrected scores, and we present SEM results. SEM results also fully correct for measurement error, although the statistical method used differs from the method we use to perform the fully corrected regression. As a result, the inclusion of SEM results reveals whether the construct-level results vary by statistical technique.

Before all analyses, the predictive validities were corrected for range restriction, using the Lawley (1943) formula, and for measurement error in the FSG measure of training success using the classic disattenuation formula. As noted earlier, subtest inter-correlations were not corrected for range restriction because the applicant population matrix was used in all analyses.
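The Lawley (1943) correction is multivariate, but the logic of the two corrections applied here can be illustrated with their simpler univariate counterparts: the Thorndike Case II correction for direct range restriction and the classic disattenuation formula. The numbers below are hypothetical, chosen only to show the order in which the corrections are applied:

```python
import math

def correct_range_restriction(r, u):
    """Thorndike Case II: univariate correction for direct range restriction;
    u = SD(applicant population) / SD(restricted sample)."""
    return r * u / math.sqrt(1 + r**2 * (u**2 - 1))

def correct_criterion_unreliability(r, rel_y):
    """Classic disattenuation for measurement error in the criterion only."""
    return r / math.sqrt(rel_y)

r_restricted = 0.30   # hypothetical observed validity in the selected sample
u = 1.5               # hypothetical ratio of applicant SD to restricted SD
rel_fsg = 0.90        # the article's assumed reliability of final school grade

r1 = correct_range_restriction(r_restricted, u)
r2 = correct_criterion_unreliability(r1, rel_fsg)
print(round(r1, 3), round(r2, 3))
```

Note that the predictor intercorrelations required no such correction because the unrestricted applicant-population matrix was used directly.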

Figure 3 summarizes the analysis plan. In Figure 3, the first row indicates that we present three regression analyses with partially corrected scores. These analyses are similar to the analyses presented by Ree and Earles (1991), and make no adjustments for measurement error in the predictors.

The second row indicates that we present three regression analyses with true scores, correcting the observed data for measurement error in the predictors.

These results provide a picture of construct-level relationships, rather than relationships between imperfect measures. Regressions in both rows were conducted with Hunter’s program REGRESS, which provides accurate standard error estimates for regression coefficients based on corrected correlations (Hunter & Cohen, 1995).

The third row indicates that two SEM tests are also conducted using LISREL 8.51, which uses a different estimation algorithm (ML instead of OLS) and a somewhat different method of correcting for measurement error. More specifically, corrections for measurement error in SEM are based on the congeneric model of measurement equivalence, in contrast to the parallel forms model of measurement equivalence that is the basis for corrections made using reliability coefficients (Nunnally & Bernstein, 1994).

In the analyses in which GMA is the only predictor (C, F, and H in Figure 3), the statistic of interest is the zero order correlation between GMA and the criterion. In all of the analyses with multiple predictors, the primary statistic of interest is the adjusted R. The adjustment for capitalization on chance was conducted using the Wherry formula (Cattin, 1980), which provides an estimate of the R that would be produced by the population regression weights. The sample size used in this adjustment was derived using a formula from Schmidt, Hunter, and Larson (1988); this formula is described by Ree et al. (1994). The formula adjusts the actual sample size to account for the increase in sampling error caused by range restriction corrections. One minor adjustment was made to the formula reported by Schmidt et al. (1988) and Ree et al. (1994). The standard error of the corrected correlation that was used to calculate the ‘‘Effective N’’ was calculated using a more accurate formula discussed by Raju and Brand (2003) and Hunter and Schmidt (2004, p. 109, Eq. 3.21). Table 1 provides the resulting ‘‘Effective N,’’ which is the N used in calculating the adjusted R (and all other statistics that include sample size).
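The Wherry shrinkage adjustment mentioned above can be sketched in a few lines. This is the common textbook form of the formula (the effective N of 2000 and the observed R of .55 are illustrative, not values from a specific school):

```python
import math

def wherry_adjusted_R(R, n, k):
    """Wherry shrinkage formula: estimates the multiple correlation that the
    population regression weights would produce, removing the capitalization
    on chance inherent in sample-estimated weights (k = number of predictors)."""
    r2_adj = 1 - (1 - R**2) * (n - 1) / (n - k - 1)
    return math.sqrt(max(r2_adj, 0.0))

# Illustrative: six-subtest regression with an effective N of 2000.
print(round(wherry_adjusted_R(0.55, 2000, 6), 3))
```

With effective sample sizes in the thousands, the adjustment is small; it matters more when the range-restriction correction substantially reduces the effective N.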

Differences in predictive validity were examined across analyses A, B, and C, and across D, E, and F. Specific aptitude theory would suggest that validities for A and B should be larger than for C, and those for D and E should be larger than for F. Moreover, to the extent that specific aptitudes are more important in jobs that are lower in complexity, these differences should be more pronounced in jobs that have shorter training times and lower overall ability requirements. If, on the other hand, GMA provides equal or better prediction of training success in equations C and F, then specific aptitude theory is disconfirmed.

In the SEM analyses, the fit of the general aptitude and GMA models within each training school was examined, as well as the predictive validities. In addition to R values (adjusted for capitalization on chance in the general aptitude model), model fit for the general aptitude model and GMA model were calculated for comparison. Model fit statistics for the specific aptitude model (six tests predicting training performance) are not reported because the model is fully saturated (i.e., model fit is perfect).

The general aptitude (Analysis G) and GMA models (Analysis H) are not nested models because they contain different numbers of latent factors (three vs. four, respectively). Most methodologists suggest that informational or descriptive fit statistics (rather than comparative fit statistics) should be used under these conditions (Browne & Cudeck, 1993); such statistics do not use baseline models as the standard by which fit is judged. In the case of non-nested models, the baseline models differ, so observed differences in comparative model fit are difficult to interpret. Consequently, the following descriptive fit statistics are presented: (1) χ² to degree of freedom ratio, (2) Root mean square error of approximation (RMSEA), (3) Expected cross-validation index (ECVI), and (4) Akaike Information Criterion (AIC). There are no standard interpretations for the χ² to degree of freedom ratio (Bollen, 1989), but lower values indicate better fit. RMSEA values are typically interpreted as follows: .05 or lower indicates good model fit; .05 to .08 fair fit; .08 to .10 mediocre fit; and over .10 poor fit (MacCallum, Browne, & Sugawara, 1996). Both ECVI and AIC are less frequently used than comparative fit indices, and they do not have a standard interpretation; instead, they are used to directly compare alternative models. Both present descriptive values about the degree of fit of the predicted to observed correlation matrix. Thus, as with the χ² to degree of freedom ratio and RMSEA, lower values indicate better model fit. Although not typically suggested for comparing non-nested models, one comparative fit index is presented for purposes of illustration, the Tucker–Lewis Index or non-normed fit index (NNFI). In combination these fit statistics allow for a determination of whether the general aptitude or GMA models provide a relatively better fit to the observed data.
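The RMSEA point estimate and the cutoffs described above can be computed directly from a model's χ², degrees of freedom, and sample size. A minimal sketch (the χ², df, and N below are hypothetical, not results from the article):

```python
import math

def rmsea(chi2, df, n):
    """Point estimate of RMSEA from the model chi-square
    (see MacCallum, Browne, & Sugawara, 1996)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def interpret_rmsea(value):
    """Conventional cutoffs from MacCallum et al. (1996)."""
    if value <= 0.05:
        return "good"
    if value <= 0.08:
        return "fair"
    if value <= 0.10:
        return "mediocre"
    return "poor"

# Illustrative values for a single training school.
val = rmsea(chi2=120.0, df=24, n=2600)
print(round(val, 3), interpret_rmsea(val))
```

Because RMSEA penalizes model complexity through df rather than through a baseline model, it is suitable for the non-nested comparison between the general aptitude and GMA models.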

**Results**

Tables 1–3 summarize the data used for the analyses. Table 1 describes the training programs and presents the data on sample sizes, training length, and expert-rated ability requirements. Table 2 reports the reliabilities of and uncorrected inter-correlations among the six ASVAB subtests used in this study. As would be expected in an unrestricted sample, the tests are highly correlated (r’s range from .51 to .75), and reliable (alternate form reliabilities range from .78 to .91).

Table 3 reports the validities of the subtests for predicting FSG by school, corrected for range restriction and measurement error in FSG (but not for measurement error in the tests). The quantitative subtests display higher predictive validities than the other subtests, but the sample-weighted mean validities across tests (collapsed across schools) do not appear to vary substantially (r = .40–.49).

That is, the tests perform similarly in predicting FSG. In contrast, the mean validities across schools (collapsed across tests) vary substantially (r = .34–.58), suggesting that the validity of mental ability tests varies across training programs for different jobs.

Tables 4 and 5 summarize the regression analyses. Table 4 summarizes the regression analysis based on observed predictor scores. Analysis A presents the prediction of FSG by the 6 subtests. Across the 10 schools, adjusted R values range from .43 (BT) to .73 (ET), with a sample-weighted mean value of .55 across schools. Analysis B presents the prediction of FSG by the 3 general aptitudes. Adjusted R values range from .43 (BT) to .73 (ET), with a sample-weighted mean value of .55. Analysis C presents the prediction of FSG by GMA; zero-order prediction ranged from .42 (AM) to .71 (ET) with a sample-weighted mean of .55.

The last two columns in Table 4 present the differences between these values, which are remarkably small and do not vary much between schools. Because the pattern of results is similar across the 10 schools, averages are informative. The average difference in adjusted R between A and B is .00, between A and C is .01, and between B and C is .01. These gains in predictive validity for using specific aptitude or general aptitude over prediction from GMA are very small. These results shed light on the predictive gains from regression-weighted measures of specific and general aptitudes in an applied selection context; the maximum improvement in validity from using specific or general aptitudes is less than 2%, which was the figure reported by Ree and Earles (1991) and Ree et al. (1994).

As noted earlier, the results of observed score analyses can be misleading when one’s concern is theoretical and the research questions of interest involve the underlying constructs. This occurs because measurement error in the predictors can distort both the multiple correlations and the relative size of the regression weights and cause observed measures to show incremental validity that does not exist at the level of the constructs.
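This distortion is the classical attenuation phenomenon: an observed correlation understates the construct-level correlation in proportion to the measures' reliabilities. A numerical sketch of the standard correction (hypothetical values, not figures from this study):

```python
import math

def disattenuate(r_xy, rel_x, rel_y=1.0):
    """Classical correction for attenuation: estimate the construct-level
    correlation from an observed correlation and the measures' reliabilities."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical: an observed validity of .45 with predictor reliability .81
# implies a construct-level correlation of .50.
print(round(disattenuate(0.45, 0.81), 2))  # 0.5
```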

Table 5 summarizes the regression analyses for scores corrected for measurement error in the predictors. Analysis D presents the prediction of FSG by the six subtests. Across the 10 schools, adjusted R values range from .44 (BT) to .75 (ET), with a sample-weighted mean value of .56 across schools. Analysis E presents the prediction of FSG by the three general aptitude constructs. Adjusted R values range from .45 (BT) to .77 (ET), with a sample-weighted mean value of .58. Analysis F presents the prediction of FSG by GMA; zero-order prediction ranged from .46 (AM) to .77 (ET) with a sample-weighted mean of .58.

The last two columns in Table 5 indicate that the differences between these values are small and do not vary much between schools. Again, because the pattern of results is similar across the 10 schools, averages are informative. The average difference between D and E is -.01, between D and F is -.02, and between E and F is -.01.

Thus, on average, prediction by the GMA construct is better than prediction by weighted combinations of either the general or specific aptitude constructs, although by very small margins. This finding is very close to the GMA theory prediction of equal predictive power, but very different from the prediction of specific aptitude theory. Moreover, the small predictive advantage gained by including specific aptitude tests, shown in Table 4 and in prior research (Ree & Earles, 1991; Ree et al., 1994), completely disappears in these construct-level analyses. The hypothetical illustration presented earlier from Schmidt et al. (1981) presents the conceptual explanation for this reversal. The presence of measurement error causes measures of specific and general aptitudes to make contributions (however small) to prediction that do not exist at the construct level.

Table 6 summarizes the SEM analyses. Analysis G presents the three-factor general aptitude model (see sample model in Figure 2); Analysis H is the GMA model in which the three general aptitudes load onto GMA, and GMA predicts FSG (see sample model in Figure 1). Fit indices presented in this table demonstrate that the models fit the data. Sample-weighted mean fit indices for the three-factor and GMA models are, respectively: χ² to degrees-of-freedom ratios of 4.21 and 4.35; RMSEAs of .06 and .06; ECVIs of .10 and .10; AICs of 75.86 and 81.80; and NNFIs of .98 and .98. While the NNFI values are very high and suggest excellent fit, the RMSEAs include both good (< .05: MM, ST) and fair fit (between .05 and .08: AE, AM, BT, ET, EM, OS, RM, and SM). In only one case (RM with the GMA model, RMSEA = .087) does an RMSEA value exceed the .08 threshold for fair fit. Thus, both models fit the data reasonably well, and these minor differences aside, the models fit all 10 schools.

The general trend in all of these indices is for the three-factor model to fit the data better, but the differences are small enough to be considered negligible. Thus, despite the addition of a latent factor and constraints imposed by forcing the general aptitudes to load on that factor, the GMA model fits the data as well as the three-factor model.

As would be expected, predictive validities using SEM are similar to the corrected analyses presented in Table 5. The small difference obtained (-.01) across schools slightly favors the GMA model over the general aptitude model. Again, the difference is too small to be of importance. Moreover, as with the construct-level analysis reported in Table 5, the results in Table 6 do not show any incremental prediction from including specific or general aptitudes beyond GMA.

Finally, analyses were conducted to examine possible differences in validities based on complexity of training. Results from Tables 4–6 all reveal what appear to be two clusters of predictive validities. Meta-analysis of schools with lower validities (AE, AM, BT, MM, RM, and SM) and higher validities (ET, EN, OS, and ST) reveals sample-size weighted mean validities of .45 (90% confidence interval [CI] .427, .477) and .67 (90% CI .646, .702) from Analysis C reported in Table 4. These confidence intervals do not overlap, indicating that the population values differ substantially across the two clusters of schools. Moreover, the magnitude of the validity increases as the length and expert-rated ability requirements of the training program increase. The zero-order correlation between length of school and GMA validity was .77; the mean length of school is 75 days in the high-validity cluster and 47 days in the low-validity cluster. The zero-order correlation between ability requirements and GMA validity was .33; the mean ability requirement is 4.00 in the high-validity cluster and 3.50 in the low-validity cluster. Thus, based on these correlations, it appears that the greater the complexity of the training program, the higher the validity for GMA.
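The sample-size weighting used in these cluster summaries is an ordinary weighted mean of the school-level validities. A sketch with made-up values (the r's and n's below are not the actual school figures):

```python
def weighted_mean_validity(validities, ns):
    """Sample-size weighted mean of school-level validity coefficients."""
    return sum(r * n for r, n in zip(validities, ns)) / sum(ns)

# Hypothetical cluster of three schools:
rs = [0.42, 0.46, 0.48]
ns = [500, 300, 200]
print(round(weighted_mean_validity(rs, ns), 3))  # 0.444
```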

Finally, the pattern of results for specific aptitude vs. GMA theory was not affected by either apparent training complexity or magnitude of the GMA validity. Differences in predictive validity across general aptitude and GMA models reported in Table 4 (observed predictor score regressions) varied from only .00 to .04, and these differences were identical in the high complexity/high GMA jobs (average difference = .01) and the low complexity/low GMA jobs (average difference = .01). Differences in general aptitude and GMA predictive validity reported in Table 5 (corrected predictor score regressions) varied from -.04 to .04, and again these differences were similar in the high complexity/high GMA jobs (average difference = -.02) and the low complexity/low GMA jobs (average difference = .00). Finally, in the SEM analyses reported in Table 6, differences in general aptitude and GMA predictive validities only varied from -.02 to .01, and were similar in the high complexity/high GMA (average difference = -.02) and low complexity/low GMA (average difference = -.01) clusters. The predictions from GMA theory were supported across jobs of varying complexity and GMA demands.

**Discussion**

This study overcame and avoided the methodological deficiencies of previous studies on the question of incremental prediction of specific aptitudes over GMA. More specifically, large sample individual jobs (rather than job families) that varied in complexity were examined, and measurement error corrections were made using multiple approaches. Given the importance and plausibility of specific aptitude theory, testing the theory under optimal conditions with the most accurate available statistical techniques is necessary to advance our understanding of the link between mental abilities and training performance.

With the improved methods used in this study, specific ability tests provided little if any incremental validity in the prediction of training success over GMA. This finding held through three different approaches to the data analysis – regression based on observed predictor scores, regression based on construct scores, and SEM (which is another method of examining relationships among construct scores). Notably, the 2% incremental prediction found in prior research (e.g., Ree & Earles, 1991) effectively disappeared when corrections for measurement error were performed in the latter two analyses. In combination with prior research, these results provide strong evidence against specific aptitude theory.

**Theoretical Implications**

These results suggest that specific aptitude theory should not be retained in the prediction of global measures of training success. These results do not go so far as to indicate that specific aptitudes have no psychological significance or meaning, but they constitute compelling evidence that learning for a variety of jobs is predominantly determined by GMA, not by specific aptitudes. That is, they show that the specific factors in the aptitude measures (i.e., the factors measured by the specific aptitude tests beyond GMA) do not contribute to prediction. Likewise, the components of the general aptitudes (V, Q, and T) that go beyond merely reflecting GMA do not contribute to prediction. This is the major theoretical implication.

These results may help explain recent meta-analytic findings. Based on data from European countries, Salgado et al. (2003a) showed that GMA has higher predictive validities for training performance than specific ability tests. The mean estimated operational validity of GMA was .54 (K = 97, N = 16,065), whereas the validities for more specific aptitudes varied from .48 to .25. Viewed from the perspective that specific aptitudes are imperfect indicators of GMA, specific aptitude tests predict some but not all of the variance in training performance that can be predicted by GMA. Because each specific aptitude test is a relatively poor indicator of GMA, predictive validities for specific aptitudes should always be lower than when a more complete measure of GMA is used.

Specific aptitude theory can be viewed as a special case of the theory that matching predictor and criterion constructs will lead to higher validity. For example, specific aptitude theory says that if a job involves reading and writing, a verbal ability test will have higher validity than a GMA test, because the verbal construct is predominant in both the predictor and the criterion, producing a match. Conversely, the reason specific aptitude theory predicts lower validity for GMA is that the construct of GMA is quite different from, and does not ‘‘match,’’ the construct of verbal ability that appears to be required for the job. So it is clear that the results of our study contradict the predictor-criterion construct matching theory in the area of mental abilities. However, for other predictors, such as job knowledge tests, that theory may be valid. For example, of several job knowledge tests, the most valid one is likely to be the one whose content most closely matches the content of the job.

The above example raises the question of the precise nature of the difference between specific aptitude and job knowledge tests. The key difference is their relationship with GMA. Specific aptitude measures have higher GMA loadings than job knowledge tests in most populations. Job knowledge tests are expected to have high GMA loadings in groups in which all members have had equal opportunity to learn the knowledge content. This would be true, for example, if the subjects were incumbents who had all been on the job the same length of time. However, such groups are rare; in most applicant samples, individuals differ widely in previous opportunity to learn the specific content of the knowledge tests, so score differences are due less to GMA and more to differences in previous opportunity, resulting in lower GMA loadings. By contrast, measures of specific aptitudes have high GMA loadings in all groups. This, rather than the content of the test, is the critical difference between specific abilities and job knowledge tests. For example, in many previous studies using the ASVAB, the subtest General Science has been found to be an excellent measure of verbal aptitude and to have a high GMA loading. Although it is ostensibly a measure of knowledge, the knowledge domain measured is quite broad, every individual has had substantial opportunity to learn this general knowledge, and the knowledge is conceptual in nature. Hence it serves as an excellent measure of a specific aptitude and has a high GMA loading.

These findings also have implications for the question of whether the predictive validity of GMA varies across training for different types of jobs. In contrast to some prior research (e.g., Hunter & Hunter, 1984; Jones & Ree, 1998), validities in this set of jobs varied considerably. Specifically, the largest validity from the SEM analysis (.78) was 70% greater than the lowest validity (.46). Moreover, there was a clear pattern to these differences; the predictive validities increased substantially as the complexity of the training increased. Of course, the measures of training complexity were indirect because a more direct measure could not be obtained for these data. However, both measures of complexity indicated the same results, thus raising confidence in our conclusion. We can safely conclude that the validity of GMA is high across all programs but not identical in magnitude.

**Practical Implications**

The primary practical implication of this finding is that weighted combinations of specific aptitude tests, including those that give greater weight to certain tests because they seem more relevant to the training at hand, are unnecessary at best. At worst, the use of such tailored tests may lead to a reduction in validity. For prediction of training success, a good measure of GMA is likely to yield prediction at least as good as that produced by multiple aptitude measures in a regression equation. This point is particularly useful for researchers who seek to control for abilities relevant to learning when studying other constructs, such as motivation to learn (Colquitt et al., 2000). In such situations, a GMA measure can be considered sufficient for controlling for mental abilities, at least when examining overall training success.

It is worth revisiting the distinction between training and job performance and its relevance for this study. While these findings specifically address training performance, they have implications for understanding and predicting job performance as well. Prior evidence strongly suggests that training performance and job performance are correlated, with training performance and associated job knowledge serving as a meaningful determinant of job performance (Hunter, 1986). Moreover, many authors argue that with increasing complexity and dynamism of work today, workers are required to continually update their skills by training and other less formal means of learning (e.g., Kraut & Korman, 1999). From this vantage point, the ability to learn is not only a predictor of job performance, but arguably an increasingly important component of it as well. Consequently, we believe these findings would be replicated if conducted with job performance measures as dependent variables.

**Limitations and Future Research**

Despite the large sample sizes, this study does have limitations. First, analyses were not conducted on a representative sample of Navy or, for that matter, civilian jobs. Data from large sample jobs were specifically requested from the Navy in order to reduce sampling error. Future research on mental ability and training performance could seek a broad set of representative jobs, including some jobs that are less technical and less heavily dependent on GMA (e.g., basic customer service jobs). However, prior research that uses a broader sample of jobs has found similar results with regard to specific aptitude theory (Hunter, 1986; Ree & Earles, 1991), so the conclusions are unlikely to differ. Second, some information about the jobs and schools was missing, and as a result it was necessary to use indirect and sometimes incomplete measures. Detailed information about each school would have been useful to determine if some feature of a school other than complexity affected validities. Snow (1989), for example, indicates that the largest aptitude-by-treatment interaction found in educational research is for intelligence and structure, with less intelligent students benefiting much more from structured material than more intelligent students. Military training is highly structured and developed using a standardized instructional design process, thus it is unlikely that schools had vastly different instructional characteristics. Nevertheless, it is possible that differences in instructional process across schools may have played a role in the observed effects. Third, this study uses only a global indicator of training success – final course grade. Future research might benefit from decomposing final grade into different learning outcomes, such as the acquisition of knowledge, acquisition of skill, and socialization to desired attitudes and values (Kraiger, Ford, & Salas, 1993). 
Although prior research does not suggest that specific aptitudes will provide better prediction of narrower training criteria than GMA (Duke & Ree, 1996; Olea & Ree, 1994), future research could examine even more fine-grained measures of training success, particularly desired attitudes and values which have received relatively little research attention.

**Conclusion**

Specific aptitude theory has intuitive appeal because it suggests that each individual may have personal strengths with regard to mental abilities that allow him/her to succeed at different learning tasks. Despite its appeal, the data presented here do not support the theory. Optimally weighted combinations of specific aptitudes that serve as indicators of GMA do not provide incremental validity over GMA for the prediction of training success. Moreover, the GMA causal model fits observed data across jobs as well as the specific aptitude model. Thus, we conclude there is no reason to expect that tailored test composites will be more useful than an overall measure of GMA in predicting overall training success.

**1. Introduction**

It is usually believed that as we move up the socio-economic (SES) ladder, racial gaps should shrink, mostly because lower-scoring groups are supposed to be affected by poor environmental and cultural influences. It has been argued, for example, that the magnitude of cultural differences correlates with the magnitude of racial differences (Kan et al., **2013**), although their variable of interest, “cultural load”, is **questionable**. This matters because culture also varies within SES levels irrespective of race (see Murray, **2012**, for illustration). This would imply that high-SES families, living in more prosperous areas, are culturally advantaged, notably through peer effects in everyday life, not only in schools. At the same time, arguing that blacks are affected by a different kind of environment, as a way of countering the positive BW-SES interaction, implies that the default hypothesis (Jensen, **1998**, pp. 443-460) must be rejected. Empirically, however, the default model is found to be tenable; that is, the within-group and between-group environmental influences share the same roots (Rowe et al., **1994**, **1995**; Rowe & Cleveland, **1996**; see also Dolan, **2000**; Dolan & Hamaker, **2001**; and Lubke et al., **2003**, on the tenability of measurement equivalence).

Previously, Jensen (**1973**, pp. 241-242; **1980**, p. 44) provided some evidence of a rather strong positive race-SES interaction. The evidence from the most recent survey data reveals that while the race*SES interaction is real, the effect is not as large as it appeared in Shuey’s (1966) data, cited by Jensen, where the BW gap nearly doubled.

**2. Technical notes**

**2.1 Data**

**GSS**: the Wordsum 10-item vocabulary test is used as a proxy for verbal IQ. An example of the test items can be found here. Not only is the test short, but Wordsum’s reliability is also rather low (~0.60), so it is better not to over-generalize whatever the result. Regarding the SES index, in the **GSS codebook**, we read: “SEI scores were originally calculated by Otis Dudley Duncan based on NORC’s 1947 North-Hatt prestige study and the 1950 U.S. Census. Duncan regressed prestige scores for 45 occupational titles on education and income to produce weights that would predict prestige. This algorithm was then used to calculate SEI scores for all occupational categories employed in the 1950 Census classification of occupations. Similar procedures have been used to produce SEI scores based on later NORC prestige studies and censuses.” (p. 2216). The sampling weight in use is WTSSALL. Given the discussion in the **GSS codebook, Sampling Design & Weighting**, Appendix A, p. 2110, WTSSNR might be better, but it applies only to the years 2004+. Before 2004, all cases were given a weight of 1; in other words, no weight at all. Finally, I restricted the sample to people aged between 23 and 67 years because outside this range I noticed that the verbal score is extremely low, for unknown reasons.

**Add Health**: the test used is again a vocabulary test, the AHPVT, an abbreviated version of the PPVT, administered in Wave 1 (mean age = 16) and Wave 3 (mean age = 22), and considered by Jensen (**1973**, **1980**) a “parody” of a culturally biased test. Nevertheless, Jensen also reported the absence of racial bias in the PPVT. The SES variable used is PA12, the highest education attained by the parent (respondents: female=5125, male=360).

**NLSF**: the tests used presently are the SAT composite (verbal+quantitative) and the ACT composite. The SES measure used is the parents’ household income. Unfortunately, I was unable to find the age variable and the sampling weight; thus, the results should not be over-generalized.

**HSLS 2009**: The test variable is the X1 mathematics theta score. My SES variable is a composite (5-category) variable calculated using the parents/guardians’ education (X1PAR1EDU and X1PAR2EDU), occupation (X1PAR1OCC2 and X1PAR2OCC2), and family income (X1FAMINCOME). The weight used is the parent weight, W1PARENT, because I use children’s characteristics in combination with parents’ characteristics.

**NLSY79 and NLSY97**: the test used is, as usual, the ASVAB. Because the subtest variables were available, I created g-score and non-g-score variables for comparison purposes. PIAT math scores were also available in the NLSY97. The SES measures are the parents’ highest grade attained, parental occupation (see **Attachment 3**), and family income. The sampling weight used is not the cross-sectional weight but the panel longitudinal weight (e.g., R0614600 for the NLSY79). What needs to be remembered is that when we use data for multiple years (e.g., 1997, 1999, 2001) we need to use a longitudinal weight for the last of those years (i.e., 2001). On the other hand, the NLSinfo site recommends the use of the so-called “customized weights” available on this **webpage**. Such a longitudinal weight can be obtained by selecting the “all years” option. However, when I compared these newly created variables with the regular sampling weights a few times, the results from regressions, correlations, and means were not different.

**CNLSY79**: this set can be found along with the NLSY79 and NLSY97. Five tests were available: PIAT math, PIAT reading recognition, PIAT reading comprehension, the Peabody Picture Vocabulary Test (revised form L), and the Wechsler Digit Span subtest. The parent SES variable is the highest grade attained by the respondent’s mother. Concerning the sampling weight, as the CNLSY79 **user guide** (p. 27) makes clear, there is no longitudinal weight; when using data involving different years, a custom weight program should be used.

**ECLS-K**: Two tests are available, reading and math. I used the IRT scale scores for mean comparison analyses and the T scores (i.e., standardized scores) for regression analyses, because the IRT variables are not normally distributed, and even when using square and SQRT transformations, the T scores remain much more normally distributed. On the other hand, the IRT scores have a great advantage over T scores because they can be compared longitudinally across the different waves/rounds. For my SES variables, I used WKSESQ5 (5 categories) and WKSESL (continuous) for mean comparison and regression analyses, respectively; these are composites derived from the logarithm of WKINCOME, WKMOMED, WKDADED, WKMOMSCR (mother’s occupation GSS prestige score), and WKDADSCR (father’s occupation GSS prestige score). Both BY and WK stand for “base year”, or C1+C2. Also, because I use children’s scores in combination with parents’ characteristics, I must use the parent weights. In the **ECLS-K base year data files and electronic codebook** (p. 4-11, or p. 73 in the tab) it is clearly stated:

C1CW0 : fall-kindergarten direct child assessment data and child characteristics, alone or when in conjunction with teacher/classroom data

C1PW0 : fall-kindergarten parent interview data (alone or in combination with child assessment data)

C1CPTW0 : fall-kindergarten direct child assessment data combined with fall-kindergarten parent interview data and fall-kindergarten teacher data

C stands for children, P for parents, T for teacher, and C1 for round 1. To be more precise, C1, C2, C3, C4, C5, C6, and C7 represent fall-kindergarten, spring-kindergarten, fall-first grade, spring-first grade, spring-third grade, spring-fifth grade, and spring-eighth grade. More information **here**. Note that if we were analyzing the data longitudinally, for instance variables at rounds 1, 2, and 3, we must use the longitudinal sampling (panel) weight variable C123CW0 or C123PW0, depending on the kind of analysis we need (see the **user guide** for the base year (BY), p. 82, and the **user manual** for third grade, p. 9-5 or 160). With C123CW0, weights are nonzero if assessment data are present for all three rounds; with BYCW0, the weight is nonzero for cases having data for both C1 and C2; with C1_4PW0, the weight is nonzero if parent interview data are available for all four rounds listed; and so on.

**2.2 Analyses**

In conjunction with the mean comparison analyses, for which I compute the SD differences, I conducted some multiple regression and ANCOVA analyses. The goal of the regression was to investigate the plausibility of an interaction term between race (BW) and SES. Say, in model 1, we include age, gender, race, and SES as predictors of IQ scores, and in model 2, we just add the race*SES interaction term, which is computed simply by multiplying the race variable (i.e., column) by the SES variable (i.e., column). But first, how should this topic be introduced? **Phil Birnbaum**, for instance, explains regression interaction terms as follows:
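The mechanics of model 2 can be sketched with simulated data (numpy; the variable names, coding, and effect sizes below are all made up for illustration, not taken from the surveys):

```python
import numpy as np

# Simulated data (hypothetical values throughout).
rng = np.random.default_rng(0)
n = 5000
race = rng.integers(1, 3, n).astype(float)  # coded 1 and 2
ses = rng.normal(0.0, 1.0, n)
# True model includes an interaction: the race gap widens as SES rises.
iq = 100 + 5 * race + 3 * ses + 1.5 * race * ses + rng.normal(0.0, 10.0, n)

# Model 2: main effects plus the product column race*SES.
X = np.column_stack([np.ones(n), race, ses, race * ses])
beta, *_ = np.linalg.lstsq(X, iq, rcond=None)
print("interaction coefficient:", beta[3])
```

The last element of `beta` estimates the interaction term; with the survey data, the same product column is simply built from the actual race and SES columns.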

Suppose I want to figure out if stimulants help a student do better on an exam. So I run a regression to predict the exam score. I use a bunch of variables, like age, time studying, performance on other exams, grades on assignments, number of classes missed, and so on, but I also include a dummy variable for whether the student had (both) coffee and Red Bull before the exam.

After the exam, I run the regression, and I find the coefficient for “both coffee and Red Bull” is -3, and statistically significant. I conclude that if I were a student, I might consider not taking both coffee and Red Bull.

Fair enough, so far.

But, now, suppose I do the same experiment again, but, this time, I add a couple of new dummy variables — whether or not the student had coffee (with or without Red Bull), and whether or not the student had Red Bull (with or without coffee). I don’t remove the original “had both” variable — that stays in.

I run the regression again, and, again, the coefficient for “both coffee and Red Bull” comes out to -3 — exactly the same as last time. What am I able to conclude this time about the desirability of drinking both coffee and Red Bull?

The answer: almost nothing. That coefficient, *on its own*, does not give much useful information at all about how performance is affected by the coffee/Red Bull combination.

…

In a regression result, the simplest way to interpret the coefficient of a dummy variable is, “what happens when you change the value from 0 to 1 and leave all the other variables the same.” In the first regression, that works fine. But in the second regression, it can’t work. Because if you change CxR and leave everything else constant, your data and regression become inconsistent. You wind up with CxR being 1 (meaning both coffee and Red Bull), but you’ll have either C=0 (no coffee) or R=0 (no Red Bull). Those three variables are tied together, so you can’t just change CxR and leave the other two constant.

Put another way, there are four possible combinations for C, R, and CxR:

C = 0, R = 0, CxR = 0

C = 1, R = 0, CxR = 0

C = 0, R = 1, CxR = 0

C = 1, R = 1, CxR = 1

You can’t change CxR from 0 to 1, and still have a combination that’s on the list. So the “change CxR but leave all other variables the same” strategy no longer works. If you change CxR from 0 to 1, you’ll have to change one of the other variables, too.

Which ones should you change? It depends what question you’re trying to answer. For example, suppose you do the regression and you get these coefficients:

C = -5

R = -10

CxR = -3

If you’re trying to ask, “what’s the effect of taking coffee alone versus nothing at all,” it’s like asking, “what is the effect of changing (C=0, R=0, CxR=0) to (C=1, R=0, CxR = 0)?” The answer is -5.

If you’re trying to ask, “what’s the effect of taking both coffee and Red Bull versus nothing at all?”, it’s like asking, what’s the effect of changing (C=0, R=0, CxR=0) to (C=1, R=1, CxR =1)?” The answer is -18.

And so on. But none of those kinds of questions lead to the answer of -3 points, because none of these questions can be answered by changing CxR alone.

So what does the -3 represent? The non-linearity of the coffee and Red Bull variables. Or, put another way, the “increasing or diminishing returns” to combining coffee and Red Bull. Or, put a third way, the effects of the *interaction* of coffee and Red Bull, independent of their individual effects. Or, put a fourth way, the amount of effects *duplicated* from both coffee and Red Bull, that you can’t count twice even if you take both drinks.
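The arithmetic in the quoted example can be made concrete with the stated coefficients (C = -5, R = -10, CxR = -3):

```python
def predicted_effect(c, r):
    """Predicted change in exam score from the quoted regression:
    -5*C - 10*R - 3*(C*R)."""
    return -5 * c - 10 * r - 3 * (c * r)

print(predicted_effect(1, 0))  # coffee alone vs nothing: -5
print(predicted_effect(0, 1))  # Red Bull alone vs nothing: -10
print(predicted_effect(1, 1))  # both vs nothing: -18, not -15
```

The gap between -18 and the additive -15 is exactly the -3 interaction: the non-additivity, not the effect of the combination itself.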

A race*SES interaction, in that case, should be interpreted as evidence of a gap increase, on the condition that black is coded 1 and white 2 (that is, a positive coefficient means an advantage for the higher value over the lower value of the race variable).

That being said, it is highly recommended to remove extreme low scores (e.g., -3 SD or less) when performing regressions, because such outliers would likely attenuate the beta coefficients. A large sample size, on the other hand, may attenuate the impact of such outliers. Field (**2009**) shows how to detect non-normally distributed variables (pp. 137-139) and how to deal with outliers (pp. 102-103), although in the latter case it is not necessarily justified to systematically remove the outliers (pp. 215-219). In the case of IQ scores, however, I believe it is justified to systematically remove extreme low scores; it is probably wiser to stay as far as possible from the “mental retardation” level.
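The trimming rule can be sketched in a few lines of pure Python (the cutoff and the scores are illustrative; note that a sufficiently extreme outlier also inflates the SD itself, so in small samples a fixed floor on the IQ metric may work better):

```python
def trim_low_scores(scores, cutoff_sd=3.0):
    """Drop scores more than `cutoff_sd` standard deviations below the mean."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((x - mean) ** 2 for x in scores) / n) ** 0.5
    floor = mean - cutoff_sd * sd
    return [x for x in scores if x >= floor]

scores = [95, 100, 105] * 7 + [40]  # 22 IQ-like scores, one extreme low value
print(trim_low_scores(scores))      # the score of 40 is removed
```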

Next, the **univariate ANCOVA**. It is simply an extension of univariate ANOVA, with the difference that it takes into account the impact of some covariates (e.g., gender, age) when comparing mean differences among groups. It can also be useful for testing interaction effects (without creating the interaction variable). As this video by **how2stats** explains, it is wrong to think of ANCOVA as an ANOVA of residualized variables (e.g., IQ scores with age/gender/SES regressed out), and the latter should not be used in place of ANCOVA.
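The adjustment ANCOVA performs can be sketched for the simplest case of two groups and one covariate: the raw mean difference is corrected by the pooled within-group slope times the covariate imbalance (toy numbers, purely illustrative):

```python
def ancova_adjusted_diff(y1, x1, y2, x2):
    """Adjusted mean difference between two groups, controlling for one
    covariate: (mean_y1 - mean_y2) - b_w * (mean_x1 - mean_x2),
    where b_w is the pooled within-group slope."""
    def mean(v):
        return sum(v) / len(v)
    def sxy(x, y):
        mx, my = mean(x), mean(y)
        return sum((a - mx) * (b - my) for a, b in zip(x, y))
    b_w = (sxy(x1, y1) + sxy(x2, y2)) / (sxy(x1, x1) + sxy(x2, x2))
    return (mean(y1) - mean(y2)) - b_w * (mean(x1) - mean(x2))

# Two groups with identical raw means but unequal covariate levels:
print(ancova_adjusted_diff([10, 12, 14], [0, 1, 2], [10, 12, 14], [2, 3, 4]))  # 4.0
```

Here the raw group difference is zero, but after adjusting for the covariate imbalance the groups sit on parallel lines 4 points apart, which is what the adjusted difference recovers.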

**3. Results**

Available data for the present analysis:

**Evidence of race-SES interaction from various survey data (EXCEL)**

Syntax used for the present analysis:

**Racial IQ gap by SES in the CNLSY79 (SPSS syntax)**

**Racial IQ gap by SES in the NLSY79 (SPSS syntax)**

**Racial IQ gap by SES in the NLSY97 (SPSS syntax)**

**Racial IQ gap by SES in the ECLS-K (SPSS syntax)**

**Racial IQ gap by SES in the HSLS 2009 (SPSS syntax)**

**Racial IQ gap by SES in the NLSF (SPSS syntax)**

**Racial IQ gap by SES in the Add Health (SPSS syntax)**

**Black-White gap over time and by SES (SPSS syntax) in the GSS and other gaps by SES from other survey data**

For those who want to look through the numbers, they clearly speak for themselves. To summarize: there is no gap increase in the NLSF, CNLSY79, and Add Health; a slight gap increase in the NLSY79, NLSY97 (for ASVAB/g-scores, but not for PIAT scores, for which the BW*SES interaction is extremely large) and ECLS-K; and a non-trivial gap increase in the GSS and HSLS2009. The mean scores comparison is very consistent with the interpretation of no gap decrease at higher SES levels.

One comment on the NLSY79 can be added. Herrnstein & Murray (1994, p. 288) displayed a graph showing a rather strong positive BW*SES interaction using the same data set, although it is mostly explained by the lowest SES decile, which probably consists of a smaller sample. The authors (Appendix 2, or pp. 598-599 in my edition with a new afterword by Charles Murray) apparently used a composite score of mother’s and father’s education plus family income and parental (mother+father) occupation, with a mean of 0 and SD of 1. I don’t know how they collapsed the variables, since they are not measured on the same scale. Perhaps one way to compute such a variable is to factor analyze them and create a sort of general latent SES factor. In fact, this is exactly what I did: I averaged the mother’s and father’s occupational status and grade level, added the family income, and then factor analyzed (using PAF) these 3 variables. The regression shows a non-trivial interaction term (0.155) for the AFQT 2006-revised but none at all for either g-scores or non-g scores.
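One way to build such an SES composite, assuming nothing about Herrnstein & Murray's actual procedure, is to standardize the indicators and take the first principal component (a rough stand-in for the PAF step described above; the data below are invented for illustration):

```python
import numpy as np

def ses_composite(indicators):
    """First principal component of standardized SES indicators."""
    Z = (indicators - indicators.mean(axis=0)) / indicators.std(axis=0)
    R = np.corrcoef(Z, rowvar=False)      # correlation matrix
    vals, vecs = np.linalg.eigh(R)        # eigenvalues in ascending order
    load = vecs[:, -1]                    # leading eigenvector (loadings)
    if load.sum() < 0:                    # fix the arbitrary sign
        load = -load
    return Z @ load                       # composite scores

# Illustrative columns: parental education (years), occupational status,
# family income -- all hypothetical values.
data = np.array([[12., 40., 30000.],
                 [16., 60., 55000.],
                 [10., 35., 25000.],
                 [18., 75., 80000.],
                 [14., 50., 45000.]])
scores = ses_composite(data)
print(scores.argmax())  # row 3, the family highest on all three indicators
```

Unlike proper PAF, PCA does not partial out unique variance, but with three positively correlated indicators the two give very similar rank orderings.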

In parallel, ANCOVA is consistent with the means comparison. To illustrate, the UNIANOVA function shows the following profile plot:

The line under the graph indicates that the covariates are evaluated at a mean value of 1.49 for the gender variable (male=1, female=2) and 1982.01 for age: the procedure partials out the effects of gender and age by holding the two variables constant at these values.

Concerning the regression analyses, the interaction term effects are generally rather low, between -0.1 and +0.1, with more positive than negative interactions. There are two anomalies, however. The NLSF regression analysis is somewhat curious. There is a strong negative interaction term for the BW*SES variable concerning the ACT composite score, while means comparison shows no such race-SES effect. ANCOVA shows no strong evidence of a decreasing gap because of the large variability among the 11 categories of the ‘parents household income’ variable. The NLSY97 presents an even more curious anomaly. While a simple means comparison reveals a slight gap increase, the interaction between BW and PARENTEDUC (the 20-category parent grade variable) is strongly negative. On the other hand, when I use PARENTEDUC3 (a 3-category parent grade variable: low, medium, high), the interaction term becomes strongly positive. The same thing happened with ANCOVA: using PARENTEDUC, we notice a decrease in the gap, but not when using PARENTEDUC3. The only way I can make sense of it is that the 20-category variable had a lot of variability in the BW gaps among its numerous categories, while the 3-category variable improves reliability somewhat.

**4. Conclusion**

While I am still uncertain about the right explanation behind the somewhat positive BW-SES interaction, Jensen (**1973**, p. 119) thinks it is best explained in terms of differential black-white sibling regression to the mean, whereby the black-white sibling regression gap increases at higher levels of IQ.

**1. Introduction**

The dichotomy between crystallized g (considered as depending more on prior acquired knowledge) and fluid g (fluid reasoning, less dependent on scholastic knowledge) has not been widely considered in the many tests of Spearman’s hypothesis. In replying to Rushton, the reason Flynn focused on fluid g is that Flynn (**2000**, pp. 206-207) believes the traditionally used g-loadings in the Wechsler are biased in favor of the more crystallized tests (which is why he labelled the traditional g-loadings “Gc” or crystallized g, as opposed to “Gf” or fluid g). The same argument has been made by Ashton & Lee (**2005**). When Flynn argues that “ranking the WISC subtests in terms of fluid g would change the correlations from negative to positive”, this depends on the Gc and Gf correlations. Flynn obtained a different result using his g-fluid loadings simply because the Gc*Gf correlation tends toward zero. Rushton & Jensen (**2010**, pp. 12-14) noted this striking feature before. Additionally, Must (**2003**, p. 470) rightly pointed out that the dichotomy between fluid and crystallized g is superfluous to the extent that the g-loadings of the ASVAB subtests, a highly crystallized-type test battery, correlate substantially with reaction time measures, a prototypical measure of fluid intelligence. The more complex the reaction time (RT) measures, the higher the correlation with the ASVAB g-factor (Vernon & Jensen, **1984**). So, it would appear as if the more crystallized tests were also the more fluid tests at the same time. More striking is that Jensen (**1998**, p. 124) argued that Gf and Gc are about equally heritable, as Flynn agreed (but see Davies et al., **2009**, for evidence of higher Gf heritability). In any case, Flynn’s result is not necessarily more valid than Rushton’s, and heritability alone tells us nothing about this controversy.

As a technical note, Flynn (**2000**, p. 207) explained that the correlations of the Wechsler subtests with the Raven matrices can be seen as an index of g_Fluid (Gf) loadings. Jensen (**1980**, p. 632, **1998**, p. 38) indeed sees the Raven as a marker test for Spearman’s g. Besides, it is widely recognized that fluid tests are less culturally influenced than crystallized tests, one reason why Jensen argued that fluid-type tests like the Raven, which “measures virtually nothing other than relation eduction”, can be seen as the purest form of Spearman’s g. It was because FE gains were so large on fluid-type tests that Flynn investigated this relationship. The likely reason is increasing test familiarity favoring gains on fluid-type tests (Kaufman, **2010a**, **2010b**), tending to inflate scores in recent cohorts. The huge Raven gains, for example, violate measurement equivalence, suggesting that the difficulty parameters of these tests are altered over time (Fox & Mitchum, **2012**).

Flynn argued that Rushton’s method must be revised on the grounds that “Rushton entered five data sets for IQ gains over time, that is, data for each period and each nation. As Jensen (**1998**, p. 30) points out, this is a mistake: multiple data sets for a single variable are likely to have more in common with one another than they do with anything else; therefore, the factor analysis will be biased towards isolating them as a separate cluster.” (p. 212). This sounded interesting, but I have to disagree. If the variables cluster differently, this is because they have different patterns of correlations. I used my own data set (see the Method section) to test it, putting together my 5 black-white difference vectors, my 4 crystallized g and 2 fluid g vectors, as well as inbreeding depression and 4 Flynn effects (throwing out the Scotland sample due to missing values). Because I used the “Dimension Reduction” SPSS procedure, one subtest (i.e., Digit Span) was obviously missing from the intercorrelation matrix. Still, the clustering was consistent with Rushton’s. After reducing the variables to the strict minimum, following Flynn’s recommendations, my correlation matrix (see Appendix) shows the same pattern of clustering.

There is no evidence that the multiplicity of variables tends to isolate them as separate clusters. Concerning the BW-WISC difference, it was just the opposite. A possible reason is that the German and Austrian gains had some modest correlations with the BW differences. Using the FE averaged gain tends to favor a separate clustering of BW and FE, because the FE averaged gain has a very low correlation with BW. Multiple data sets do not appear to affect the clustering of the variables. Instead, the clustering of any given variable depended on the pattern of correlations it shared with all the other variables. The reason Flynn’s PCA and Rushton’s PCA diverge so widely is solely Flynn’s g_Fluid variable, which correlates substantially with FE gains and the BW gap while correlating only moderately with inbreeding depression.

**2. Method**

Data:

**Flynn contra Rushton on PCA : A failed replication (EXCEL)**

All the computations have been summarized in the above XLS. Only BW differences, g-loadings, heritability, and shared and nonshared environment have been corrected for the subtests’ differing reliabilities. Of particular importance, it must be recalled that when mental retardation (MR) has a negative sign in its correlations, this means that its depressive effects are in fact associated with the variable in question (e.g., g-loadings). So, for example, when entering MR variables in a factor analysis, all the signs, negative and positive, need to be reversed.

To briefly introduce the data, I have located three samples with correlations of the Wechsler (WAIS) subtests with the Raven matrices: Vernon (**1983**), Rijsdijk (**2002**), and Johnson & Bouchard (**2011**). This is all I found. The total N was 420, compared to a total N of 483 for Flynn’s samples (N=5), which involve exclusively WISC subtests. My final g_Fluid variable is obtained by averaging the gf-WISC and gf-WAIS columns. This should improve the accuracy of the estimates, as shown by its highest correlations with all the other Gf column vectors.

Now, as a test of the robustness of a correlation, one method is simply to remove one subtest at a time from the column vector and see how the correlation behaves. Sometimes a near-zero correlation becomes very high, sometimes a high correlation tends toward zero, and sometimes (rarely, in fact) the sign of the correlation is totally reversed. This should occur more often when the subtest number is low.
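This jackknife-style check is easy to script; a minimal sketch (the vectors are invented to show how one influential subtest can flip the sign when the subtest number is low):

```python
import numpy as np

def leave_one_out_correlations(x, y):
    """Recompute r after dropping each subtest in turn."""
    return np.array([
        np.corrcoef(np.delete(x, i), np.delete(y, i))[0, 1]
        for i in range(len(x))
    ])

# Illustrative vectors: g-loadings and group-difference values for 8 subtests,
# with one deliberately discrepant subtest (low loading, large difference).
g = np.array([0.55, 0.60, 0.62, 0.70, 0.72, 0.75, 0.80, 0.30])
d = np.array([0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.90])

loo = leave_one_out_correlations(g, d)
# The full-vector r is negative, but dropping the discrepant subtest
# turns it strongly positive -- a total sign reversal.
print(np.corrcoef(g, d)[0, 1], loo.max())
```

With only 8-13 subtests, as in the Wechsler batteries, any single influential subtest can dominate an MCV correlation, which is the point of this robustness check.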

Generally, PCA and MCV tests share the same shortcomings, such as subtest number and vector reliability issues. My best estimate of g_Fluid vector reliability is 0.566, intentionally biased upward, which is much lower than the g vector reliability of about 0.86 (Jensen, **1998**, p. 383). As I have said many times in my previous **posts**, vector reliability is an artifact that must not be ignored: low reliability can decide both the sign and magnitude of the correlations, and this is more likely to occur when the subtest number is low. Flynn, on the other hand, never considered this issue, although he seems to acknowledge that he wasn’t very confident in his results. At the same time, he didn’t see PCA as an appropriate test beyond what it is designed to do. Flynn (**2000**, p. 214) was right on this account:

We could then state a strong conclusion: the method of taking x in conjunction with y and z, y and z known to be genetically influenced, and then showing they all have something in common, is simply bankrupt – as a method of diagnosing whether x is genetically influenced.

This is precisely why Kan et al. (**2011**, pp. 51, 82-83) were skeptical about the method of correlated vectors (MCV) and principal component analysis (PCA). Indeed, structural equation modeling (SEM) analyses do a much better job in demonstrating the genetic origins of racial differences (Rowe & Cleveland, **1996**; Jensen, **1998**, pp. 465, 467).

**3. Results**

To begin, I have to say that I computed some new variables, especially for h2 and c2, due to their low reliabilities, by averaging the WAIS and WISC estimates. Generally, these show higher correlations with nearly all other variables, probably because the averaging tends to enhance reliability. When h2 or c2 is mentioned below, it means I used the WAIS/WISC average.

As for the estimates, h2 reliability, as I computed it, is about 0.628 or 0.548 with and without including the column averages, respectively. These figures are biased upward because I intentionally left aside the near-zero reliabilities. c2 reliability, as I computed it, is about 0.480 or 0.436 with and without including the column averages, respectively. c2 reliability might not look so low at first glance, but this is only because there was a ton of near-zero and negative reliabilities I had to ignore in order to avoid over-correction.

Concerning the regression analyses, as one will see below, some results appear clearly ambiguous. They must be interpreted very carefully: two things must be kept in mind, namely the low vector reliability of some variables and the low subtest numbers.

**3a. Correlational analyses**

**G-fluid versus G-crystallized**. To begin, recall that WISC-Gf (using Flynn’s collection) and the g-loadings correlated at zero. My WAIS-Gf variable correlated with the g-loadings at about 0.70 and 0.80. As expected, the combined WISC/WAIS Gf correlates only modestly/acceptably with the g-loadings (around 0.37-0.50). This pattern is curious considering Flynn’s comment on the crystallization bias in the Wechsler battery. Still, our result shows the following: when a subtest tends to be more crystallized, it also tends to be more fluid. Because g_Fluid correlates with the g-loadings, and the more crystallized Wechsler subtests generally have the highest g-loadings, it follows that what Kan (**2011**) considered to be the more culture-loaded tests were also the less culture-loaded tests.

**Subtests’ cultural loadings**. Following Georgas et al. (2003), Kan (**2011**, pp. 41-46, 55-60) constructed a cultural loading column vector for the Wechsler subtests as well as for a large variety of cognitive tests. He found that g was correlated with cultural loadings, and both in turn correlated with the subtests’ heritabilities and black-white differences. It is surprising, still, that Kan’s cultural loading showed a near-unity Spearman correlation with the g-loadings. We should therefore expect g and cultural load to have correlations of the same magnitude with all other variables. Back to the present analysis: WISC_Gf is not correlated with culture load, but WAIS_Gf correlates substantially with culture load, and the combined WISC/WAIS Gf shows a modest/acceptable correlation with culture load. What was unexpected, as I have pointed out **previously**, is that culture load is not correlated with shared environment (c2), nonshared environment (e2), or adoption gains. It was, however, highly correlated with inbreeding depression and negatively correlated with mental retardation (MR). Generally, culture load mimics almost perfectly the heritable-g variables.

**Black-white differences**. A finding of interest is definitely the correlation between g_Fluid and the BW difference (WISC and WAIS). The correlations lie around 0.70 and 0.80, higher than the g*BW correlations. Even meta-analytic studies (Hu, **Sept.21.2013**; Dragt, **2010**) show that the g*BW correlation is about 0.70, 0.80 or 0.90 (depending somewhat on the IQ batteries) after correcting for the statistical artifacts known to moderate the correlations, such as the reliability of the two vectors, range restriction of g-loadings, and deviation from perfect construct validity. But because g_Fluid reliability is lower than that of g, the Gf*BW correlation must be much higher than g*BW. One might wonder, however, why we should give any weight to these results, given the variable’s low reliability. The reason is that, looking at my data, I noticed that all of the individual samples (N=8) showed substantial positive Gf*BW correlations. None of them were negative or tended toward zero. Therefore, while acknowledging Gf’s poor reliability, we can be confident that the sign of the correlation between Gf and the black-white difference really is positive, thus rendering the artifact corrections justifiable. So then, as an example, if I correct solely for Gf reliability, it is obvious that the true correlation with BW would be near unity; taking 0.75 as an average, we obtain 0.75/SQRT(0.566) = 0.997. If a corrected correlation exceeds one, it should simply be interpreted as a true correlation of unity.
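The correction used here is the standard Spearman disattenuation formula; a minimal sketch reproducing the 0.75/SQRT(0.566) computation (correcting for the Gf side only, as in the text):

```python
import math

def disattenuate(r, rel_x, rel_y=1.0):
    """Correct an observed correlation for unreliability in one or both
    vectors, capping the result at 1 (a corrected r above 1 is read as a
    true correlation of unity)."""
    return min(r / math.sqrt(rel_x * rel_y), 1.0)

# Observed Gf*BW correlation of 0.75, Gf vector reliability of 0.566.
print(round(disattenuate(0.75, 0.566), 3))  # 0.997
```

Passing a second reliability (e.g., `disattenuate(0.75, 0.566, 0.86)`) corrects both vectors at once.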

This is important because, according to Kees-Jan Kan (2011), the black-white difference is larger in the more culture-loaded subtests, although evidence to the contrary had been summarized by Jensen (**1973**, pp. 297-312; **1980**, pp. 520-533, 546-575; **1998**, pp. 363-365, 370). But Kan himself acknowledged, like others (e.g., Jensen, Flynn, …), that fluid g is less amenable to cultural influences than crystallized g, and the Rindermann & Neubauer (2001) study suggests as much. To substantiate his argument, Kan (**2011**) derived a cultural (or informational) loading column vector to be correlated with other variables of interest (e.g., g-loadings, BW, h2, and so forth). As I have shown in my previous post, the g*culture_load correlation is near unity when using Spearman’s rho. The Pearson correlation is about 0.80. By way of comparison, WISC g_Fluid shows no correlation with cultural load whereas WAIS g_Fluid shows a correlation of 0.581. The combined WAIS/WISC g_Fluid correlates at about 0.293 with cultural loading. Although g_Fluid vector reliability is much lower than g vector reliability, the corrected g_Fluid would still have a lower correlation with cultural loading. Furthermore, looking at Kan’s (**2011**, Table 4.1) data, taken from Jensen (**1985**, Table 5), we notice that the Raven matrices exhibit the highest black-white difference as well as the highest g-loading when compared with all 13 WISC-R subtests. Among the 73 tests he analyzed (Table 4.1), the only cognitive tests having larger BW differences are the SAT, ACT, and ASVAB subtests. Furthermore, Fuerst’s (**Sept.20.2013**) regression analyses show that the correlation between BW difference and cultural loadings was in fact mediated by g, while BW*g was not mediated by cultural loadings.

Looking at SES effects (from biological versus adoptive parents) on adoptees, the variable SES_bio shows a strong correlation with the BW difference, while SES_adopt is only modestly correlated with it. The BW gap had no clear relationship with mental retardation. The BW-WISC and BW-WAIS gaps correlate modestly and strongly with h2, respectively (about 0.29-0.21 and 0.57-0.47). Nonetheless, the h2*BW correlations look much higher with Spearman’s rho (0.52-0.30 and 0.77-0.67). The BW difference was not correlated with either c2 or e2 (both Pearson and Spearman).

**Heritability (h2), shared (c2), nonshared (e2) environment**. As detailed **before**, the positive correlation between g and heritability is about 0.61 (uncorrected) or 0.55 (corrected), while g is negatively related to e2. Due to its low reliability, the true correlation between c2 and g is uncertain, although we note a very slight trend toward negative signs. By way of comparison, WAIS_Gf and WISC/WAIS_Gf correlate only modestly with h2 while being negatively correlated with c2 and e2. This is not to say that Gf is less heritable than Gc; again, the observed correlation must be interpreted with the low reliability of Gf in mind. Given a g*h2 of about 0.60, correction for vector unreliability yields 0.60/SQRT(0.86*0.628) = 0.816. Comparatively, given a Gf*h2 of 0.40, the correction for vector unreliability yields 0.40/SQRT(0.566*0.628) = 0.67. But it could be argued that the greater range restriction in Gf further attenuates this correlation relative to g*h2. Indeed, dividing the 0.072 SD of Gf by the ‘population’ SD of the Wechsler g-loadings, 0.128, yields a restriction ratio of 0.56, so a range-restriction correction would push Gf*h2 much higher still, toward the g*h2 value.
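The range-restriction step can be made explicit with Thorndike's Case II formula for direct range restriction (one standard correction, not necessarily the one the author had in mind). Applied on top of the unreliability-corrected Gf*h2 of 0.67, with the u ratio of 0.072/0.128 from the text, it pushes the estimate further toward the g*h2 figure, though the exact value depends on which corrections are stacked:

```python
import math

def correct_range_restriction(r, u):
    """Thorndike Case II: correct r for direct range restriction,
    where u = restricted SD / unrestricted SD."""
    return (r / u) / math.sqrt(1 + r**2 * (1 / u**2 - 1))

u = 0.072 / 0.128  # SD ratio for the Gf vector, about 0.56
r_corrected = correct_range_restriction(0.67, u)
print(round(r_corrected, 2))  # about 0.85
```

Note that these corrections multiply quickly: stacking unreliability and range-restriction corrections is exactly why "corrected" vector correlations approach unity so easily, and why they should be reported alongside the raw values.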

Furthermore, I correlated g_Fluid with heritability in the 42 test batteries of the Minnesota Twin Study data given by Johnson & Bouchard (**2007**, **2011**), using the Raven matrices’ correlations with the 41 remaining IQ (sub)tests to create the Gf (column) vector. The Gf*h2 and Gf*e2 correlations were of about the same magnitude as those displayed by g with h2 and e2, that is, around +0.50 and -0.50, respectively. Clearly, more (very) large IQ batteries are needed to get a clearer picture of all this.

Due to the ridiculously low reliability of the c2 vector, it was necessary to investigate the question of a possible Jensen effect (Gf) on c2. Thus I correlated all the Gf individual samples with all the c2 individual samples plus their averages, yielding an 11*11 matrix. The tendency is toward more negative signs than positive ones. The positive correlations come mainly from LaBuda (1987) and Owens/Sines (1970), the latter having a very small sample size. For WAIS/WISC_c2 averaged from all 8 samples, the general tendency was that of a negative Jensen effect. But more samples are still needed to clarify this issue.

With regard to the secular gains, h2 as well as c2 had negative correlations with FE gains, while e2 had positive correlations. My interpretation is that IQ stability is conditioned by both heritability and shared environment, while IQ instability is conditioned by nonshared environment (Bishop et al., **2003**; Beaver et al., **2013**). Heritability also contributes to some extent to IQ instability over the course of development. With regard to mental retardation, I have summarized the results elsewhere.

**Inbreeding depression**. It correlates substantially with g_WISC and g_WAIS but only modestly with all indices of g_Fluid, and the correlation goes down to zero when using Spearman’s rho. But again, because our g_Fluid vector reliability is so weak, we shouldn’t focus on this too much. The fact that inbreeding depression correlated so modestly with h2 (r=0.26, rho=0.00) is again best explained by the low reliability of, at least, h2. It also correlates with the BW gap in the WISC (about 0.40) but not with the BW gap in the WAIS; keep in mind, however, that the BW gap in the WAIS is based on only one sample (N=1880). Once again, more reliable data are needed. Inbreeding was correlated with the biological parents’ SES effect on adoptees, not with the adoptive parents’ SES effect on adoptees, consistent with the genetic g hypothesis. It has some small correlations with FE gains, and shows a strong negative correlation with mental retardation.

**Adoption gain**. In Intelligence and How to Get It (2009, pp. 240-241, footnote 33), Nisbett challenged Jensen’s (**1997**) analysis of the Capron & Duyme adoption study (**1989**, **1996**). Inspired by Flynn’s rebuttal (1999) to Rushton’s PC analysis, Nisbett’s expectation was that, since g_Fluid showed a positive relationship with FE gains while g is negatively correlated with FE gains, the g and g_Fluid correlations with adoption (IQ) gains would have opposite signs as well. The right answer should be: “it depends”. If g and g_Fluid correlate substantially (e.g., 0.80 or 0.90), there is certainly a high chance this can occur. But if the g and g_Fluid correlation is just modest or acceptable (e.g., 0.30 or 0.50), there is no reason to expect this pattern, especially with a small subtest number. Indeed, when I look at my data closely, WISC_Gf correlates at 0.400 and 0.142 with SES_bio (i.e., the SES effects of biological parents on the adoptees) and SES_adopt (i.e., the SES effects of adoptive parents on the adoptees), respectively. WAIS_Gf displays the same pattern, with correlations of 0.694 and 0.004. The absence of a g*adopt correlation may suggest the absence of a Jensen effect on shared environmental influences (c2). This awaits replication.

Incidentally, SES_bio has a substantial correlation with h2 while being negatively correlated with c2 and e2. Unexpectedly, SES_adopt correlates only modestly with h2, although it correlates strongly with c2 and negatively with e2, as one would expect. Also ambiguous is the fact that both SES_bio and SES_adopt correlated positively with some FE gains and negatively with others. Both the small sample sizes of the adoption data and the modest reliability of the FE gains can explain this huge variability. Finally, and as g-theory would expect, SES_bio correlated negatively with mental retardation (MR) scores, while SES_adopt showed no correlation at all.

**Secular gains**. The Flynn effect gains were not always consistent in their correlations with the other variables. For instance, the FE gains in the US, Germany, and Austria correlate negatively with the g-loadings, whereas the Scotland gains showed positive correlations. For the latter, we should keep in mind that Scotland had data for only 6 subtests, instead of 10 for the other Flynn effects. Having said this, some vectors of gains correlate with the vectors of BW differences and some do not. Also, some had positive correlations with the WAIS/WISC Gf and some had negative correlations. A first predictable answer is obviously the method of averaging the WISC and WAIS Gf, since WISC-Gf was correlated with FE gains. A second predictable answer is the modest reliability of these Flynn effects. Ignoring the Scotland sample, the FE gain vector reliabilities are as follows: 0.459, 0.462, 0.704, 0.725, 0.536, 0.757, with a mean of 0.607. Larger batteries should have higher reliabilities, thus diminishing the probability of finding outliers. In this regard, Flynn’s method of averaging all the Flynn gains (into a single variable) is not a bad idea, but I don’t think Rushton’s choice of not averaging the FE gains in the PCA is wrong either.

Whereas the WAIS/WISC Gf correlations with FE gains were inconsistent, WAIS Gf had only large negative correlations with FE. And, as said earlier, WISC Gf correlates positively with nearly all the Flynn gains except the US#1 and Scotland gains. A striking feature is that the secular gains correlate negatively with cultural loadings. Among the many explanations of the Flynn effect, genetic causes have to be discarded. All that remains is either cultural change (e.g., Wicherts, **2004**) or biological change (Lynn, **2009**). As surprising as it appears, evidence shows that the effects of biological-environmental factors are not correlated with g (Metzen, **2012**). Thus, if cultural change over time is the most likely explanation for the absence of g-loadedness of the FE gains, implying that cultural influences are unrelated to g, as suggested by a large amount of evidence (Jensen, **1973**, pp. 113-117, **1997**; te Nijenhuis et al., **2007**; Ritchie & Bates, **2013**; Hu, **Aug.18.2013**), we are still left wondering why g and cultural loadings are correlated, and especially why the latter is negatively related to the Flynn effects. See the Discussion section.

**3b. Regression analyses**

**G-fluid, G-crystallized and culture load**. Entering the g-loadings as the dependent variable, with culture load and Gf as independent variables, the expected pattern would be that culture load best predicts the g-loadings. To illustrate:

If we think using WAIS_Gf (instead of WISC/WAIS Gf) would change this picture, it in fact looks the same. There is no doubt that culture load partially mediates the Gf*g correlation. But as I said before, the correlation between g and culture load is about 0.80 or 0.90. On the other hand, when entering Gf as the dependent variable, with culture load and g-loadings as independent variables, g predicts Gf while culture load has a negative coefficient.
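The mediation logic used throughout these regressions can be sketched in a few lines: a predictor is said to be (fully) mediated when its coefficient collapses to zero once the mediator is entered. Invented, noise-free data in which g fully carries the culture-load effect:

```python
import numpy as np

# Invented vectors: BW is driven entirely by g; culture load is g plus a
# component unrelated to g (so culture correlates with BW only through g).
g       = np.array([1., 2., 3., 4., 5., 6.])
unique  = np.array([-1., 1., 0., 0., 1., -1.])  # orthogonal to g and intercept
culture = 0.8 * g + unique
bw      = 2.0 * g

# Bivariate: culture alone "predicts" BW.
Xc = np.column_stack([np.ones(6), culture])
b_biv, *_ = np.linalg.lstsq(Xc, bw, rcond=None)

# Enter g alongside culture: the culture coefficient collapses to zero,
# the signature of full mediation by g.
Xg = np.column_stack([np.ones(6), culture, g])
b_med, *_ = np.linalg.lstsq(Xg, bw, rcond=None)

print(round(b_biv[1], 3), round(b_med[1], 3))  # positive, then ~0.0
```

Of course, with real vectors the coefficient rarely hits zero exactly, which is why the text distinguishes partial from full mediation.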

While culture load best predicts the g-loadings here, this wasn’t the case in the Johnson/Bouchard (**2011**) data on 42 subtests from 3 large batteries (CAB, HB, WAIS), from whose intercorrelations I computed the 42 subtests’ g-loadings. There, Gf partially mediates culture load in the prediction of (my own estimated) g-loadings. On the other hand, when using Kan’s g-loadings as the dependent variable instead, culture partially mediates Gf in predicting the g-loadings (see the attached XLS, sheet #1). I don’t have a clear idea of the reasons. Kan’s g-loadings and my g-loadings are poorly correlated; 0.73 seems high, but it is much smaller than the usual g-loading vector reliability.

In any case, when culture load and Gf (or g) are used to predict h2, there is no mediation between the two predictors when MZA (twins reared apart) correlations are used as estimates of h2. When Johnson et al.’s (**2007**) own estimates of genetic influences are used instead of MZA, both g and Gf appear to (partially) mediate culture load in the prediction of h2.

**Black-white differences**. Fuerst noticed that BW*culture_load is mediated by g while the reverse does not hold. I confirm this. In addition, we can see that culture load is also (partially) mediated by inbreeding depression, or by Gf, or by SES_bio in the prediction of the BW difference. The picture below is self-explanatory.

When using the BW-WAIS difference (instead of BW-WISC), the picture looks the same for the effects of g, Gf and SES_bio, which fully mediate culture load in the prediction of the gap. However, inbreeding depression had a regression coefficient of zero, which is not surprising since we know that the inbreeding*BW-WAIS bivariate correlation was also zero. With regard to h2, there were ambiguities, because h2 fully mediates culture in the prediction of the BW-WAIS gap while at the same time culture fully mediates h2 in the prediction of the BW-WISC gap. In general, and perhaps due to unreliability issues, there is actually no certainty that the indices of heritability mediate the culture*BW correlation.

**Adoption gains**. Some results may appear very odd, but given the small sample size of the adoption data, we shouldn’t put too much weight on them. Now, using h2 and g (WAIS or WISC) along with culture load to predict SES_bio, I obtain the following:

These numbers are self-explanatory. g consistently has more explanatory power than h2, and culture explains nothing. Next, I use Gf instead of g to predict SES_bio. Depending on whether we use the WISC or the WAIS, culture load is either partially or fully mediated by the other variables, namely h2 and Gf, both of them having equal predictive power. In reality, I don’t even need to include h2. When only culture load and g are entered as independent var., g again fully mediates culture load. When I use WAIS_Gf or WAIS/WISC_Gf instead of g, culture load is only partially mediated or not mediated at all, respectively.

Trying to replicate this result using inbreeding depression instead of h2, the same picture emerges, that is, g having a strong relationship with SES_bio and inbreeding depression a modest one. Similarly, when I use inbreeding depression, culture load and Gf to predict SES_bio, the picture bears some resemblance, except that culture has a modest positive coefficient, which disappears if I use WAIS_Gf instead of WAIS/WISC_Gf.

The odd thing emerges when using SES_adopt as the dependent var. With h2, culture load and g as predictors, h2 always has a strong relationship with SES_adopt, while the other variables have none at all. More or less the same picture appears when using Gf instead of g; the only difference is that h2 has less predictive power.

**Heritability and inbreeding depression**. If h2 and inbreeding depression are indeed correlated, it remains to be seen whether g mediates this relationship, that is, whether g is central to the relationship between the two indices of heritability. This appears to be the case, whether g and h2 are entered as independent var. with inbreeding depression as the dependent var., or g and inbreeding depression are entered as independent var. with h2 as the dependent var., remembering that the correlation of g with h2 (or inbreeding depression) is much stronger than the relationship between h2 and inbreeding depression.

Next, I use g-fluid along with inbreeding D to predict h2, and g-fluid with h2 to predict inbreeding D. At first glance g-fluid does not seem to act as a mediator of the relationship between h2 and inbreeding D, but we have to keep in mind that the low reliability of Gf may underestimate its importance.

When I use culture load and g-loadings to predict h2, the g×h2 correlation is partially mediated by culture load. Similarly, when using g-fluid, the contribution of g-fluid is also partially attenuated, but less so. Generally, Gf consistently had more power than Gc to predict h2 while controlling for culture load.

The picture is quite similar when h2, g, and culture load are entered together to predict inbreeding D: g and culture load both significantly predict inbreeding D (with h2 being zero or negative), although g tends to be partially mediated by culture load. If we use Gf instead of g-loadings, Gf has weak explanatory power. Again, what would account for the difference? To repeat, the reliability of h2 and Gf is quite low, and we do not know the reliability of the inbreeding D vector. Nonetheless, we know that both g and culture load strongly correlate with inbreeding D and h2. But whereas cultural loadings, and not g-loadings, best predict heritability or inbreeding depression, we see that g-loadings mediate cultural loadings in predicting SES_bio.

**Shared environment**. As explained before, c2 had very low reliability, so these results should be interpreted carefully. When I enter both culture load and g as predictors of c2, culture load becomes strongly positively related to c2 and g strongly negatively related to it. The picture looks similar when using Gf, but to a lesser extent. Next, when I enter c2 with g or Gf to predict the BW difference (in either the WISC or the WAIS), c2 does not predict BW differences.

**Mental retardation**. When entering g and inbreeding depression to predict MR, g fully or partially mediates the ID×MR relationship. One of the regression weights falls outside the usual range (−1, +1); this can happen, although rarely, when predictors are highly correlated (a suppression effect). Anyway, when I use h2 in place of inbreeding depression, g does not mediate the h2×MR relationship, nor does h2 mediate the g×MR relationship: both show negative correlations with MR.

We have seen before that culture load correlates negatively with MR. But more often than not, when culture load and g are entered as predictors of MR, g mediates the culture×MR relationship.

Regarding the indices of heritability: while h2 and inbreeding D correlate negatively with mental retardation (WISC, WISC-R, WAIS, WAIS-R), g appears to mediate inbreeding D (but not h2) in the prediction of MR. Both h2 and g have negative regression weights, although h2 tends to partially mediate g in the WISC and WISC-R. In the WAIS, h2 fully mediates g; but in the WAIS-R there is no clear mediation, and both significantly predict lower scores among mentally retarded people. Using Gf instead of g is useless: WISC_Gf had a positive correlation with MR, WAIS_Gf a negative one, and the combined WAIS/WISC_Gf a correlation of zero. Again, low reliability poses a problem for such analyses.

**4. Discussion**

Concerning the above findings, some variables (e.g., MR, inbreeding depression, SES_bio and SES_adopt, and FE gains) had missing values, that is, one subtest was missing (generally Digit Span). Thus, when partial correlation or regression analyses were applied, the bivariate correlation column (an available option in regression output) differed somewhat from the normal bivariate correlation, because the output is based on a listwise deletion procedure.

Overall, I am not fully satisfied with the regression analyses, owing to the difficulty of interpreting the output when using unreliable variables and data based on small samples, not to mention the missing values in some variables. Beyond this, I am not satisfied with the culture load variable either. I am much more confident that g mediates inbreeding D and h2 than that culture load mediates the relationships of g and h2 with inbreeding D, owing to my doubts about the construction of the culture load variable. That variable behaves as if it were some sort of heritable-g index rather than the generalized cultural factor it is supposed to approximate. The absence of any relationship with environmental indices such as FE gains, adoption gains, and e2 (and perhaps c2) substantiates my point. That culture load correlates with both Gc and Gf could mean either that the culture load variable is not well constructed or, rather, that Gc and Gf are indistinguishable. If the latter is the case, the inescapable conclusion would be that Kan's separation of g-crystallized from g-fluid, a dichotomy that Flynn had considered before, is highly misleading, and thus not justified.

Remembering what was said in the Introduction section, it seems likely that Flynn's WISC g-fluid is anomalous and that WAIS g-fluid may even be more accurate than the combined WAIS/WISC g-fluid. As noted before, g-loadings in crystallized-biased IQ batteries tend to be correlated with measures of fluid intelligence, and Raven's Progressive Matrices often exhibit the highest g-loadings when factor analyzed along with a variety of other tests, even the supposedly crystallized-biased Wechsler (Jensen, **1985**, p. 227, **1998**, pp. 38, 120, 126-127; Vernon, **1983**, Table 7; Rijsdijk, **2002**). Having factor analyzed the Rijsdijk (**2002**) and Johnson & Bouchard (**2011**) correlation matrices myself, there is no doubt that the Raven has a relatively high g-loading, about as high as the Verbal IQ (VIQ) subtests and much higher than the Performance IQ (PIQ) subtests. Performing another MCV test, again using Johnson & Bouchard's (**2007**, **2011**) wide test battery, the 41 (remaining) subtests' correlations with the Raven correlate with Kan's culture load hierarchies (#1 and #2) at about the same magnitude as they correlate with g, which in turn is correlated with culture load to the same extent. Overall, the problem with Kan's expectation that culture load will track g-loadings is that the evidence for higher g-loadings of crystallized over non-crystallized subtests, when analyzing several large and diverse batteries, is not well established (Marshalek et al., **1983**; Ashton & Lee, **2006**).
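The MCV tests mentioned here all reduce to one operation: correlate the column of subtest g-loadings (or culture loadings) with the column of subtest-criterion correlations. A toy sketch of the method of correlated vectors, where the two seven-element vectors are invented for illustration and are not Kan's or Johnson & Bouchard's values:

```python
import numpy as np

# Method of correlated vectors (MCV), minimal sketch.
# g_loadings: hypothetical g-loadings of seven subtests.
# r_criterion: each subtest's correlation with some external criterion
# (e.g., its correlation with the Raven) -- both columns are made up.
g_loadings = np.array([0.81, 0.74, 0.69, 0.66, 0.58, 0.52, 0.45])
r_criterion = np.array([0.55, 0.49, 0.51, 0.40, 0.38, 0.30, 0.28])

# The MCV statistic is simply the Pearson correlation of the two vectors.
mcv_r = np.corrcoef(g_loadings, r_criterion)[0, 1]
print(round(mcv_r, 2))  # a strongly positive value is a "Jensen effect"
```

In practice the vectors would be corrected for subtest reliability and sometimes weighted by sample size, which this sketch omits; low vector reliability, as discussed throughout this post, attenuates mcv_r.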

Now, back to the topic of the Flynn Effect: I indeed failed to replicate Flynn (**2000**). This is not because Flynn's method was wrong. What is unfortunate is that Flynn never reported the g_Fluid vector reliability. Knowing it would have greatly helped us understand what was going wrong, because among Must (**2003**), Rushton (**2010**), and te Nijenhuis (**2012**), who have commented on Flynn's paper, no one gave me the impression that they understood what was happening. I came to understand the Flynn/Rushton anomaly simply because I computed the reliabilities. So, while I was unable to replicate Flynn's (**2000**) rebuttal to Rushton's (**1999**) study, I do not believe my result is definitive. But it is surely more accurate than Flynn's numbers, owing to the inclusion of additional samples.

**Appendix**

```
MATRIX DATA VARIABLES=g_Fluid_WAIS_WISC_AVG Inbreeding_Depression_WISC BW_WISC Flynn_Effect_WISC_AVG
  /contents=corr
  /N=5000.
BEGIN DATA.
1
0.227 1
0.715 0.476 1
0.028 0.175 0.128 1
END DATA.
EXECUTE.

FACTOR MATRIX=IN(COR=*)
  /MISSING LISTWISE
  /PRINT UNIVARIATE INITIAL CORRELATION SIG DET KMO EXTRACTION
  /PLOT EIGEN
  /CRITERIA MINEIGEN(1) ITERATE(25)
  /EXTRACTION PC
  /ROTATION NOROTATE
  /METHOD=CORRELATION.
```
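For readers without SPSS, the first principal component can be extracted from the same correlation matrix with a short script; this is only a cross-check sketch, using a plain eigendecomposition in place of SPSS's FACTOR procedure:

```python
import numpy as np

# Correlation matrix from the MATRIX DATA block above (lower triangle mirrored).
R = np.array([
    [1.000, 0.227, 0.715, 0.028],
    [0.227, 1.000, 0.476, 0.175],
    [0.715, 0.476, 1.000, 0.128],
    [0.028, 0.175, 0.128, 1.000],
])
names = ["g_Fluid_WAIS_WISC_AVG", "Inbreeding_Depression_WISC",
         "BW_WISC", "Flynn_Effect_WISC_AVG"]

# eigh returns eigenvalues in ascending order for a symmetric matrix.
eigvals, eigvecs = np.linalg.eigh(R)
pc1 = eigvecs[:, -1] * np.sqrt(eigvals[-1])  # loadings on the first PC
pc1 *= np.sign(pc1.sum())                    # fix the arbitrary eigenvector sign

for name, loading in zip(names, pc1):
    print(f"{name}: {loading:+.3f}")
```

Unrotated PC loadings are the eigenvector scaled by the square root of its eigenvalue, which is what SPSS reports in the unrotated component matrix; the /CRITERIA MINEIGEN(1) rule corresponds to keeping components with eigenvalue above 1.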

McKay & McDaniel (**2006**, p. 546) as well as te Nijenhuis & van der Flier (**2005**) demonstrate, by way of the method of correlated vectors, that the cognitive loading of the jobs performed correlates with racial differences: the black-white or minority-majority difference increases with the complexity of the job. This is fully consistent with the Spearman-Jensen theory, to the extent that differences increase with g (Jensen, **1998**, pp. 377-378). It shows that the importance of g is not confined to the within-group dimension.

Jensen (**1998**, pp. 276-277, 280, 283, 286) tells us that the subtests' g-saturations in cognitive batteries are correlated with the predictive validity coefficients (i.e., correlations with measures of social outcomes) of those same subtests. In the same way, NLSY97 data show that ASVAB subtests' g-loadings correlate with their correlations with grade point average (GPA) (Hu, **Sept.21.2013**). This holds true whether we look at the black, hispanic, or white sample. Furthermore, controlling for both parental income and parental education reduces this correlation only slightly.

However, if past research demonstrates the preponderance of g (e.g., Ree & Earles, **1991**, **1994**), it remains possible that the importance of specific/narrow latent abilities has been underestimated. Reeve (**2004**) tells us that the methods previously employed failed to properly isolate specific cognitive latent factors from the general cognitive latent factor, due to variance attributable to random measurement errors: “Often studies of cognitive abilities have relied on the observed subscales of a test (as defined at the discretion of test constructor) as a construct-valid surrogate for narrow ability constructs (e.g., Hunter, 1986; Thorndike, 1991). … studies of the validities of narrow abilities often estimate these factors with the variance due to g included. Thus, the correlations with outcomes reflect both the variance due to g as well as the unique variance due to the specific factor.” (p. 625). Thus he proposes the use of latent variable models with structural equation modeling (SEM), since they evaluate these specific abilities more precisely (the latent, non-observed factors are deemed to represent the variance shared by the observed, measured variables). The conclusion of his analysis (pp. 635-639) is that the general cognitive latent factor is the predominant predictor of general (not narrow) knowledge, that the specific cognitive latent factors add no variance to the prediction of the general knowledge factor, and that the specific factors had correlations with some domains of specific knowledge, even if some of these structural path coefficients approach zero. Having established the importance of specific latent abilities for specific knowledge criteria, he noted: “Although narrow abilities may be important predictors of some narrow criterion factors, this does not necessarily indicate that these criterion factors are practically meaningful. Indeed, the general knowledge factor, which was predicted solely by g, accounts for the majority of the total variance underlying the criterion construct space. Thus, the narrow abilities, despite their evident psychological significance, may only have practical significance in situations where the range of general ability has been restricted (e.g., among a group of doctoral graduate students, where prescreening produces significant range restriction in g).” (pp. 639-640). Finally, one limitation of his study is that some of the knowledge tests used had low reliability, which could have attenuated the relationship between the specific knowledge factors and the specific abilities.

Reeve (**2004**) and Coyle (**2013**) noted that because non-g factors or residuals predict narrow abilities, this pattern is not inconsistent with investment theory, in which non-g ability is assumed to reflect specific skills that contribute to specific abilities. Investing in skills in one domain comes at the expense of developing abilities in other domains, thus producing negative relations between those skills and the other, competing abilities.

Hence the contribution of non-g abilities is expected to be negligible above g, and this has been confirmed again. From a different angle, the importance of measurement error has been highlighted in Brown et al.'s (**2006**, Table 5) SEM study as well:

As noted by Schmidt et al. (1981), theory-driven research should examine validities at the true score or construct level. The true-score level refers to the relationship among the constructs free from measurement error and other statistical biases. Examining validities calculated on imperfect measures often produces an inaccurate picture of the relative importance of the abilities themselves. This occurs because partialling out imperfect measures does not fully partial out the effects of underlying constructs (Schmidt et al., 1981). To illustrate, suppose that Ability A is a cause of training performance but Ability B is not. Suppose further that the tests assessing these abilities have reliabilities of .80 and are positively correlated (as occurs with all mental ability tests). Because Ability B is correlated with Ability A, Ability B will show a substantial validity for training performance. Moreover, because Ability A is not measured with perfect reliability, partialling it from Ability B in a regression analysis would not partial out all of the variance attributable to Ability A. Thus, the measure of Ability B will receive a substantial regression weight when in fact the construct-level regression weight is zero. That is, Ability B will predict training performance even though it is not a true underlying cause of training performance. In this case, Ability B will appear to increment validity over Ability A only because of the presence of measurement error (see also Schmidt & Hunter, 1996).

Three levels of abilities can be distinguished: specific aptitudes (6 ASVAB subtests), general aptitudes (3 latent common factors derived from those six ASVAB subtests), and GMA (the equally weighted sum of the 3 general aptitude scores). The authors test whether specific mental abilities have incremental validity for predicting performance above and beyond general mental ability (GMA, or g). Specific aptitude theory would predict the pattern displayed in Figure 2 to fit better than the model of Figure 1, while g-theory predicts no better fit for Figure 2. In fact, the inclusion of either general or specific aptitudes yields no prediction gain over and above GMA itself. Moreover, as expected, g has higher validity in training programs of higher complexity.

The critics once claimed that work experience would become more important over time, but the data show the opposite: the correlation between experience and performance declines over the years (Schmidt & Hunter, **2004**, p. 168) and as job complexity increases (Gottfredson, **1997**, p. 83). Furthermore, the predictive validity of IQ (g) does not decline with increasing levels (years) of job experience.

Evidence that complexity was the chief ingredient in the correlation of intelligence with success is provided by Ganzach et al.'s (**2013**) finding that occupational complexity mediates the correlation between g and income, and by the finding that job complexity mediates correlations between IQ and job performance (Gottfredson, **1997**, p. 82). The decline in the standard deviation (SD) of IQ scores with increasing occupational level also supports this interpretation (Schmidt & Hunter, **2004**, p. 163), as it indicates that success is closely related to g through an increasing minimum cognitive threshold. Another piece of evidence (Schmidt & Hunter, **2004**, p. 163) comes from longitudinal studies showing that earlier IQ predicts later movement in the job hierarchy: lower-IQ persons move down, higher-IQ persons move up. Furthermore, when a person's IQ exceeds his job's complexity level, he tends to move up into a higher-complexity job, and vice versa. Ganzach (**2011**) reached similar conclusions: SES affects job-market success only through entry pay, whereas IQ exerts its effect mainly through the pay trajectory. Again, this pattern is more consistent with causal-IQ theories than with social theories.

Among the many criticisms of IQ, we can single out the so-called “threshold hypothesis”, which states that beyond a certain level of IQ (e.g., 120), IQ has no correlation with social outcomes. But Jensen (**1980**, pp. 318-319; **1998**, pp. 289-290) indicates several times that this idea is not empirically confirmed, because IQ relates linearly to academic achievement at any given level of IQ. We have a graphical illustration of this below:

The critics also invoke the general idea that the correlation between IQ and social outcomes is spurious because it is confounded by socio-economic status (SES) and other family-related influences. The problem is well understood when the said hypothesis is actually tested. Murray (**1998**) showed in a large sibling study that the sibling with the higher (lower) IQ ends up at a higher (lower) level of socio-economic status relative to the other. Because siblings share the same parents, and thus the same familial influences, SES-related arguments must be discarded. In parallel, the fact that SES has no considerable impact on the correlation between IQ and scholastic tests is shown by partial correlation analyses (with SES partialled out), which tend to reduce the initial bivariate correlation only slightly (Jensen, **1980**, p. 336). Also, IQ correlates more with an adult's attained SES than with his parents' SES (Jensen, **1973**, p. 236, **1998**, p. 384; Herrnstein & Murray, **1994**, ch. 5 and 6; Saunders, **1996**, **2002**; Strenze, **2007**, Table 1, fn. 9). This suggests, of course, that parental SES is not the driving factor, but instead that IQ determines SES more than the reverse. It is consistent, for example, with the failure of adoption studies (e.g., Capron & Duyme, **1989**, **1996**) to show a Jensen effect (i.e., a relationship with g) on adoption gains when low-SES abused children had been adopted by rich families (Jensen, **1997**; Hu, **Feb.19.2013**). A similar conclusion can be drawn from the path analysis by Rice et al. (**1988**), who report that familial environmental influences on children's IQ, measured by HOME indices, show direct environmental effects at 1 year of age, direct environmental effects as well as indirect genetic effects through parental mediation at 2 years of age, and only indirect genetic effects at 3 and 4 years of age. This would suggest that environmental influences are most effective at the earliest ages (infancy rather than early childhood). All this must be considered alongside the fact that educational interventions usually **fail** to improve g in the long run (see Ritchie & Bates, **2013**).

Gottfredson (**1997**) also provided indirect proof that g is likely a causal entity. For instance, learning/training does not improve general productivity at work, because g itself already accounts for a large share of the explained variance. This makes sense of why learning improves only specific abilities: the gains are not transferred to non-trained, novel tasks. Understanding the concept of g as the capacity for dealing with novelty and non-routine tasks helps us understand why the many hopes placed in job training cannot be realistic. In parallel, this is consistent with the many failed replications of working memory training experiments aimed at raising IQ, and more specifically g, by generalizing the gains to other, non-trained cognitive tasks (Hu, **Nov.1.2012**). As Murray (**2005**) already explained:

Suppose you have a friend who is a much better athlete than you, possessing better depth perception, hand-eye coordination, strength, and agility. Both of you try high-jumping for the first time, and your friend beats you. You practice for two weeks; your friend doesn’t. You have another contest and you beat your friend. But if tomorrow you were both to go out together and try tennis for the first time, your friend would beat you, just as your friend would beat you in high-jumping if he practiced as much as you did.

Furthermore, Gottfredson points out (**1997**, pp. 86, 91-92, 108) how difficult it is to reduce individual differences in performance by way of training and learning, insofar as those differences can be magnified rather than reduced (see also Ceci & Papierno, **2005**). The reason is that intelligent people learn at a faster rate, even when all employees are exposed to the same instruction. Needless to say, this is entirely consistent with the concept of a causal g, but not with theories implying that g is built from the outside.

Now, regarding the possible mechanisms behind the IQ-achievement correlation, processing speed is likely a good candidate. According to Jensen (**1998**, **2006**), processing speed could well be the purest manifestation of g, insofar as it reflects the quality of information processing in the brain; this matters in light of Spearman's hypothesis, because elementary cognitive tasks (ECTs) appear to mediate black-white differences in IQ (Pesta & Poznanski, **2008**). We already know that there is a significant correlation between IQ and processing speed (Jensen, **1998**, pp. 234-238, **2006**, pp. 160, 171-172, 195; see also Grudnik & Kranzler, **2001**; Sheppard & Vernon, **2008**), that this correlation is almost entirely mediated by genetic factors (Jensen, **1998**, p. 233, **2006**, pp. 130-131; Lee et al., **2012**), and that elementary cognitive tasks are less amenable to cultural, learning, and personality factors (Rindermann & Neubauer, **2001**), which do not predict processing speed (see also Jensen, **2006**, pp. 175-178), making these tests an even better measure of intelligence.

We still have to determine the causal mediation. First of all, Luo & Petrill (**1999**) showed that psychometric g and chronometric g are similar, in the sense that the intrinsic nature of g is not altered when ECTs are added to the traditional psychometric tests; it also appears that the memory processing component contributes substantially to g estimates. Next, Luo et al. (**2003a**) found that the CAT group factors (as opposed to the general factor) are not important in predicting achievement (measured by the MAT), while the g factor derived from the CAT (call it CAT-g) affects achievement measures essentially through the genetic paths, as assessed by the more substantial chi-square changes (used to assess model fit) for the genetic path than for the shared and non-shared environmental paths (Table 8 below). Subsequently, Luo et al. (**2003b**) demonstrated that psychometric g as measured by the WISC (call it WISC-g), while correlated with achievement (MAT variables), was substantially mediated by processing speed (as measured by the CAT). In addition, some non-chronometric, memory processing measures had high loadings on the mental speed component. Overall, such findings cohere with Jensen's theory of mental speed (see Jensen, **2006**, pp. 212, 216-217, 224, 226-227), which also predicts a synergistic interaction between memory processes and speed. The reason could be that information processing speed allows one to consolidate more of the information to which one is exposed in long-term memory (LTM). Besides, the concept of consolidated versus non-consolidated IQ gains was made explicit by Jensen long ago (**1973**, pp. 79-97).

Consistent with Dasen Luo's conclusions, Rindermann & Neubauer (**2004**, pp. 581-586) estimated that the direct effect of information processing speed on scholastic performance, while significant, was weak compared to the indirect effect (mediated through intelligence and creativity). As shown by Rindermann et al. (**2011**) using a latent variable approach, mental speed bears a non-trivial relationship with writing ability after controlling for intelligence, parental education, and books at home, even though the impact of intelligence is much stronger than that of speed. Although parents' education and books at home draw a pathway towards mental speed, the authors assume that they can have such an impact only through a genetic-biological (not environmental) pathway (see Luo et al., **2003a**), because mental speed must depend on a basic biological component (e.g., Penke et al., **2010**). In a similar vein, Rohde & Thompson (**2007**) showed that specific cognitive abilities such as processing speed, working memory, and spatial ability affect scholastic performance (as measured by GPA, SAT-verbal, SAT-math, and WRAT-III) indirectly, through intelligence (as measured by the Raven and a vocabulary test). When specific cognitive abilities are controlled, general intelligence still adds to the explained variance in scholastic performance; conversely, when measures of general intelligence are controlled, specific abilities account for no additional variance in the WRAT and GPA but do add to the prediction of the SAT. Having noted this, the authors explain that specific abilities can still affect academic performance beyond general cognitive ability. Another study, by Vock et al. (**2011**), reports that mental speed has only an indirect impact on academic performance, through reasoning and divergent thinking (DT, an index of creativity), while short-term memory (STM) had both direct and indirect influences (through reasoning and divergent thinking); direct paths from mental speed to achievement were not significant. Interestingly, the indirect effect of mental speed diminishes when short-term memory is included in the model because, as the authors say, these two variables share common variance. The authors recall, however, that only the active storage component of short-term memory (as opposed to passive storage) serves as a good predictor of general cognitive ability or scholastic performance. Besides, the relationship is not expected to be distorted by timed testing, as demonstrated by Preckel et al. (**2011**), who found the relation between divergent thinking (DT) and reasoning ability to be mediated by mental speed. Finally, the only exception comes from Dodonova & Dodonov (**2013**), who failed to replicate previous research: they found instead that processing speed and intelligence each have a unique contribution to school achievement.

The importance of processing speed as a mediator of g has been highlighted by Coyle et al. (**2011**) in the large NLSY97 sample. They found that the direct path from age to the development of g was not significant; instead, improvement in processing speed (PS) was associated with increases in g over the course of development, and the effect of age on g was virtually fully mediated by processing speed (note: their model 4, below, showed the best fit to the data).

Although processing speed strongly mediated the development of g in our study, it may have done so indirectly through cognitive processes such as encoding, inhibition, or retrieval. These other cognitive processes have been found to be related to each other and to processing speed (Carroll, 1993). However, processing speed is assumed to broadly constrain cognitive development by limiting the speed of all cognitive processes (e.g., Kail, 1991). Thus, processing speed would also be expected to constrain the development of g, which reflects all cognitive processes (cf. Jensen, 2006; Kail, 2000).

Our findings are consistent with theories emphasizing the role of processing speed in children’s cognitive development (Jensen, 2006; Kail, 1991, 2000). These theories assume that increases in processing speed contribute to global improvements in cognition, which are observed as increases in g. Such improvements in cognition and g have been attributed to neural changes including increases in nerve conduction velocity, neuronal oscillations, and brain myelination (Jensen, 2011; Miller, 1994).

A similar pattern has been observed previously. Nettelbeck (**2010**) managed to replicate Kail's (**2007**) earlier SEM study, in which age causes PS, which in turn causes greater working memory (WM) capacity, which in turn causes higher reasoning ability (e.g., as measured by the Raven's or the Cattell Culture Fair Test). This model better fit (i.e., best explained) the data. Besides, in Nettelbeck's sample, older adults showed a decline in reasoning ability due to slower processing speed, but also due to age-related factors influencing working memory independently of processing speed.

Further considerations are provided by the SEM analyses of Demetriou et al. (**2013a**, pp. 40, 42, 44; see also Coyle, **2013**; Kail, **2013**; Demetriou et al., **2013b**, **2013c**), which established that processing speed predicts g-fluid better than working memory during transitional periods, when new mental abilities are created, whereas working memory predicts g-fluid better than speed during stable periods, when existing abilities are consolidated and more strongly related to each other. Both PS and WM are relevant, because speed of processing allows people to handle the information flow more efficiently during problem solving, whereas working memory allows people to represent and process more information units at the same time. In general, the description of their analyses (for the study in the middle childhood phase) may be best summarized as follows:

In the model fit on the longitudinal data, we used the speed and the working memory scores of the first testing wave and the gf scores of the second testing wave. That is, in the first group, speed and WM at 6, 7, or 8 are used to predict gf at 7, 8, and 9 years. In the second group, speed and working memory at 9 or 10 are used to predict gf at 10 or 11 years. To test the assumption that the structure of abilities does not vary with time but their relations might vary, we constrained all relations between measures and factors to be equal across the two groups and we let the structural relations vary freely. It is recognized that the relatively small number of participants in the two age blocks compared may weaken the statistical power of structural relations. To compensate for this problem the number of measurement in these models was kept to a minimum.

The fit of this model was excellent (see Table 1). The relations between constructs are patterned as expected. In the younger age group, the relations between age and speed (−.71), and gf (.48) were significant and much higher than in the older age group (−.14 and 0, respectively). However, the relations between working memory and speed and working memory and gf were much higher in the older (−.44 and .64) than in the younger age group (−.28 and .49, respectively). Thus, it seemed that in the first phase speed reflected age changes because in this phase children became extensively faster in processing and relatively better in gf. In the next phase, gf changes converged increasingly with working memory, reflecting an across the board expansion of thought towards the capacity indexed by WM.

This has implications for Spearman's law of diminishing returns for age (SLODR-age), since increasing g “results in differentiation of cognitive abilities because excessive g allows for investment into domain-specific learning, thereby fostering domain autonomy” (p. 38), which tends to converge with Woodley's (**2011**) CD-IE thesis. Demetriou's general discussion is worth citing as well:

Our findings about structure confirmed both the SLODRage prediction that cognitive processes differentiate from each other (prediction 4a) and the developmental prediction that they become increasingly coordinated with each other (prediction 4b). Differentiation was suggested by the fact that, with age, different types of mental processes are expressed through process-specific factors rather than through a more inclusive representational factor. These process specific factors tended to relate increasingly with a general factor at a subsequent phase, reflecting an integration of previously differentiated processes. This concurrent differentiation/integration of cognitive processes necessitates a redefinition of the nature of cognitive development. Our findings suggested that intellectual power increases with development because cognitive processes and reasoning evolve through several cycles of differentiation and integration where relations are dynamic and bidirectional. According to the present and other research (Demetriou, 2000; Demetriou & Kazi, 2001, 2006; Flavell, Green, & Flavell, 1995), differentiation goes from general cognitive functions to specific cognitive processes and mental operations. Integration follows the trend, focusing on increasingly specific processes and operations. Differentiation of cognitive processes allows their control because they may be individually regulated according to a goal-relevant plan. Integration of mental operations generates content-free inferential schemes that can be brought to bear on truth and validity. In both differential and developmental theories, the differentiation/integration process always applies on inferential processes. Moreover, in developmental theory, the state of their coordination frames the functioning of all other cognitive processes, such as language, mental imagery, and memory, imposing a stage-specific overall worldview (Piaget, 1970).

At the biological level, Penke et al. (**2012**, Figure 2) used SEM to test the possible pathways from three different indicators of white-matter tract integrity to general intelligence (g). They found that this path was fully mediated by information-processing speed. They hypothesized this pathway because efficient information processing between distal brain regions is thought to rely on the integrity of their interconnecting white-matter tracts: well-connected white matter favors efficient information processing. See also Jung & Haier (**2007**), who emphasized brain functional connectivity to explicate the neural basis of intelligence.

This aside, because tests of the direction of causality between IQ and achievement are the most needed, Watkins et al. (**2007**) conducted such a test, using SEM to estimate the arrow of causality between IQ (test) latent factors and achievement (test) latent factors across two points in time — in short, a cross-lagged path analysis. Their Model 2 (M2), in which each IQ latent factor at Time 1 had direct paths to the IQ and achievement latent factors at Time 2 (the Achievement1→IQ2 path being non-significant), was selected as the best-fitting model, whereas alternative models (e.g., M3, which posits direct paths from the achievement latent factor at Time 1 to both the achievement and IQ latent factors at Time 2) fit worse.
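The logic of such a cross-lagged test can be sketched in a few lines. The sketch below is not Watkins et al.’s code or data: it simulates two waves in which IQ1 causally affects achievement at Time 2 while achievement at Time 1 has no effect on IQ2, and then recovers the two cross-paths by regression on standardized scores.

```python
import numpy as np

# Simulated two-wave data: IQ1 -> ach2 path exists, ach1 -> IQ2 path does not.
rng = np.random.default_rng(0)
n = 2000

iq1 = rng.standard_normal(n)
ach1 = 0.6 * iq1 + 0.8 * rng.standard_normal(n)
iq2 = 0.7 * iq1 + 0.71 * rng.standard_normal(n)              # no ach1 -> IQ2 path
ach2 = 0.3 * iq1 + 0.5 * ach1 + 0.7 * rng.standard_normal(n)

def std_betas(y, X):
    """OLS on z-scored variables gives standardized path coefficients."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    z = (y - y.mean()) / y.std()
    return np.linalg.lstsq(Z, z, rcond=None)[0]

X1 = np.column_stack([iq1, ach1])
paths_to_iq2 = std_betas(iq2, X1)    # [IQ1 -> IQ2, ach1 -> IQ2]
paths_to_ach2 = std_betas(ach2, X1)  # [IQ1 -> ach2, ach1 -> ach2]

# The IQ1 -> ach2 cross-path is substantial while ach1 -> IQ2 is near zero,
# the same asymmetric pattern as in the best-fitting model described above.
print(paths_to_iq2.round(2), paths_to_ach2.round(2))
```

A full SEM adds latent factors and fit statistics on top of this, but the asymmetry of the two cross-lagged coefficients is the core of the causal argument.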

Previous studies of the same nature, although rare, are consistent with Watkins (**2007**). Jensen (**1980**, pp. 241, 324, 336-337) cited a study by Crano (1972), although its critics, summarized by Watkins, notably faulted that study for using sub-optimal methods. The other study Jensen cited was a path-coefficient analysis by Li (1975), shown below:

If father’s education and occupation each have a direct coefficient of 0.20, the two variables together explain 0.20²+0.20²=0.08, or 8%, of the variation in childhood IQ (treating the two predictors as uncorrelated). Moreover, given that the coefficient from childhood IQ to childhood education is 0.44, and the coefficient from childhood education to adult IQ is 0.25, the indirect effect of education on adult IQ would be estimated at 0.44×0.25=0.11; since this is a path coefficient, it corresponds to only about 0.11²≈1.2% of the variance. Apparently, education was not a strong causal determinant.
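The arithmetic of Li’s path diagram can be checked directly. A minimal sketch, assuming standardized paths and (as in the text) uncorrelated exogenous variables:

```python
# Direct paths of .20 each from father's education and occupation:
# squared standardized paths sum to the variance explained in childhood IQ.
direct_var = 0.20**2 + 0.20**2

# Chained paths childhood IQ -> childhood education -> adult IQ:
# the indirect effect is the product of the coefficients, and its square
# (not the product itself) is the share of variance.
indirect_effect = 0.44 * 0.25
indirect_var = indirect_effect**2

print(round(direct_var, 4))      # 0.08 -> 8% of variance
print(round(indirect_effect, 4)) # 0.11, a path coefficient
print(round(indirect_var, 4))    # 0.0121 -> about 1.2% of variance
```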

Consistent with the idea that SES is not of great importance, Brodnick & Ree (**1995**) investigated the structural relations among g (SATm, SATv, ACT), SES (income, family size, parents’ age), and academic performance (AP, various GPAs) by way of CFAs. Among these three latent variables, they found that SES had no explanatory power. The model that best explained the data was a g+AP model with no direct path connecting the three observed SES variables, or even the common SES latent factor, to the latent AP or its observed indicators. g has a path of 0.672 to AP, thus accounting for 0.672², or 45.2%, of the variance in AP, with D being the unexplained variance of AP, 0.740² or 54.8% (i.e., 45.2%+54.8%=100%). Because the g-only model shows no fit decrement compared with the g+SES model, a separate SES latent variable adds little once g has been incorporated. That the authors used the SAT/ACT to derive a latent g might be questionable to some, but Frey & Detterman (**2004**), Koenig et al. (**2008**), and Coyle & Pillow (**2008**) confirmed that the SAT and ACT are good proxies for g.
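The “no fit decrement” argument is a nested-model chi-square difference test. The sketch below illustrates that logic with invented fit values (not Brodnick & Ree’s actual statistics), plus a sanity check on the reported variance decomposition for AP:

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function; closed form valid for even df."""
    assert df % 2 == 0
    half = x / 2.0
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))

# Hypothetical fit statistics: freeing 2 SES paths barely lowers chi-square.
chisq_g, df_g = 52.0, 32          # g-only model
chisq_g_ses, df_g_ses = 48.5, 30  # g + SES model

delta = chisq_g - chisq_g_ses        # 3.5
p = chi2_sf(delta, df_g - df_g_ses)  # df = 2
print(round(p, 3))  # p > .05: no significant improvement, keep the g-only model

# Variance decomposition cited above: explained + unexplained should be ~1.
explained = 0.672**2    # ~0.452
unexplained = 0.740**2  # ~0.548
print(round(explained + unexplained, 3))
```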

This result may be explained by statistical artifacts; specifically, family size and parental age are inaccurate measures of SES, hence the absence of any increment in model fit when latent SES is added. In the **NLSY97** and NLSY79 this could not be replicated, but the independent effect of adolescents’ intelligence on outcomes was stronger than the independent effect of parental SES.

Cross-lagged analyses have also been applied at the national level by Rindermann (**2008a**, **2008b**, **2011**, **2012**). Earlier IQ levels predict better economic performance measured at a later time, although the beta coefficients were usually not very high. The coefficients from early GDP to later national IQ are significant as well, suggesting that the causality runs both ways. On the other hand, early levels of education predict later GDP, whereas early GDP only weakly predicts later levels of education.

Generally, while the causal status of IQ seems well established, the idea of general intelligence, the g factor, as a causal entity behind mental abilities remains uncertain (see van der Maas et al., **2006**). Nevertheless, in a twin study, Shikishima et al. (**2009**) tested two competing models of the causal role of the g factor: the independent-pathway versus the common-pathway model. In the first, genetic and environmental factors influence specific cognitive abilities through direct paths. In the second, genetic and environmental factors influence specific cognitive abilities through a higher-order construct, namely the g factor. The analysis showed that the (AE) common-pathway model (having the smallest AIC values) fitted their data best, supporting a causal role for g. This counters claims made by scientists like Stephen Jay Gould, who regarded g as a mere statistical artifact.
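The AIC comparison works as follows; the log-likelihoods and parameter counts below are invented for illustration, only the rule (AIC = 2k − 2 ln L, smaller is better) is real:

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: penalizes each extra free parameter by 2."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fits: the independent-pathway model frees more paths and fits
# slightly better in raw likelihood, but pays a larger complexity penalty.
aic_independent = aic(-1532.0, 24)  # independent-pathway model
aic_common = aic(-1534.0, 18)       # common-pathway model

# The common-pathway model wins: its parsimony offsets the likelihood loss.
print(aic_independent, aic_common)
```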

McKay & McDaniel (**2006**, p. 546) as well as te Nijenhuis & van der Flier (**2005**) show, via the method of correlated vectors, that the cognitive loading of the work performed correlated with racial differences. The higher the complexity, the wider the gap between whites and blacks, or between the majority and minority ethnic groups. This is entirely consistent with the Spearman-Jensen theory that group differences increase with g (Jensen, **1998**, pp. 377-378), and it shows that the importance of g is not confined to within-group differences.
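The method of correlated vectors itself is simple: correlate the subtests’ g-loadings with the standardized group differences (d-values) on the same subtests. A minimal sketch with invented vectors:

```python
import math

# Invented g-loadings and per-subtest group differences, for illustration only.
g_loadings = [0.45, 0.55, 0.62, 0.70, 0.78, 0.84]
group_d    = [0.30, 0.42, 0.55, 0.60, 0.75, 0.80]

def pearson_r(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A strongly positive vector correlation is the Spearman-Jensen prediction:
# the more g-loaded the subtest, the larger the group difference on it.
r = pearson_r(g_loadings, group_d)
print(round(r, 2))
```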

Jensen (**1998**, pp. 276-277, 280, 283, 286) reports that the g-loadings of the subtests of cognitive batteries correlate with those same subtests’ predictive validity coefficients (i.e., correlations with measures of social outcomes). Likewise, in the NLSY97 data, the g-loadings of the ASVAB-1999 subtests correlate with the subtests’ correlations with school grades (GPA) (Hu, **Sept.21.2013**). This held for blacks, hispanics, and whites. Controlling for both parental education and income reduced this correlation only slightly.

However, while past research showed the practical role of g to be preponderant (e.g., Ree & Earles, **1991**, **1994**), it is possible that the importance of specific latent abilities has been underestimated. Reeve (**2004**) argues that previously applied methods probably did not perfectly isolate the specific latent factors of intelligence from the general latent factor, because of variance attributable to random measurement error: “Often studies of cognitive abilities have relied on the observed subscales of a test (as defined at the discretion of test constructor) as a construct-valid surrogate for narrow ability constructs (e.g., Hunter, 1986; Thorndike, 1991). … studies of the validities of narrow abilities often estimate these factors with the variance due to g included. Thus, the correlations with outcomes reflect both the variance due to g as well as the unique variance due to the specific factor.” (p. 625). He therefore proposes that latent-variable models, via structural equation modeling (SEM), assess specific abilities more precisely at the latent level (i.e., latent (unobserved) factors or constructs meant to represent the variance shared by the included observed (measured) variables). The conclusion of his analysis (pp. 635-639) is that the general factor (g) of intelligence is the main predictor of general knowledge, that the specific latent factor adds no incremental variance to the general knowledge latent factor, and that the specific factors correlate with some specific knowledge domains even though some of these structural path coefficients are close to zero.
Despite the importance of specific cognitive abilities for specific knowledge criteria, the author cautions: “Although narrow abilities may be important predictors of some narrow criterion factors, this does not necessarily indicate that these criterion factors are practically meaningful. Indeed, the general knowledge factor, which was predicted solely by g, accounts for the majority of the total variance underlying the criterion construct space. Thus, the narrow abilities, despite their evident psychological significance, may only have practical significance in situations where the range of general ability has been restricted (e.g., among a group of doctoral graduate students, where prescreening produces significant range restriction in g).” (pp. 639-640). Finally, one limitation of the study is that several of the knowledge tests had low reliability, which may have attenuated the relationship between the specific knowledge factors and the specific abilities.

Reeve (**2004**) and Coyle (**2013**) noted that, insofar as non-g factors (or residuals) can predict narrow or specific abilities, this pattern is not inconsistent with Cattell’s investment theory, which predicts that non-g abilities reflect specific abilities that contribute to other specific abilities. Investing in abilities in one domain comes at the expense of other domains, thereby producing negative correlations between those abilities and other, competing abilities.

Consequently, the contribution of non-g abilities beyond g is suspected to be negligible, and this has been confirmed once more. From another angle, the importance of measurement error was considered in the SEM study by Brown et al. (**2006**):

As noted by Schmidt et al. (1981), theory-driven research should examine validities at the true score or construct level. The true-score level refers to the relationship among the constructs free from measurement error and other statistical biases. Examining validities calculated on imperfect measures often produces an inaccurate picture of the relative importance of the abilities themselves. This occurs because partialling out imperfect measures does not fully partial out the effects of underlying constructs (Schmidt et al., 1981). To illustrate, suppose that Ability A is a cause of training performance but Ability B is not. Suppose further that the tests assessing these abilities have reliabilities of .80 and are positively correlated (as occurs with all mental ability tests). Because Ability B is correlated with Ability A, Ability B will show a substantial validity for training performance. Moreover, because Ability A is not measured with perfect reliability, partialling it from Ability B in a regression analysis would not partial out all of the variance attributable to Ability A. Thus, the measure of Ability B will receive a substantial regression weight when in fact the construct-level regression weight is zero. That is, Ability B will predict training performance even though it is not a true underlying cause of training performance. In this case, Ability B will appear to increment validity over Ability A only because of the presence of measurement error (see also Schmidt & Hunter, 1996).
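The scenario in the quote above can be simulated directly. In the sketch below (invented effect sizes, reliabilities of .80 as in the quote), Ability B has zero construct-level effect on performance, yet its fallible measure earns a clearly nonzero regression weight because the fallible measure of A cannot fully partial A out:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

ability_a = rng.standard_normal(n)
ability_b = 0.6 * ability_a + 0.8 * rng.standard_normal(n)  # r(A,B)=.6, causally inert
performance = 0.5 * ability_a + rng.standard_normal(n)       # only A matters

# Observed scores: true score (variance 1) + error (variance .25)
# gives reliability 1/1.25 = .80, as in the quoted example.
test_a = ability_a + 0.5 * rng.standard_normal(n)
test_b = ability_b + 0.5 * rng.standard_normal(n)

def ols_betas(y, X):
    """OLS slopes (intercept dropped) via least squares."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0][1:]

beta_true = ols_betas(performance, np.column_stack([ability_a, ability_b]))
beta_obs = ols_betas(performance, np.column_stack([test_a, test_b]))

# beta_true[1] is ~0 (B is inert at the construct level);
# beta_obs[1] is clearly positive, purely an artifact of measurement error.
print(beta_true.round(3), beta_obs.round(3))
```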

Three levels of abilities are distinguished: specific aptitudes (6 ASVAB subtests), general aptitudes (3 common latent factors drawn from the 6 subtests), and GMA (the equally weighted sum of the 3 general aptitude scores). The authors test the hypothesis that specific mental abilities add to the predictive validity for job performance over and above g or GMA. Specific aptitude theory predicts that the model in their Figure 2 fits better than the model in Figure 1, whereas g theory does not predict a better fit for the Figure 2 model. In fact, including either specific or general aptitudes does not improve prediction beyond the g factor. Moreover, as predicted, g shows higher validity for training and learning programs of greater complexity.

Among the critiques of IQ, one finds a predictive-validity-ceiling theory known as the “threshold hypothesis”: IQ correlates with real, relevant economic outcomes, but beyond a certain IQ level (e.g., 120), still higher levels bring no further benefit. Jensen (**1980**, pp. 318-319; **1998**, pp. 289-290) pointed out repeatedly that this is erroneous, given that academic achievement increases linearly with IQ at all IQ levels. The same holds for SAT scores correlated with GPA (grade point average). A graphical illustration appears below:

Even though critics generally invoke the idea that the correlation between IQ and measures of social success is due only to external confounding factors, such as family influences (c2), the problem comes when these theories are tested. Murray (**1998**) showed in a sibling study, comparing biologically related persons, that the sibling with the higher IQ ends up at higher socio-economic levels. Since the subjects were siblings from the same families, differences in family styles and economic levels cannot explain this relation. In parallel, the finding that socio-economic status (SES) has no considerable influence on the correlation between IQ and scholastic tests is further supported by partial correlations (controlling for SES), which reduce the correlation by only a trivial amount (Jensen, **1980**, p. 336). Also, IQ correlates more strongly with SES attained in adulthood than with parental SES (Jensen, **1973**, p. 236; **1998**, p. 384). This suggests that parents’ prior SES is not really the most decisive factor and that IQ seems to determine SES, not the reverse. This is entirely consistent with the French adoption study by Capron & Duyme (**1989**, **1996**), reanalyzed by Jensen (**1997**; Hu, **Feb.19.2013**), which shows that the IQ gain from the adoption of abused children by very affluent families does not affect general ability, the g factor, but probably only specific abilities. A similar conclusion can be derived from the causal analysis by Rice et al. (**1988**), which reports that family environmental influences on children’s IQ, measured by the HOME index, show direct and indirect environmental effects at age 1, then direct environmental and indirect genetic effects through parental mediation at age 2, then indirect genetic effects at ages 3 and 4. This could suggest that environmental effects are more potent in early childhood than later in childhood. All of this should generally be weighed against the fact that educational interventions generally **fail** to affect general abilities durably (Ritchie & Bates, **2013**).
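The SES-control result rests on the first-order partial correlation. A minimal sketch with illustrative correlations (not Jensen’s actual values):

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """Correlation between x and y with z partialled out of both."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

r_iq_ach = 0.60   # IQ x achievement (illustrative)
r_iq_ses = 0.30   # IQ x parental SES (illustrative)
r_ach_ses = 0.30  # achievement x parental SES (illustrative)

# Controlling SES barely moves the IQ-achievement correlation,
# because SES shares only a modest amount of variance with each.
r_partial = partial_r(r_iq_ach, r_iq_ses, r_ach_ses)
print(round(r_partial, 2))  # 0.56, barely below the zero-order 0.60
```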

Also, Gottfredson (**1997**) provided indirect indications that g is a causal entity, such as the fact that training does not improve general job productivity, insofar as g already accounts for a very large share of the explained variance. This would be because training raises specific skills, whereas g is supposed to be a general ability factor. The gains accrued through training do not generalize to other work tasks, which is also consistent with the repeated failures of working-memory training experiments to generalize and “transfer” IQ gains to other cognitive tests dissimilar in content (Hu, **Nov.1.2012**). As Murray (**2005**) explained:

Suppose you have a friend who is a much better athlete than you, possessing better depth perception, hand-eye coordination, strength, and agility. Both of you try high-jumping for the first time, and your friend beats you. You practice for two weeks; your friend doesn’t. You have another contest and you beat your friend. But if tomorrow you were both to go out together and try tennis for the first time, your friend would beat you, just as your friend would beat you in high-jumping if he practiced as much as you did.

Gottfredson emphasized (**1997**, pp. 86, 91-92, 108) that it is extremely difficult to reduce individual differences in performance through practice; these differences may even widen, insofar as intelligent individuals learn faster even given identical exposure and instruction. This is consistent with the idea of a causal g, but not with a g mediated by external factors.

Now, concerning the possible mechanisms behind the IQ-achievement correlation: despite the significant correlations between information-processing speed and IQ (Jensen, **1998**, pp. 234-238; **2006**, pp. 160, 171-172, 195; Grudnik & Kranzler, **2001**; Sheppard & Vernon, **2008**), the fact that this correlation is mediated almost entirely by genetic factors (Jensen, **1998**, p. 233; **2006**, pp. 130-131), and the fact that elementary cognitive tests (ECTs) are less modifiable by cultural and personality factors (Rindermann & Neubauer, **2001**; Jensen, **2006**, pp. 175-178), the causal mediation remained to be determined. First, Luo & Petrill (**1999**) report that psychometric g (drawn from conventional IQ tests) and chronometric g (drawn from ECTs) are similar, in the sense that the intrinsic nature of g does not change when ECTs are added on top of traditional psychometric IQ tests. It also appears that the memory-processing component contributes to g. Subsequently, Luo et al. (**2003a**) found that the group (i.e., specific) factors of the CAT, a test battery meant to measure information-processing speed, were not important predictors of academic performance, as measured by the MAT battery, whereas the g factor derived from the CAT (CAT-g) affects academic performance essentially through genetic mediation (“genetic paths”), as attested by the larger changes in the chi-square values used to assess model fit (Table 8). Luo et al. (**2003b**) show that the correlation of the psychometric g factor, WISC-g, with school performance was in fact mediated by information-processing speed, represented by CAT G, the general factor of the CAT.
Moreover, there may be another mediator, such as memory processing, insofar as it loads strongly on the mental-speed component. This is fully in line with Jensen’s mental speed theory (**2006**, pp. 212, 216-217, 224, 226-227), which also predicts a synergistic interaction between memory and speed. The reason would be that fast information processing allows more of the information presented to be consolidated into long-term memory (LTM); the concept of consolidated versus unconsolidated IQ gains was spelled out by Jensen (**1973**, pp. 79-97).

Consistent with Dasen Luo’s conclusions, Rindermann & Neubauer (**2004**, pp. 581-586) showed that the direct effect of information-processing speed on scholastic performance was significant though small compared with its indirect effect (mediated by intelligence and creativity). Similarly, Rohde & Thompson (**2007**) showed that the specific cognitive abilities of processing speed, working memory, and spatial ability act indirectly on scholastic performance (measured by GPA, SAT-verbal, SAT-math, and the WRAT-III) via intelligence (measured by the Raven and a vocabulary test). When specific cognitive abilities were controlled, general ability still added explained variance in academic performance; conversely, when measures of general intelligence were controlled, specific abilities added no further explanatory variance for the WRAT and GPA, but did add to the prediction of the SAT. That said, the authors note that these specific abilities can influence academic performance beyond general intelligence. Another study, by Vock et al. (**2011**), reports that mental speed has only an indirect effect on academic performance, through reasoning and divergent thinking (an index of creativity), whereas short-term memory had both direct and indirect influences (via reasoning and divergent thinking). Interestingly, the indirect effect of mental speed diminishes when short-term memory is included in the model, which is explained by the common variance the two share.
The authors nevertheless point out, regarding short-term memory, that only working memory, as a form of “active storage” (as opposed to “passive storage”), is a good predictor of general cognitive ability and school performance. Furthermore, the relationship is not suspected of being biased by “speeded” test administrations, as shown by Preckel et al. (**2011**), who found that divergent thinking and reasoning ability were mediated by mental speed. In the end, the only exception comes from Dodonova & Dodonov (**2013**), who failed to replicate the earlier results: they found that information processing and intelligence each made unique (independent) contributions to school performance.

The importance of processing speed as a mediator of the general factor (g) of intelligence was further evidenced by Coyle et al. (**2011**) in a very large sample of adolescents from the NLSY97, since the direct effect (path coefficient) of age on the development of g was not significant. Instead, improvement in processing speed is associated with increases in g, and the effect of age on g (i.e., intelligence increases over development) was almost totally mediated by processing speed (note: their Model 4, below, shows the best fit indices to the data).

Although processing speed strongly mediated the development of g in our study, it may have done so indirectly through cognitive processes such as encoding, inhibition, or retrieval. These other cognitive processes have been found to be related to each other and to processing speed (Carroll, 1993). However, processing speed is assumed to broadly constrain cognitive development by limiting the speed of all cognitive processes (e.g., Kail, 1991). Thus, processing speed would also be expected to constrain the development of g, which reflects all cognitive processes (cf. Jensen, 2006; Kail, 2000).

Our findings are consistent with theories emphasizing the role of processing speed in children’s cognitive development (Jensen, 2006; Kail, 1991, 2000). These theories assume that increases in processing speed contribute to global improvements in cognition, which are observed as increases in g. Such improvements in cognition and g have been attributed to neural changes including increases in nerve conduction velocity, neuronal oscillations, and brain myelination (Jensen, 2011; Miller, 1994).

A similar result had been produced before. Nettelbeck (**2010**) replicated the structural equation analysis of Kail (**2007**), in which the best-fitting model is one where age causes improvements in information processing, which in turn cause improvements in working memory capacity, which in turn cause better reasoning ability (e.g., as measured by the Raven or Cattell test). Also, in Nettelbeck’s study sample, older adults showed a decline in reasoning ability due to slower information processing, but also due to age-related factors affecting working memory capacity independently of processing speed.
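In such a cascade model, the indirect effect of age on reasoning is the product of the chained standardized paths. A minimal sketch with invented coefficients:

```python
# Cascade: age -> processing speed -> working memory -> reasoning.
# All coefficients are invented, standardized path values for illustration.
path_age_speed = 0.55
path_speed_wm = 0.50
path_wm_reasoning = 0.60

# With no direct age -> reasoning path, the total effect of age on reasoning
# equals this product of the mediating paths.
indirect = path_age_speed * path_speed_wm * path_wm_reasoning
print(round(indirect, 3))  # 0.165
```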

A developmental account of this relation from childhood to adolescence was provided by Demetriou et al. (**2013a**, pp. 40, 42, 44; see also Coyle, **2013**; Kail, **2013**; Demetriou et al., **2013b**, **2013c**) via SEM analyses establishing that processing speed predicts fluid intelligence (gf) much better than working memory during transitional periods, when new mental abilities are being created, whereas working memory predicts gf better than processing speed during so-called stable periods, when existing abilities are consolidated and become more strongly interconnected. Both variables matter because processing speed lets individuals better manage the flow of information during problem solving, while working memory lets individuals represent and process more units of information at once. In general, the description of their analyses (for the study concerning the middle-childhood phase) can be summarized as follows:

In the model fit on the longitudinal data, we used the speed and the working memory scores of the first testing wave and the gf scores of the second testing wave. That is, in the first group, speed and WM at 6, 7, or 8 are used to predict gf at 7, 8, and 9 years. In the second group, speed and working memory at 9 or 10 are used to predict gf at 10 or 11 years. To test the assumption that the structure of abilities does not vary with time but their relations might vary, we constrained all relations between measures and factors to be equal across the two groups and we let the structural relations vary freely. It is recognized that the relatively small number of participants in the two age blocks compared may weaken the statistical power of structural relations. To compensate for this problem the number of measurement in these models was kept to a minimum.

The fit of this model was excellent (see Table 1). The relations between constructs are patterned as expected. In the younger age group, the relations between age and speed (−.71), and gf (.48) were significant and much higher than in the older age group (−.14 and 0, respectively). However, the relations between working memory and speed and working memory and gf were much higher in the older (−.44 and .64) than in the younger age group (−.28 and .49, respectively). Thus, it seemed that in the first phase speed reflected age changes because in this phase children became extensively faster in processing and relatively better in gf. In the next phase, gf changes converged increasingly with working memory, reflecting an across the board expansion of thought towards the capacity indexed by WM.

Ceci a des implications certaines pour la loi des rendements décroissants de l’âge de Spearman (SLODR-age, Spearman’s law of diminishing returns for age) dans la mesure où l’intelligence générale (g) “results in differentiation of cognitive abilities because excessive g allows for investment into domain-specific learning, thereby fostering domain autonomy” (p. 38) ce qui tend à converger vers la théorie CD-IE de Woodley (**2011**). La discussion générale de Demetriou vaut également la peine d’être citée :

Our findings about structure confirmed both the SLODRage prediction that cognitive processes differentiate from each other (prediction 4a) and the developmental prediction that they become increasingly coordinated with each other (prediction 4b). Differentiation was suggested by the fact that, with age, different types of mental processes are expressed through process-specific factors rather than through a more inclusive representational factor. These process specific factors tended to relate increasingly with a general factor at a subsequent phase, reflecting an integration of previously differentiated processes. This concurrent differentiation/integration of cognitive processes necessitates a redefinition of the nature of cognitive development. Our findings suggested that intellectual power increases with development because cognitive processes and reasoning evolve through several cycles of differentiation and integration where relations are dynamic and bidirectional. According to the present and other research (Demetriou, 2000; Demetriou & Kazi, 2001, 2006; Flavell, Green, & Flavell, 1995), differentiation goes from general cognitive functions to specific cognitive processes and mental operations. Integration follows the trend, focusing on increasingly specific processes and operations. Differentiation of cognitive processes allows their control because they may be individually regulated according to a goal-relevant plan. Integration of mental operations generates content-free inferential schemes that can be brought to bear on truth and validity. In both differential and developmental theories, the differentiation/integration process always applies on inferential processes. Moreover, in developmental theory, the state of their coordination frames the functioning of all other cognitive processes, such as language, mental imagery, and memory, imposing a stage-specific overall worldview (Piaget, 1970).

Au niveau biologique, maintenant, Penke et al. (**2012**) ont conduit des analyses de type SEM pour tester les possibles voies causales de trois différents indicateurs d’intégrité des tractus de matière (substance) blanche vers l’intelligence générale (g). Ils ont découvert que cette voie causale était entièrement médiée par la vitesse de traitement de l’information. Ils ont fait cette hypothèse dans la mesure où l’efficacité du traitement de l’information entre régions distales du cerveau est pensée être liée à l’intégrité de l’interconnexion des faisceaux (ou tractus) de matière blanche. Jung & Haier (**2007**) avaient également insisté sur la connectivité fonctionnelle du cerveau pour expliquer les bases neurales de l’intelligence.

Ceci dit, les tests de causalité QI-réussite étant les plus nécessaires, Watkins et al. (**2007**) ont conduit ce genre de tests en utilisant des méthodes à équations structurelles pour estimer le sens de causalité des facteurs latents des tests QI et tests d’aptitude (scolaire) entre deux périodes de temps, en somme, une analyse transversale décalée (cross-lagged path analysis). Les coefficients de voies structurelles indiquent que les facteurs latents de QI évalués à Temps #1 ont des voies directes sur tous les facteurs latent QI et d’aptitude scolaire évalués à Temps #2, tandis que les facteurs latents d’aptitude scolaire à Temps #1 avaient des voies directes sur les facteurs d’aptitude scolaire à Temps #2. Les auteurs concluent alors que le QI montre une relation causale avec l’aptitude scolaire mais que l’inverse ne tenait pas.

Earlier studies of the same kind, though rare, are consistent with Watkins et al. (**2007**). Jensen (**1980**, pp. 241, 324, 336-337) cites a study by Crano (1972), although it was later criticized by Watkins for suboptimal methods, as well as a causal path diagram from Li (1975), shown below:

If the father's education and the father's occupational status each have a direct path coefficient of 0.20, then together they account for 0.20^2+0.20^2=0.08, i.e., 8% of the variance in the child's IQ. Likewise, since the path from child IQ to childhood education is 0.44, and the path from childhood education to adult IQ is 0.25, the indirect effect of education on adult IQ would be 0.44*0.25=0.11 as a path coefficient, which, squared, amounts to only about 1% of the variance. Here again, it would seem that education has little causal influence on IQ.
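
The path arithmetic can be checked in a few lines (variable names are mine); note that 0.11 is a path coefficient, so its variance contribution is its square:

```python
# Variance accounted for by the direct paths in Li's (1975) diagram, as read above:
father_edu, father_occ = 0.20, 0.20
direct_var = father_edu**2 + father_occ**2   # 0.08, i.e., 8% of child-IQ variance

# Indirect effect of childhood education on adult IQ: the product of the
# two path coefficients along the chain, per standard path-tracing rules.
child_iq_to_edu, edu_to_adult_iq = 0.44, 0.25
indirect = child_iq_to_edu * edu_to_adult_iq  # 0.11, as a path coefficient
indirect_var = indirect**2                    # ~0.012, about 1% of variance

print(round(direct_var, 3), round(indirect, 3), round(indirect_var, 4))
```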

Concerning again the contribution of SES to achievement, Brodnick & Ree (**1995**) studied the structural relations between g (SATm, SATv, ACT), socioeconomic status (income, family size, parents' age), and academic performance (AP, various GPAs) using confirmatory factor analyses (CFAs). Among these three latent variables, they found that SES had no substantial explanatory power. The best-fitting model was a g+AP model in which there was no direct path connecting the three observed SES variables, or even the common latent SES factor, to latent AP or its observed indicators. Latent g had a path of 0.672 to AP, thus accounting for 0.672², or 45.2%, of the variance in the latent AP variable, with D being the disturbance, i.e., the variance in AP left unexplained, equal to 0.740², or 54.8% (i.e., 45.2%+54.8%=100%). The fact that the g-only model did not fit worse than the g+SES model means that adding the latent SES variable is of little use once g has been incorporated. One might doubt the validity of the SAT/ACT tests for deriving a latent g, but Frey & Detterman (**2004**), Koenig et al. (**2008**), and Coyle & Pillow (**2008**) confirmed that the SAT or ACT is a good proxy for g.

Cross-lagged analyses have also been carried out at the level of national IQ by Rindermann (**2008a**, **2008b**, **2011**, **2012**). IQ levels predict better future economic performance, as attested by positive and significant beta coefficients, although these are not very large. The causal (beta) coefficients leading from prior GDP to later national IQ are also significant, suggesting reciprocal causal effects between the two variables. On the other hand, prior levels of education predict later GDP, whereas prior GDP only weakly predicts future education levels.

While the causal role of IQ seems well established, the idea of the general factor of intelligence, the g factor, as a causal entity underlying mental abilities is more uncertain (van der Maas et al., **2006**). Nevertheless, in a twin study, Shikishima et al. (**2009**) tested two competing models of the causal role of the g factor: the independent pathway model and the common pathway model. In the former, genetic and environmental factors influence specific cognitive abilities directly. In the latter, genetic and environmental factors influence specific cognitive abilities through a higher-order construct, the g factor. The data revealed that the common pathway model fit best, from which it was concluded that the causal role of g is well supported, contrary to the claims of some scientists such as Stephen Jay Gould.

To begin, the never-ending NLSY79 syntax can be found **here**, and the never-ending NLSY97 syntax **here**. My (**EXCEL**) file contains the never-ending list of calculations and results described and explained in the following paragraphs. I do not reproduce any screenshots here, since there were too many numbers everywhere. First of all, it seems that the use of (sampling) weights may affect the magnitude of the racial gaps and, to a lesser extent, the magnitude of the correlations. So I calculated all the gaps with and without weights, but I did not apply the weights for the subtest intercorrelations and factor analyses, since NLSinfo does not (**apparently**) recommend this for correlational analyses but does recommend it for tabulating the characteristics of a given population (e.g., means, totals, proportions). In any case, I take into account the effect of age, sex, and parental SES (i.e., parental income and years of education) on the ASVAB subtests. Therefore, I produced age-regressed(-out) ASVAB subtests, age/gender-regressed ASVAB subtests, age/SES-regressed ASVAB subtests, and age/gender/SES-regressed ASVAB subtests.
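
"Regressing out" a covariate here means replacing each subtest score with its residual from a linear regression on that covariate (or set of covariates). A minimal numpy sketch of the idea (function and variable names are mine):

```python
import numpy as np

def regress_out(scores, covariates):
    """Return scores with the linear effect of the covariates removed.

    scores: (n,) array of subtest scores; covariates: (n, k) array
    (e.g., columns for age, sex, parental income, parental education).
    """
    X = np.column_stack([np.ones(len(scores)), covariates])
    beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return scores - X @ beta  # residuals: the "regressed-out" scores

# Toy check: after regressing out a covariate, the residualized scores
# no longer correlate with it.
rng = np.random.default_rng(1)
age = rng.normal(size=1000)
subtest = 0.5 * age + rng.normal(size=1000)
resid = regress_out(subtest, age[:, None])
print(round(float(np.corrcoef(resid, age)[0, 1]), 6))  # ~0
```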

I will start with some anomalies. Concerning the Jensen effect in black-white (BW) differences, the correlation with g is about 0.30 in the NLSY97 and 0.40 in the NLSY79. For the latter, however, if we regress out age and gender simultaneously, the correlations fall to around 0.00-0.10. Undoubtedly, MGCFA and IRT techniques are needed to investigate the question of bias with regard to gender and/or race at the subtest/item level. Yet I have no explanation for why there is such a sex effect in the BW comparison only, and especially why there is nothing like it in the NLSY97.

By way of demonstration, I reproduce the column numbers here. Despite the almost perfect correlations of the BW gap (0.9582), Black g (0.9969), White g (0.9931), and BW g (0.9947) vectors with one another (i.e., "reliability"), we see that the g correlation with the (d) gap is about 0.30 when the sex variable is not regressed out, whereas doing so yields an r(g*d) of only 0.05. More annoying is what appears when we look at the individual numbers in each column. They were nearly all the same, the only exception being the Auto/Shop Information subtest, for which the BW gap deviates by 0.2. This is exactly the same kind of problem we saw earlier in the **meta-analysis** of the Jensen effect in the heritability and environmentality of cognitive (sub)tests. I will repeat it here: 10 subtests is far too few for a readily interpretable MCV test. This is even more problematic given the high reliability of the g-loading and group-difference (d) vectors, 0.86 and 0.78 respectively (Jensen, **1998**, p. 383). Given such high vector reliabilities, correcting for vector (un)reliability and deviation from perfect construct validity accomplishes little; such corrections have more effect as the observed correlation moves away from zero. Here's the difference:

(0.30/SQRT(0.86*0.78))/0.90=0.407

(0.05/SQRT(0.86*0.78))/0.90=0.068

Also, the above picture shows a very narrow distribution of g-loadings (SD = 0.090). If we assume an SDg of 0.128 as the population value (te Nijenhuis, **2007**, p. 288), we get 0.090/0.128=0.703, which finally yields 0.407/0.703=0.579. The correlation has nearly doubled compared with the initial correlation of just 0.300. Obviously, the impact of these artifacts is substantial and must be taken into account whenever possible.
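
The corrections above, including the final range-restriction step, can be wrapped in a small function; this is a sketch reproducing the arithmetic shown in the text (not a full Hunter-Schmidt workup), with default values taken from the figures above:

```python
from math import sqrt

def corrected_mcv(r_obs, rel_g=0.86, rel_d=0.78, construct_validity=0.90,
                  sd_g_obs=0.090, sd_g_pop=0.128):
    """Apply the artifact corrections used above to an observed r(g*d):
    divide by the two vector reliabilities, then by the assumed construct
    validity (0.90), then by the range-restriction ratio u = SDg_obs/SDg_pop.
    """
    r = r_obs / sqrt(rel_g * rel_d)   # vector (un)reliability
    r = r / construct_validity        # deviation from perfect construct validity
    u = sd_g_obs / sd_g_pop           # restriction of the g-loading range
    return r / u

print(round(corrected_mcv(0.30), 3))  # 0.579, matching the text
print(round(corrected_mcv(0.05), 3))  # the near-zero r stays small
```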

Anyway, the heritability vector's reliability, in contrast, is certainly much lower, so there is room to improve it (e.g., by using samples much larger than a few hundred). In the present case, however, this is difficult because the vector correlations are already very close to unity. The only way to improve MCV is to use IQ batteries with far more than 10 subtests, which is extremely rare. In any case, we should be skeptical about any sex effect on r(g*d). It was probably an anomaly.

When looking closely at the above numbers, however, we see that the Auto/Shop Information subtest had one of the smallest g-loadings but also one of the largest black-white differences. After removing it, the correlation jumps drastically, from about +0.100 to +0.500. This was true in the NLSY97 as well. This subtest (split into two variables in the NLSY97) alone was a strong moderator of the magnitude of r(d*g).

An MGCFA test is needed to see whether or not this subtest is biased and therefore whether it should be removed. As Dragt's (**2010**) meta-analysis clearly shows, biased items/subtests can affect the magnitude of the correlations. Regardless, the **ASVAB website** does mention the following:

Myth: Some individual items on the ASVAB are biased against minorities.

The Truth: The ASVAB testing program routinely conducts statistical analyses of new test items to ensure that individual items are not biased against minorities. Items displaying evidence of bias are excluded from use on the ASVAB. In addition, sensitivity analyses are conducted on new ASVAB items to guard against including items that might be unintentionally viewed as biased against or insensitive toward a particular group. Experts who are trained to recognize item insensitivity review all new items and identify items with questionable content. Such items are either revised or excluded for use on the ASVAB.

Given this, we would not expect the ASVAB to be racially biased. Still, I provided the necessary data (EXCEL) for conducting such an MGCFA test (in **Amos**, for instance) at the subtest level. Any evidence of intercept difference, or intercept bias, would mean that the actual racial gap cannot be entirely attributed to g, the other factor contributing to the difference being the differing levels of difficulty across groups (see Wicherts & Dolan, **2010**, for an illustration). In MGCFA models, this would show up as a substantial decrement in model fit for the intercept (scalar) invariance model relative to the factor loading (metric) invariance model. Both must hold for measurement equivalence to be established.
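
The metric-vs-scalar comparison amounts to a chi-square difference (likelihood-ratio) test on the nested models. A sketch, with fit values that are purely hypothetical (not from any actual ASVAB MGCFA):

```python
from scipy.stats import chi2

def chisq_diff_test(chisq_metric, df_metric, chisq_scalar, df_scalar):
    """Likelihood-ratio test of the scalar (intercept) invariance model
    against the nested metric (loading) invariance model.

    A significant result indicates intercept bias: group differences in
    subtest difficulty beyond what the common factor(s) explain.
    """
    d_chisq = chisq_scalar - chisq_metric
    d_df = df_scalar - df_metric
    return d_chisq, d_df, chi2.sf(d_chisq, d_df)

# Hypothetical fit statistics, for illustration only.
d, df, p = chisq_diff_test(chisq_metric=210.4, df_metric=98,
                           chisq_scalar=268.9, df_scalar=107)
print(d, df, round(p, 6))  # a small p here would signal intercept bias
```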

Now, the correlation of the non-g loadings (PAF2) with group differences is hard to interpret. In the NLSY97, the black, hispanic, and white PAF2 vectors show very strong correlations with the racial gaps. In the NLSY79, however, the white PAF2, as well as the hispanic and black PAF2 (when they could be generated), always show very large negative correlations with the race differences.

Anyway, the correlation between g and group differences was unaffected by SES in both the NLSY79 and NLSY97 for the BW gap. Interestingly, the significant correlation of the HW gap with g-loadings vanishes, even becoming negative, in the NLSY97, while remaining positive and significant in the NLSY79 after SES is removed. One curious finding is certainly the BH gap. In the NLSY79, without controlling for SES, the black-hispanic IQ difference shows no relation with g; in fact, the correlations were negative. After removing the influence of SES on all the ASVAB variables, the g*d correlation becomes slightly positive or near zero, at least when weights are not used: when the sampling weight is applied, the initially substantial negative correlation between g-loadings and the BH gap decreases considerably after removing SES, although it remains negative. In the NLSY97, the correlation of the BH gap with g-loadings was about 0.11 or 0.14 without controlling for SES, but increases to 0.21 or 0.24 with SES regressed out of the ASVAB variables.

I also provide data on racial gaps in both the ASVAB-1981 and ASVAB-1999 for G-scores and non-G-scores, with and without controlling for parental SES. One particular feature is the BH (black-hispanic) gap. In both datasets, the gap increases after SES is partialled out. We see the same thing at the subtest level, where the BH gap widens for all subtests when the SES effect is removed. The likely reason for this widening when controlling for parental SES is that hispanic parental education averages 1 or 2 years less than that of blacks, while their family income was about the same. At the same time, while controlling for SES reduces the black-white difference very little, it reduces the hispanic-white difference drastically. This can be compared with Jensen's (**1973**, pp. 306-311) earlier analysis, in which he compared blacks, whites, and mexicans on the PPVT (a caricature of a culture-loaded or biased test) and the Raven (measuring essentially the eduction of relations, the purest form of Spearman's g). When equated on the Raven, the mexicans scored below the blacks, and the blacks below the whites, on the PPVT. Conversely, when equated on PPVT score, blacks scored below whites, and hispanics scored very slightly above whites, on the Raven. Jensen interpreted this finding to mean that the mexican-white IQ difference was entirely due to socio-economic and/or cultural factors, while the black-white IQ difference was due to a mix of genetic and environmental differences. The fact that hispanics were more 'culturally' deprived than blacks while scoring higher on cognitive tests is exactly what I found in both the NLSY79 and NLSY97. This is all the more interesting since the g*d correlations between blacks and whites were not affected by SES, whereas when it comes to hispanics (against either blacks or whites), SES may make a difference.

Now, we also see that the BW (d) difference in g-scores was about 1.60 SD in the NLSY79 and 1.20 SD in the NLSY97, suggesting a substantial decline. But this decline, when studied with the method of correlated vectors, has nothing to do with subtest g-loadings. Indeed, the subtest changes for all comparisons across racial groups (BW gap, HW gap, BH gap) showed substantial negative correlations with g, especially for the BW gap. Generally, the BW and HW (subtest) changes had positive signs, meaning that the gap is closing while not being g-loaded. These negative correlations were even stronger when using Jensen's (**1985**, Table 5) estimates of either the white g-loadings or the black g-loadings of the ASVAB.

When meta-analyzing Jensen's (**1985**, Table 5) collection of data (total N=40850, total harmonic N=14643), the meta-analytic correlation using the white g was 0.829 (11 studies) and using the black g was 0.786 (10 studies), after applying corrections for sampling error, g-loading range restriction, g vector unreliability, BW difference vector unreliability, and deviation from perfect construct validity.
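
At the core of such a meta-analysis, before the artifact corrections, sits a sample-size-weighted mean correlation. A minimal sketch (the study-level values below are hypothetical, not Jensen's):

```python
def weighted_mean_r(rs, ns):
    """Sample-size-weighted mean correlation, the first step of a
    Hunter-Schmidt style meta-analysis (artifact corrections would follow)."""
    total_n = sum(ns)
    return sum(r * n for r, n in zip(rs, ns)) / total_n

# Hypothetical study correlations and sample sizes, for illustration only.
rs = [0.55, 0.62, 0.48]
ns = [1200, 800, 2000]
print(round(weighted_mean_r(rs, ns), 3))  # -> 0.529
```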

Finally, in the NLSY97, I found a GPA variable (overall, english, foreign languages, math, social science, life sciences). The NLS Investigator gives us this short introduction:

Credit weighted overall GPA. This variable indicates grade point averages across all courses on a 4 point grading scale. For each course, the quality grade (TRANS_CRS_GRADE.xxx) is weighted by Carnegie credits (TRANS_CRS_CARNEGIE_CREDIT.xxx). Quality grades were recoded as follows: 1 = 4.3, 2=4.0, 3=3.7, 4=3.3, 5=3.0, 6=2.7, 7=2.3, 8=2.0, 9=1.7, 10=1.3, 11=1.0, 12=0.7, 13=0.0, all other values recoded to missing. Please see Appendix 11 of the Codebook Supplement for more information on the collection and coding of transcript data.
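
The codebook's recoding and weighting scheme can be expressed directly; this is a sketch (function and variable names are mine, not the NLSY97 variable names):

```python
# Recode map from the NLSY97 codebook excerpt above: transcript quality
# grades (1-13) to 4-point-scale values; any other code is treated as missing.
QUALITY_TO_GPA = {1: 4.3, 2: 4.0, 3: 3.7, 4: 3.3, 5: 3.0, 6: 2.7, 7: 2.3,
                  8: 2.0, 9: 1.7, 10: 1.3, 11: 1.0, 12: 0.7, 13: 0.0}

def credit_weighted_gpa(grades, credits):
    """Credit-weighted GPA: each course grade is weighted by its Carnegie
    credits; courses with unrecognized grade codes are dropped as missing."""
    pairs = [(QUALITY_TO_GPA[g], c) for g, c in zip(grades, credits)
             if g in QUALITY_TO_GPA]
    total_credits = sum(c for _, c in pairs)
    if total_credits == 0:
        return None
    return sum(v * c for v, c in pairs) / total_credits

# An A (2 -> 4.0), a B (5 -> 3.0), an F (13 -> 0.0), and one invalid code:
print(credit_weighted_gpa([2, 5, 13, 99], [1.0, 0.5, 1.0, 1.0]))  # -> 2.2
```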

I correlated all of these variables with each ASVAB subtest, yielding a column vector of ASVAB subtest correlations with GPA, which I then correlated with the vector of ASVAB subtest g-loadings for each racial group separately. These correlations were very high (especially for blacks), at about 0.70 to 0.80, with two exceptions. When I regress SES out of the ASVAB variables, the correlation between subtest g-loadings and subtest correlations with GPA decreases somewhat but generally remains at about 0.50 to 0.70, with one exception. Blacks consistently have the highest g-loading*GPA correlations, especially when using Spearman's rho. When calculating the racial d gaps in GPA scores, I noticed that the group differences, as expected, were much lower than in the ASVAB-1999 scores. The BW difference in family income and parents' education was also about 0.5 SD, half the difference in the ASVAB.
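
The vector correlation just described can be sketched as follows, using simulated data in place of the NLSY scores (all names and values here are mine):

```python
import numpy as np

def mcv_with_criterion(subtests, g_loadings, criterion):
    """Method of correlated vectors against an external criterion:
    correlate each subtest with the criterion (here, GPA), then correlate
    that column vector with the subtests' g-loadings.

    subtests: (n, k) score matrix; g_loadings: (k,); criterion: (n,)."""
    k = subtests.shape[1]
    r_with_crit = np.array([np.corrcoef(subtests[:, j], criterion)[0, 1]
                            for j in range(k)])
    return np.corrcoef(r_with_crit, g_loadings)[0, 1]

# Toy data: subtests share a common factor in proportion to their loadings,
# and GPA also loads on that factor, so MCV should return a strong positive r.
rng = np.random.default_rng(2)
loadings = np.array([0.8, 0.7, 0.6, 0.5, 0.4, 0.3])
f = rng.normal(size=2000)
tests = np.outer(f, loadings) + rng.normal(size=(2000, 6)) * np.sqrt(1 - loadings**2)
gpa = 0.5 * f + rng.normal(size=2000)
print(round(mcv_with_criterion(tests, loadings, gpa), 2))
```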

To summarize: SES does not act as a moderator of the correlations between the g factor and black-white differences; there is no certainty that the gap reduction in the ASVAB across cohorts was g-loaded for any group comparison; and there is a strong correlation between g-loadings and subtest-GPA correlations when studied within each racial group separately.
