Why Heritability Estimates are Flawed: A Conceptual Account
Introduction
Heritability estimates have been a cornerstone of psychological and genetic research. They are designed to quantify the proportion of phenotypic variance in a population that can be attributed to genetic differences among individuals. We've known for a while now that heritability isn't a measure of genetic strength (Moore and Shenk, 2016); it's a population-specific estimate of variance. Here I will provide two a priori arguments: one methodological, concerning twins and the equal environments assumption (EEA), and one theoretical, based on Noble's principle of biological relativity. The twin critique shows that the twin researcher's main assumption (equal environments) does not hold, while the biological relativity critique shows that h2 is conceptually invalid. This is also why there is a "missing heritability" problem: the heritability was never there to be found in the first place, because the assumptions twin researchers rely on are false.
The classical twin method
The classical twin method (CTM) compares monozygotic (MZ) and dizygotic (DZ) twins in an attempt to quantify the relative contributions of genes and environment to trait differences between individuals. Perhaps the biggest assumption the twin researcher makes is the equal environments assumption (EEA): that MZ and DZ twins experience equivalent shared environments.
The EEA seems plausible enough: twins reared together should experience comparable environments, regardless of zygosity. But MZ twins are more genetically similar than DZ twins, so they will be more phenotypically similar as well. MZ twins are more often dressed alike, mistaken for one another, or placed in similar social roles than DZ twins, which leads to more similar environments. So the shared environmental variance for MZ twins exceeds that for DZ twins, violating the EEA.
Clearly this violation throws a wrench into the logic of the CTM. The formula assumes that the greater similarity of MZ twins stems solely from their genetic identity. But if MZ twins experience more similar environments due to their phenotypic similarity (Fosse, Joseph and Richardson, 2015; Joseph et al, 2015), the difference in correlations between MZs and DZs captures both genetic variance and excess environmental similarity. Thus, heritability is overestimated (see, e.g., Bingley, Cappellari, and Tatsiramos, 2023), inflating the apparent effect of genes while masking the effects of the environment; in effect, environment is made to look like genes. Thus, h2 fails to isolate genetic influence as intended. (Note that Grayson 1989 explains this as well, but it seems to be simply ignored.) Here's the argument (a numerical sketch of the inflation follows the steps):
(1) The classical twin method assumes that its heritability (h2) estimate, computed with Falconer's formula h2 = 2(rMZ − rDZ), isolates the proportion of phenotypic variance due solely to genetic variance.
(2) For the h2 estimate to isolate genetic variance, the shared environmental variance must be equal for MZ and DZ twins.
(3) MZ twins are more genetically similar than DZ twins.
(4) Genetic similarity between individuals leads to greater similarity in their expressed phenotypic traits, and this phenotypic similarity results in greater similarity in their environmental experiences.
(5) Because MZ twins have greater genetic similarity than DZ twins, and genetic similarity leads to phenotypic similarity, which in turn results in environmental similarity, the shared environmental variance is greater for MZ twins compared to DZ twins.
(6) If the shared environmental variance for MZ twins is greater than that for DZ twins, then the EEA is false because it requires that shared environmental variance be equal for both twin types.
(7) If the EEA is false, then we cannot logically infer genetic conclusions from h2, and thus h2 reflects shared environmental variance (c2), rather than genetic variance.
(8) Any method that relies on an assumption that's logically inconsistent with the principles governing its variables (like the relationship between genetic similarity, environmental similarity, and phenotypic similarity) cannot accurately isolate its intended causal component and is therefore conceptually untenable.
(9) Thus, the classical twin method is conceptually and logically untenable since it depends on the EEA which, when false, renders h2 a measure of environmental—not genetic—variance.
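To make the overestimation concrete, here is a minimal numerical sketch in Python. The twin correlations are hypothetical, and the decomposition follows Falconer's formula; the point is only to show how excess MZ environmental similarity gets booked as genetic variance.

```python
# Falconer's decomposition from twin correlations:
#   h2 = 2 * (r_mz - r_dz)   heritability
#   c2 = 2 * r_dz - r_mz     shared environment
#   e2 = 1 - r_mz            non-shared environment

def falconer(r_mz: float, r_dz: float) -> dict:
    """Partition phenotypic variance from MZ/DZ twin correlations."""
    return {
        "h2": round(2 * (r_mz - r_dz), 3),
        "c2": round(2 * r_dz - r_mz, 3),
        "e2": round(1 - r_mz, 3),
    }

# Hypothetical observed correlations.
print(falconer(r_mz=0.80, r_dz=0.50))  # {'h2': 0.6, 'c2': 0.2, 'e2': 0.2}

# Now suppose 0.10 of the MZ correlation is really excess environmental
# similarity (an EEA violation). The method cannot see this, so the
# standard estimate above attributes that excess to genes:
print(falconer(r_mz=0.80 - 0.10, r_dz=0.50))  # {'h2': 0.4, 'c2': 0.3, 'e2': 0.3}
# The uncorrected h2 of 0.6 overstates heritability by 0.2.
```

On these made-up numbers, a modest EEA violation of 0.10 in the MZ correlation inflates h2 by 0.20, with the inflation taken directly out of the shared-environment term.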
The biological relativity critique against h2
This argument is theoretical as opposed to methodological, and it relies on Noble's (2012) principle of biological relativity, according to which there is no privileged level of causation in biological systems. Genes, cells, tissues, organs, organisms, and the environment form an interdependent network where each level influences and is influenced by the other levels. Phenotypes arise from the interaction between all of these levels, not from the independent action of any one of them.
Heritability rests on a reductionist assumption: that phenotypic variance can be neatly partitioned into genetic and environmental components, with genetic effects isolated as a distinct and quantifiable entity. This framework clearly privileges the genetic level, treating it as separate from the broader biological and ecological context. But Noble's argument directly contradicts this view. Genes don't operate in a vacuum; they do nothing on their own.
So by attempting to isolate genetic variance, heritability imposes an artificial simplicity on a complex reality (Rose, 2006). Noble's principle suggests that this separation isn't just an approximation but a fundamental conceptual flaw. Phenotypic variation emerges from the integrated functioning of all biological levels, which makes it impossible to assign causation to genes alone.
Thus, h2 is conceptually flawed, since it seeks to measure a genetic contribution that cannot be meaningfully disentangled from the holistic system in which it operates. The conceptual foundation of h2 contradicts the principle of biological relativity: since h2 attempts to assign a specific portion of trait variance to genes alone, it implicitly privileges the genetic level, suggesting that it can be disentangled from the broader biological system. Noble's argument denies that this is possible, emphasizing holism and rejecting reductionism. Thus, a priori, h2 estimates are fundamentally flawed because they rest on a reductionist framework that assumes a separability of causes incompatible with the holistic, relativistic nature of biological causation. Here's the argument:
(1) Biological relativity holds that there is no privileged level of causation in biological systems: all levels (genes, cells, tissues, organs, organisms, environments) are interdependent in producing phenotypes.
(2) h2 assumes that genetic variance can be isolated and quantified as a distinct contributor to phenotypic variance.
(C) Since biological relativity rejects the isolation of genetic effects, h2 is conceptually invalid as a measure of genetic influence.
Conclusion
Both of these arguments show the same thing—h2 is a deeply flawed concept. The EEA critique exposes a methodological weakness: since MZ twins experience more similar environments than DZ twins, the excess environmental similarity experienced by MZs masquerades as genetic influence, leaving h2 incapable of isolating genetic variance.
But Noble's biological relativity argument strikes at a deeper conceptual flaw in this practice, since it challenges the theoretical foundations of h2 itself. By highlighting the interdependence of biological systems, it dismantles the reductionist notion that genetic effects can be separated from other levels of causation. The gene-centric assumption is at odds with the reality that phenotypes are emergent properties of multi-level interactions, which renders the concept of h2 incoherent. Therefore, h2 isn't only empirically questionable; it is theoretically untenable. The conceptual model is simply not sound, given how genes really work (Burt and Simon, 2015).
Thus, again, hereditarianism fails conceptually. Even their main "tool" fails for a multitude of reasons, not least the main theoretical killshot for heritability estimates: the principle of biological relativity. The reductionist hereditarian paradigm is conceptually and logically untenable; it's time to throw it into the dustbin of history.
Sabermetrics > Psychometrics
Introduction
Spring training is ramping up to prepare MLB players for the beginning of the season at the end of the month. (Go Yankees; Yankee fan for 30+ years.) To celebrate, I'm going to discuss sabermetrics and psychometrics and why sabermetrics > psychometrics. The gist is this: sabermetricians are actually measuring aspects of baseball performance, since there are observable physical events, and the sabermetrician decides what to measure about them and expresses it in tangible values. Psychometricians, by contrast, aren't measuring anything, since there is no specified measured object, object of measurement, or measurement unit for IQ or any psychological trait. I will mount the argument that sabermetricians are actually measuring aspects of baseball performance while psychometricians aren't actually measuring aspects of human psychology.
Sabermetrics > psychometrics
Psychometrics is the so-called science of the mind. The psychometrician claims to be able to measure the mind and specific attributes of individuals. But without a specified measured object, object of measurement, and measurement unit for any psychological trait, a science of the mind just isn't possible. Psychometrics fails as true measurement since it doesn't meet the basic requirements for measurement. When something physical is measured, like the length of a stick or a person's weight, three things are needed: a clear object (a person or stick); a specific property (length or weight); and a standard unit (inches or kilograms). But unlike physical traits, mental traits aren't directly observable, and so psychometricians just assume that they are measuring what they set out to measure. People think that because numbers are assigned to things, psychometrics is measurement.
Sabermetrics was developed in the 1980s, pioneered by Bill James. The point of sabermetrics is to use advanced stats to analyze baseball, to understand player performance, and to inform how a manager should build their team. We now have tools like Statcast, which measures a ball's exit velocity off the bat as well as its launch angle. Sabermetrics clearly focuses on measurable, tangible events which can then be evaluated in more depth when we want to understand more about a certain player.
For instance, take OBP, SLG, and OPS.
OBP (on-base percentage) is the frequency with which a player reaches base, whether by getting a hit, drawing a walk, or being hit by a pitch. The OBP formula is: OBP = (hits + walks + hit by pitch) / (at bats + walks + hit by pitch + sacrifice flies). While batting average (BA) tells us how often we would expect a hitter to get a hit in an at bat, OBP incorporates walks, which are of course important for scoring opportunities.
SLG (slugging percentage) measures the total bases a player earns per at bat, giving extra weight to doubles, triples, and home runs. SLG shows how well a batter can hit for extra bases, which is basically an aspect of their batting power. (Related is isolated power, or ISO, which is SLG − BA.) The formula for SLG is total bases / at bats, where total bases = singles + 2×doubles + 3×triples + 4×home runs.
OPS (on-base plus slugging) is the sum of OBP and SLG. It combines a player's ability to get on base with their power. There is also OPS+, which adjusts for park factors (such as the ballpark's dimensions and altitude) so that players can be compared without variables that would influence their performance in either direction. A code sketch of these stats follows.
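Here is a minimal sketch in Python of how these stats fall out of raw counting stats; the stat line is hypothetical, invented for illustration.

```python
# OBP, SLG, ISO, and OPS computed from raw counting stats.

def obp(h, bb, hbp, ab, sf):
    """On-base percentage: times on base over the relevant plate appearances."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)

def slg(singles, doubles, triples, hr, ab):
    """Slugging percentage: total bases per at bat."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    return total_bases / ab

# Hypothetical season: 150 hits (90 1B, 35 2B, 5 3B, 20 HR),
# 60 walks, 5 HBP, 500 at bats, 5 sac flies.
on_base = obp(h=150, bb=60, hbp=5, ab=500, sf=5)
slugging = slg(singles=90, doubles=35, triples=5, hr=20, ab=500)
ba = 150 / 500

print(f"BA  {ba:.3f}")                  # 0.300
print(f"OBP {on_base:.3f}")             # 0.377
print(f"SLG {slugging:.3f}")            # 0.510
print(f"ISO {slugging - ba:.3f}")       # 0.210
print(f"OPS {on_base + slugging:.3f}")  # 0.887
```

Every input is a countable physical event (a hit, a walk, an at bat), which is exactly the point: the numbers trace directly back to things that happened on the field.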
When it comes to balls and strikes there is a subjective element, since different umpires have different strike zones, and therefore one umpire's strike zone will differ from another's. However, MLB is actually testing an automated ball-strike system, which would take that subjectivity out.
There is also wOBA (weighted on-base average), which accounts for how a player got on base. Home runs are weighted more than triples, doubles, or singles since they contribute more to run scoring. Thus, wOBA is calculated from observable physical events. wOBA predicts run production and is testable against actual scoring; a sketch follows.
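Here is a minimal wOBA sketch in Python. The linear weights below are illustrative stand-ins; the real weights are recalculated each season from the league run environment, so treat these values as rough approximations.

```python
# Illustrative linear weights (run values); real weights vary by season.
W = {"bb": 0.69, "hbp": 0.72, "1b": 0.88, "2b": 1.25, "3b": 1.58, "hr": 2.03}

def woba(bb, hbp, singles, doubles, triples, hr, ab, ibb=0, sf=0):
    """Weighted on-base average: each on-base event scaled by its run value."""
    numer = (W["bb"] * (bb - ibb) + W["hbp"] * hbp + W["1b"] * singles
             + W["2b"] * doubles + W["3b"] * triples + W["hr"] * hr)
    denom = ab + bb - ibb + sf + hbp
    return numer / denom

# Same hypothetical season as in the previous sketch: prints ~0.380.
print(f"{woba(bb=60, hbp=5, singles=90, doubles=35, triples=5, hr=20, ab=500, sf=5):.3f}")
```

The inputs are the same countable events as before; only the weighting changes, and the weights themselves are estimated from observed scoring.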
We also have DRS (defensive runs saved), which attempts to quantify how many runs a particular defender's defense saved, taking into account the defender's range, throwing, errors, and double-play ability. It is basically a measure of how many runs a defender cost or saved his team. So a shortstop who prevents 10 runs in a season has a DRS of +10. (This is similar to the ultimate zone rating, or UZR, stat.) Both stats are derived from measurable physical events.
Each of the stats I discussed measures specific, countable actions which are verifiable through replay and Statcast and which tie directly to the game's result (runs scored or prevented). Advanced baseball analysis now has tools like Statcast, which tracks player and ball data during the game. Statcast takes a lot of subjectivity out of certain measurements and makes them more reliable, capturing things like exit velocity, launch angle, sprint speed, and pitch spin rate. It can also track how far a ball is hit.
The argument that sabermetrics > psychometrics
(P1) If a field relies on quantifiable, observable data (physical events), then its analyses are more accurate.
(P2) If a field’s analyses are more accurate, then it is better for measurement.
(C) So if a field relies on quantifiable, observable data (physical events), then it is better for measurement.
Premise 1
Sabermetrics uses concrete numbers like hits, RBIs, and home runs. BA = hits / at bats, so a player who has 90 hits in 300 at bats has a .300 average. When it comes to psychometrics, mental traits cannot be observed or counted like the physical events in baseball. So sabermetrics satisfies P1, since it relies on quantifiable, observable data, while psychometrics fails, since its data isn't directly observable nor consistently quantifiable in a verifiable way. It should be noted that counting right or wrong answers on a test isn't the same. A correct answer on a so-called intelligence test doesn't directly measure intelligence; it's supposedly a proxy, one influenced by test design and exposure to the items in question.
Premise 2
A player's OBP can reliably indicate their contribution to runs scored, which can then be validated against the outcomes of games. Psychometrics, on the other hand, has an issue here: one's performance on a so-called psychometric test can be influenced by when the test is taken and by the type of test. So sabermetrics satisfies P2, since its accurate analyses enhance its measurement strength, while psychometrics does not: its less accurate analyses, along with its failure to meet the basic requirements for measurement, mean that it isn't measurement proper at all.
Conclusion
Sabermetrics relies on quantifiable, observable data (P1 is true), and this leads to accurate analyses, making it better for measurement (P2 is true). So sabermetrics > psychometrics, since there are actual, quantifiable, observable physical events to be measured and analyzed by sabermetricians, while the same is not true for psychometrics.
Since only counting and measurement qualify as quantification, because they provide meaningful representations of quantities, sabermetrics excels as a true quantitative field by directly tallying observable physical events. The numbers used in sabermetrics reflect real physical events and not interpretations. Batting average and on-base percentage are calculated directly from counts without introducing arbitrary scaling, meaning that a clear link to the original quantifiable events is maintained.
Conclusion
Rooted in observable, physical events, sabermetrics comes out the clear winner in this comparison. Fields that use quantifiable, observable evidence yield better, clearer insights, and these insights then allow a field to gauge its subject accurately. This clearly describes sabermetrics, whose data are quantifiable, observable physical events.
On the other hand, psychometrics fails where sabermetrics flourishes. Psychometrics lacks the observable, quantifiable substance that true measurement demands. There is no specified measured object, object of measurement, or measurement unit for IQ or any psychological trait. Therefore, psychometrics can't satisfy the premises in the argument that I have constructed.
Basically, psychometricians render a "mere application of number systems to objects" (Garrison, 2004: 63). Therefore, psychometrics offers an illusion of measurement. The psychometrician claims to assess abstract constructs that cannot be directly observed, using indirect proxies like answers to test questions, which are not the traits themselves. There is no standardized unit in psychometrics and, in the case of IQ, no true "0" point. Psychometricians order people from high to low without using true countable units.
If there is physical event analysis, then there is quantifiable data. If there is quantifiable data, then there is better measurement. So if there is physical event analysis, then there is better measurement. And since, as argued above, measurement proper requires a specified measured object, object of measurement, and measurement unit grounded in observables, if there is no physical event analysis, then there is no measurement. It's clear which field holds for each premise. The mere fact that baseball is made of physical events, which we can count and average over to characterize player performance, means that sabermetrics is true measurement (there is a specified measured object, object of measurement, and measurement unit) while psychometrics isn't (there is none of the three).
Thus, sabermetrics > psychometrics.
Gould’s Argument Against the “General Factor of Intelligence”
Introduction
In his 1981 book The Mismeasure of Man, Stephen Jay Gould mounted a long historical argument against scientific racism and eugenics. A key part of the book was the argument against the so-called "general factor of intelligence" (GFI). Gould argued that the GFI was a mere reification: an abstraction treated as a concrete entity. In this article, I will formalize Gould's argument from the book (that g is a mere statistical abstraction) and argue that we should therefore reject the GFI. Gould's argument is one of ontology; it concerns what g is or isn't. I have already touched on Gould's argument before, but this will be a more systematic approach, actually formalizing the argument and defending the premises.
Spearman's g was falsified soon after he proposed it. Jensen's g is an unfalsifiable tautology, a circular construct where test performance defines intelligence and intelligence explains performance. Geary's g rests on an identity claim, that g is identical to mitochondrial functioning and can be localized to ATP, but it lacks the causal clarity and direct measurability needed to elevate it beyond a mere correlation to the status of a real, biologically grounded entity.
Gould’s argument against the GFI
In Mismeasure, Gould attacked historical hereditarian figures for reifying intelligence as a unitary, measurable entity. Mainly attacking Spearman and Burt, Gould argued that Spearman, seeing positive correlations between tests, inferred that there must be a GFI to explain the intercorrelations. Spearman's GFI is the first principal component (PC1), which Jensen redefined as g. (We also know that Spearman saw what he wanted to see in his data; Schlinger, 2003.) Here is Gould's (1981: 252) argument against the GFI:
Causal reasons lie behind the positive correlations of most mental tests. But what reasons? We cannot infer the reasons from a strong first principal component any more than we can induce the cause of a single correlation coefficient from its magnitude. We cannot reify g as a “thing” unless we have convincing, independent information beyond the fact of correlation itself.
Using modus tollens, the argument is:
(P1) If g is a real, biologically-grounded entity, then it should be directly observable or measurable independently of statistical correlations in test performance.
(P2) But g is not directly observable or measurable as a distinct entity in the brain or elsewhere; it is only inferred from factor analysis of test scores.
(C) So g is not a real biologically-grounded entity—it is a reification, an abstraction mistaken for a concrete reality.
(P1) A real entity needs a clear, standalone existence—not just a shadow in data.
(P2) g lacks this standalone evidence; it's tied only to correlations.
(C) So g isn’t real; it’s reified.
Hereditarians treat g as quantifiable brainstuff. That is, they assume that it can already be measured. For g to be more than a statistical artifact, it would need an independent, standalone existence, like an actual physical trait, and not merely be a statistical pattern in data. But Gould shows that no one has located where in the brain this occurs, despite Jensen's (1999) insistence that g is quantifiable brainstuff:
g…[is] a biological [property], a property of the brain
The ultimate arbiter among various “theories of intelligence” must be the physical properties of the brain itself. The current frontier of g research is the investigation of the anatomical and physiological features of the brain that cause g.
…psychometric g has many physical correlates…[and it] is a biological phenomenon.
In Jensen's infamous 1969 paper, he wrote that "We should not reify g as an entity…since it is only a hypothetical construct", but then he contradicted himself 10 pages later, writing that g ("intelligence") "is a biological reality and not just a figment of social conventions." Here are the steps that Jensen uses to infer that g exists:
(1) If there is a general intelligence factor “g,” then it explains why people perform well on various cognitive tests.
(2) If “g” exists and explains test performance, the absence of “g” would mean that people do not perform well on these tests.
(3) We observe that people do perform well on various cognitive tests (i.e., test performance is generally positive).
(4) Therefore, since “g” would explain this positive test performance, we conclude that “g” exists.
Put another way, the argument is: if g exists, then it explains test performance; we see test performance; therefore g exists. That is the fallacy of affirming the consequent. Quite obviously, it seems like logic wasn't Jensen's strong point.
But if g is reified as a unitary, measurable entity, then it must be a simple, indivisible capacity which uniformly underlies all cognitive abilities. So if g is a simple, indivisible capacity that uniformly underlies all cognitive abilities, then it must be expressible as a single, consistent property unaffected by the diversity of cognitive tasks. So if g is reified as a unitary, real entity, then it must be expressible as a single cognitive property unaffected by the diversity of cognitive tasks. But g cannot be expressed as a single, consistent property unaffected by the diversity of cognitive tasks, so g cannot be reified as a unitary, real entity. We know, a priori, that a real entity must have a nature that can be defined. Thus, if g is real, then it needs to be everything (all abilities) and one thing (a conceptual impossibility). (Note that step 4 in my steps is the reification that Gould warned about.) The fact of the matter is, the existence of g is circularly tied to the test, which is where P1 comes into play.
"Subtests within a battery of intelligence tests are included on the basis of them showing a substantial correlation with the test as a whole, and tests which do not show such correlations are excluded." (Tyson, Jones, and Elcock, 2011: 67)
This quote shows the inherent circularity in defining intelligence from a hereditarian viewpoint. Since only subtests that correlate are chosen, there is a self-reinforcing loop, meaning that the intercorrelations merely reflect test design. Thus, the statistical analysis merely "sees" what is already built into the test, which then creates a false impression of a unified general factor. So using factor analysis to show that a general factor arises is irrelevant, since it's obviously engineered into the test. The claim is that "intelligence is what IQ tests measure" (e.g., Van der Maas, Kan, and Borsboom, 2014), yet the tests are constructed to CONFIRM a GFI. Thus, g isn't a discovered truth; it's a mere construct, created by how the tests themselves are created. g emerges from IQ tests designed to produce correlated subtest scores, since we know that subtests are included on the basis of correlation. The engineering of this positive manifold creates g, not as a natural phenomenon, but as a human creation. Unlike real entities, which exist independently of how we measure them, g's existence hinges on test construction, which strips it of its ontological autonomy. The simulation sketch below makes this concrete.
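Here is a minimal simulation sketch in Python (the numbers and the construction are invented for illustration). Every "subtest" is deliberately built from a shared component plus noise, mimicking the selection criterion in the quote above; a positive manifold and a strong first principal component then fall out of the construction, not out of any discovered entity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_subtests = 1000, 8

# Mimic test construction: a subtest is only retained if it correlates
# with the battery's composite, so build each retained subtest from a
# shared component plus independent noise.
shared = rng.normal(size=n_people)  # the composite, baked in by design
subtests = np.column_stack([
    0.7 * shared + 0.7 * rng.normal(size=n_people)
    for _ in range(n_subtests)
])

# The engineered positive manifold: all pairwise correlations positive.
corr = np.corrcoef(subtests, rowvar=False)
off_diag = corr[~np.eye(n_subtests, dtype=bool)]
print("smallest inter-subtest r:", off_diag.min().round(2))  # ~0.5

# PC1 "explains" a large share of the variance, by construction.
eigvals = np.linalg.eigvalsh(corr)  # ascending; last one is PC1's
print("share of variance on PC1:", (eigvals[-1] / n_subtests).round(2))  # ~0.56
```

Nothing here required discovering a general factor in nature; the factor analysis just hands back the shared component that the construction put in.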
One certainly novel view on the biology supposedly underlying g is Geary's (2018, 2019, 2020, 2021) argument that mitochondrial functioning, specifically its role in producing ATP through oxidative phosphorylation, is the biological basis for g. Since mitochondria fuel cellular processes, including neuronal activity, Geary links that efficiency to cognitive performance across diverse tasks, which would then explain the positive manifold. But Geary relies on correlations between mitochondrial health and cognitive outcomes without causal evidence tying them to g. Furthermore, environmental factors like pollutants affect mitochondrial functioning, which means that external influences, and not an intrinsic g, could drive the observed patterns. Moreover, Schubert and Hagemann (2020) showed that Geary's hypothesis doesn't hold under scrutiny. Again, g is inferred from correlational outcomes, not observed independently. Since Geary identifies g with mitochondrial functioning, he assumes that the positive manifold reflects a single entity, namely ATP efficiency. Thus, without proving the identity, Geary reifies a correlation into a thing, which is precisely what Gould warned against. Geary also assumes that the positive manifold demands a biological cause, making his argument circular (much like Jensen's). My rejection of Geary's hypothesis hinges on causality and identity: mitochondrial functioning just isn't identical with the mythical g.
The ultimate claim I’m making here is that if psychometricians are actually measuring something, then it must be physical (going back to what Jensen argued about g having a biological basis and being a brain property). So if g is what psychometricians are measuring, then g must be a physical entity. But if g lacks a physical basis or the mental defies physical reduction, then psychometrics isn’t measuring anything real. This is indeed why psychometrics isn’t measurement and, therefore, why a science of the mind is impossible.
For something to exist as a real biological entity, it must, like hemoglobin or dopamine, exhibit specific, verifiable properties: a well-defined structure or mechanism; a clear function; and causal powers that can be directly observed and measured independently of the tools used to detect them. These hallmarks distinguish real entities from mere abstractions and statistical artifacts. As we have seen, g doesn't meet these criteria, so the claim that g is a biologically grounded entity is philosophically untenable. Real biological entities have specific, delimited roles, like the role of hemoglobin in transporting oxygen. But g is proposed as a single, unified factor that explains ALL cognitive abilities. So the g concept is vague and lacks the specificity expected of real biological entities.
Hemoglobin can be measured in a blood sample, but g can't be directly observed or quantified outside the statistical framework of IQ test correlations. Factor analysis derives g from patterns of test performance, not from an independent biological substrate. Further, intelligence encompasses distinct abilities, as I have argued, and g cannot coherently unify the multiplicity of what makes up intelligence without sacrificing ontological precision. As I argued above, real entities maintain stable, specific identities; g's elasticity, stretched to explain all cognition, undermines its claim to be a singular, real thing.
Now I can unpack the argument like this:
(P1) A concept is valid if, and only if, it corresponds to an independently verifiable reality.
(P2) If g corresponds to an independently verifiable reality, then it must be directly measurable or observable beyond the correlations of IQ test scores.
(P3) But g is not directly observable beyond the correlations of IQ test scores; it is constructed through the deliberate selection of subtests that correlate with the overall test.
(C1) Thus g does not correspond to an independently verifiable reality.
(C2) Thus, g is not a valid concept.
Conclusion
The so-called evidence that hereditarians have brought to the table to infer the existence of g in the almost 100 years since Spearman clearly fails. Even after Spearman formulated it, it was quickly falsified (Heene, 2008). And for the neuroreductionist who would try to argue that MRI or fMRI can show a biological basis for the GFI, they would run right into Uttal's empirical and logical anti-neuroreduction arguments.
g is not a real, measurable entity in the brain or biology but a reified abstraction shaped by methodological biases and statistical convenience. g lacks the ontological coherence and empirical support of real biological entities. Now, if g doesn't exist, especially as an explanation for IQ test performance, then we need another explanation, and it can be found in social class.
(P1) If g doesn’t exist then psychometricians are showing other sources of variation.
(P2) The items on the test are class-dependent.
(P3) If psychometricians are showing other sources of variation and the items on the tests are class-dependent, then IQ score differences are mere surrogates for social class.
(C) Thus, if g doesn’t exist then IQ score differences are mere surrogates for social class.
We don't need a mysterious factor to explain the intercorrelations. What does explain them is class: exposure to the item content of the tests. We need to dispense with a GFI, since its conceptual incoherence and biological implausibility undermine its validity as a scientific construct. Thus, g will remain a myth. This is another thing that Gould got right in his book, along with his attack on Morton.
Gould was obviously right about the reification of g.