The “Goya Beans and Wildflowers” story I posted last week (true by the way) is my way of explaining the following ideas:
Idea #1) If we take behavioral genetics results even somewhat seriously, then VAM must have limited validity as a measure of teacher quality.
A) Behavioral genetics results indicate that the influence of genetics on cognitive skills *increases* as kids age. This is perhaps the most consistent single result in decades of behavioral genetics studies:
“In the context of current concerns about replication in psychological science, we describe 10 findings from behavioral genetic research that have replicated robustly. These are ‘big’ findings, both in terms of effect size and potential impact on psychological science, such as linearly increasing heritability of intelligence from infancy (20%) through adulthood (60%).”
B) The ages at which VAM is primarily used (8-13, and soon to be 8-17) are precisely the ages at which scientists observe a big increase in the percent of variance in cognitive ability due to genetics rather than environment. (See Figure 1 here.)
C) VAM assumes the reverse: that you can “cancel out” the student-specific prior factors by controlling for prior achievement (the pretest) and observed student characteristics (eligibility for free lunch, ELL status, race, gender).
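Concretely, a generic VAM specification of this kind (my stylized rendering, not any particular district’s exact model) regresses this year’s score on last year’s score, observed characteristics, and a teacher effect:

$$ y_{it} = \beta_0 + \beta_1 \, y_{i,t-1} + \gamma' X_{it} + \mu_{j(i,t)} + \varepsilon_{it} $$

Here $y_{it}$ is student $i$’s score in year $t$, $X_{it}$ collects the observed characteristics, $\mu_{j(i,t)}$ is the estimated effect of the teacher the student is assigned to, and $\varepsilon_{it}$ is everything else. The model treats whatever is left over after conditioning on $y_{i,t-1}$ and $X_{it}$ as attributable to the teacher; student-specific growth trajectories have nowhere to go but into $\mu_j$ and $\varepsilon_{it}$.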
D) For this reason, existing VAM models mostly account for the fact that different groups, on average, are likely to learn more or less in a given year (though they still often combine Asian and white students). But the within-group variance in how much the kids in a classroom learn is assumed, by the nature of the model, to be a direct measure of how effective the kids’ teacher was that year.
E) Insofar as anyone has looked at the heritability of individual student value-added, it appears to be just as heritable as baseline achievement, at around or over 50 percent. This is also true in relatively homogeneous populations; it isn’t just an artifact of discrimination against particular groups.
Idea #2) Some of the weird stuff people observe about VAM could be explained by assuming Idea # 1 is true.
A) VAM doesn’t have very good within-teacher reliability. For example, there is only a 0.35 correlation between a NYC teacher’s VAM in one year and the same teacher’s VAM teaching the same subject the following year.
B) In spite of the Gates Foundation investing hundreds of millions of dollars in encouraging the adoption of teacher evaluation systems and in attempting to demonstrate the correlations among different measures of teacher quality (including VAM), their own data do not validate this. In settings in which students are randomized, VAM is only very weakly correlated with teacher quality as measured by multiple observers observing multiple lessons using a structured rubric: a teacher would have to move from the 4th to the 96th percentile in observer-measured quality to show a 0.06 SD increase in student achievement, equivalent to moving kids from the 50th to the 52nd percentile in achievement. Basically, going from the worst to the best teachers as measured by observers yields a measurable but quite small increase in measured student achievement.
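The percentile arithmetic here is easy to check with a normal-distribution calculation (assuming normally distributed scores; the 4th-to-96th span and the 0.06 SD figure are the ones quoted above):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal

# Moving from the 4th to the 96th percentile of observer-rated quality
# spans about 3.5 SD of that measure.
span_sd = z.inv_cdf(0.96) - z.inv_cdf(0.04)
print(f"4th -> 96th percentile span: {span_sd:.2f} SD")

# A 0.06 SD gain moves a median (50th percentile) student to roughly
# the 52nd percentile.
new_pct = z.cdf(0.06) * 100
print(f"50th percentile + 0.06 SD -> {new_pct:.1f}th percentile")
```

So a 3.5 SD improvement in the observer measure buys about a 2-percentile improvement in student achievement.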
C) As Jesse Rothstein has repeatedly found, in almost every dataset examined, a student’s 5th grade teacher’s VAM “predicts” that student’s 4th grade gains. A genuinely causal teacher effect cannot reach backward in time, so this is a sign the model is capturing something about the kids rather than the teachers.
D) Impacts associated with observational VAM scores don’t fade out as fast as actual experimental impacts do. Yes, I know people use this to argue that teacher impacts are just super-duper important and class size or whatever isn’t. But the more reasonable interpretation is that experimental impacts fade out because that’s what impacts do. Why would VAM be the only program “impact” that doesn’t show fade-out? Because it isn’t an impact; it is measuring an underlying characteristic of the kids. Show me the component of VAM that does fade out, and you’ll be on your way to tracking down true teacher impacts.
So what gives?
Over time, people converge to their natural ability, and VAM is measuring a piece of that process of convergence. (This is especially true if the tests focus on abstract reasoning ability rather than school-specific knowledge.) So if, by the luck of the draw, you get a bunch of kids who are going to converge upward, then your VAM is high, and those kids go on to earn a bunch more as adults.
I think this is the difference between viewing the time series of human development as a random walk, permanently buffeted by environmental shocks, and viewing it as mean-reverting toward an individual-specific set point. (A random walk has a unit root: shocks never fade. Convergence to natural ability is the opposite: shocks fade, and the trajectory gets pulled back toward the child’s own level.) In any biological system there is going to be significant variation in future growth trajectories, even under the same environmental conditions, as long as there is significant genetic variation.
There are a very small number of studies that potentially avoid these issues and show teacher effects and some validity of VAM. For example, the Transferring Talented Teachers study identified teachers who had high VAM in one school and paid them to move to another school, where students were randomly assigned either to their classes or to others. The study found positive effects at the elementary schools to which the high-VAM teachers were transferred.
Note however that…
A) The teachers’ VA shrank considerably after the teachers were transferred to a new group of kids. It didn’t shrink to zero, but it still shrank.
B) The study shows that teachers transferred in from another school are better than the teachers who remain in a lower-performing school that has trouble staffing its classrooms, which is not that high a bar to pass.
C) This same study showed positive impacts only for elementary teachers: there were zero or slightly negative impacts for middle school teachers in their new schools.
D) Other well-executed RCTs that would shed light on the validity of VAM generally have zero impacts. For example, you can search the total number of studies in “Teacher and Leader Effectiveness” that meet the federal government’s standards for attrition and follow-up. There is a single study with “potentially promising” effects: a high-attrition RCT with an unusual pattern of zero initial impacts but positive long-term impacts. All the other studies that met federal standards show zero impacts.
The overall conclusion is not that VAM has zero validity (it almost certainly has some), but that it has much lower validity than is claimed, and that the main determinant of how much kids learn in school from year to year is not their teachers but the kids themselves. It’s not that teachers don’t matter: I’ve spent my life believing that teaching matters quite a bit, at least in terms of whether school is an interesting or pleasant place to be, and good teachers (my own, my children’s, and the colleagues I had over ten years) are the people I respect most in the world. But teachers are not the main determinant of test scores or earnings for the students they teach.