My grandfather has often said that he had two main lucky breaks in life: walking into the class at UCLA where he met my grandmother, and being pulled off the minefield on an Italian hillside when he was thrashing around after stepping on one mine, before he could step on another. In both cases, it was easy to see what he meant– if he’d been assigned to a different section of the sophomore Survey of English Literature class, or if the medic assigned to his unit the day he stepped on the mine had been less cool-headed– things would be very different. There, but for the grace of God, doesn’t go me.
Thinking about causality– about the reasons why things are the way they are, and how they might change or be changed in the future– is a deep and muddy hole, down which many a philosophical rabbit has jumped. But social scientists have often, in recent years, opted for a more comfortable briar patch: imagining what the ideal experiment would be to test a proposed factor or program or policy, and then approximating that ideal experiment– through statistical methods, by finding quasi-random processes in the real world, or by rounding up enough money to go out and assign one group or another to the proposed program or factor through actual random assignment.
“Causes are those things that could be treatments in hypothetical experiments,” the great statistician Don Rubin is said to have said, and though this is clearly wrong at a semantic level– lots of things are described in ordinary language as causes even when it is almost impossible to imagine assigning them as treatments in an experiment– it is a good enough definition for many uses. If there is one core skill shared across graduate programs in the social sciences these days, it might be described as “identifying the counterfactual”: finding the version of my grandfather who didn’t walk into that classroom or get pulled off that mine, and checking up on how he is doing.
Describing the counterfactual in educational studies is especially difficult, in part because of the First Law of Educational Inefficacy: every educational setting, especially in a rich and institutionally argumentative country like the United States, is a sea of competing causes and initiatives. Somebody’s proposed policy or program or curriculum is being tried in every classroom and every lesson, even if more than half the time that same program is being ignored by all the principals or teachers or kids who are theoretically trying it out.
RCTs can, when well executed, cut this Gordian knot– even if the kids, classrooms, and schools trying out an initiative are all variegated and many-splendored things, randomly assigning them to one program or another makes those differences, visible and invisible, average out, leaving the “true” impact of the intervention to be observed. But even a well-executed RCT requires a leap of faith that the differences observed are due to the treatment, and not to specific characteristics of the comparison group– and even well-intentioned researchers face incentives to obscure effects that don’t go in the right direction or don’t make it through the statistical significance (p<0.05) filter for publication.
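A toy simulation makes the averaging-out concrete. The sketch below is plain Python with invented numbers (nothing here comes from any real study): each simulated student gets an unmeasured “background” trait that strongly shifts their outcome, but because assignment is a coin flip, the trait lands roughly evenly in both arms, and a naive difference in means still recovers the true effect.

```python
import random
import statistics

random.seed(0)

def simulate_trial(n_students=2000, true_effect=0.2):
    """One hypothetical RCT. Each student has an unobserved 'background'
    trait that shifts their outcome; coin-flip assignment spreads that
    trait roughly evenly across the two arms."""
    treated, control = [], []
    for _ in range(n_students):
        background = random.gauss(0, 1)        # invisible difference between kids
        in_treatment = random.random() < 0.5   # random assignment
        outcome = background + (true_effect if in_treatment else 0.0) + random.gauss(0, 1)
        (treated if in_treatment else control).append(outcome)
    return statistics.mean(treated) - statistics.mean(control)

# Averaged over many simulated trials, the naive difference in means
# recovers the true effect, even though 'background' was never measured.
estimates = [simulate_trial() for _ in range(200)]
print(round(statistics.mean(estimates), 3))
```

If assignment instead depended on background (more advantaged students opting in, say), the same difference in means would be partly measuring that advantage rather than the program.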
This process is often described as “p-hacking,” though Andrew Gelman’s phrase “The Garden of Forking Paths” is perhaps more apt. Scott Alexander describes one example in a Slate Star Codex post on a “growth mindset” evaluation (http://slatestarcodex.com/2015/04/22/growth-mindset-3-a-pox-on-growth-your-houses/), where the reported effects of an RCT appear only when one very specific analytic technique is used.
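The forking-paths problem is easy to demonstrate with another toy simulation, again in plain Python with made-up parameters (a generic sketch, not a model of the growth-mindset study): give a treatment that truly does nothing ten chances, say ten outcomes or ten analytic choices, report whichever comparison looks best, and the nominal 5% false-positive rate balloons.

```python
import random
import statistics
from statistics import NormalDist

random.seed(1)

def p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def one_null_study(n=100, n_outcomes=10):
    """A study of a treatment with no effect at all, where the analyst
    measures ten outcomes and reports only the most favorable one."""
    best_p = 1.0
    for _ in range(n_outcomes):
        treated = [random.gauss(0, 1) for _ in range(n)]
        control = [random.gauss(0, 1) for _ in range(n)]
        best_p = min(best_p, p_value(treated, control))
    return best_p

# Each individual test has a 5% false-positive rate, but taking the
# best of ten pushes the chance of a "significant" finding toward
# 1 - 0.95**10, i.e. roughly 40%.
studies = [one_null_study() for _ in range(500)]
print(sum(p < 0.05 for p in studies) / len(studies))
```

Preregistering a single primary outcome, or correcting for the number of comparisons actually made, is the standard way back out of the garden.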
There, but for the grace of Stata, who knows where we would be.