In 2011, we got an ungodly amount of rain where I live. We also had a few rollicking windstorms, though not as many as the year before or the year after.
After the 2011 storms, though, we had an unusually high number of trees come down. When I went walking in the state park near my house, I saw healthy tree after healthy tree on its side. The roots had pulled out of the unusually muddy earth much more easily than they would have if the soil had held a normal amount of moisture. Wind mattered, and rain mattered, but what really mattered was the combination of wind and rain: an interaction effect.
Let’s imagine that the true “data generating procedure” for how many trees came down was the following:
Downed trees = 0.1*Wind + 0.1*Rain + 0.5*Wind*Rain (the interaction term) + 0.3*Random Noise
Let’s pretend wind and rain are totally uncorrelated, normally distributed random variables. Interested in how wind is associated with our outcome, we sample 10,000 acres and get something like this:
Wind explains some of the variation in downed trees, but not everything; there’s clearly something else going on, and the more wind you have, the more those other things make a difference.
On the other hand, let’s say we only sample the acres that got a lot of rain, over 1 standard deviation above the mean; there are about 1,600 acres that make the cutoff. Let’s see how downed trees vary with wind across those remaining acres:
Now, wind explains almost all the variation in downed trees: for this restricted sample, the correlation between wind and downed trees is over 0.94.
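If you want to try this comparison yourself, here is a minimal sketch in Python using only the standard library. It assumes the random noise is standard normal, just like wind and rain; the seed and the helper name `pearson` are my own choices for illustration. The exact correlations will shift with the seed and with how the noise is scaled, so don't expect to reproduce the figures above to the decimal. What should hold up is the pattern: a weak correlation across all acres, and a far stronger one among the rainy acres.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # arbitrary seed, only for repeatability
n_acres = 10_000

# Wind and rain: independent standard normal draws, one pair per acre.
wind = [rng.gauss(0, 1) for _ in range(n_acres)]
rain = [rng.gauss(0, 1) for _ in range(n_acres)]
# Assumption: the noise term is also standard normal.
noise = [rng.gauss(0, 1) for _ in range(n_acres)]

# The data generating procedure from the formula above.
downed = [
    0.1 * w + 0.1 * r + 0.5 * w * r + 0.3 * e
    for w, r, e in zip(wind, rain, noise)
]

# Correlation between wind and downed trees across every acre.
r_all = pearson(wind, downed)

# Keep only acres with rain more than 1 standard deviation above the mean.
wet = [i for i in range(n_acres) if rain[i] > 1.0]
r_wet = pearson([wind[i] for i in wet], [downed[i] for i in wet])

print(f"all acres: n={n_acres}, corr(wind, downed)={r_all:.2f}")
print(f"wet acres: n={len(wet)}, corr(wind, downed)={r_wet:.2f}")
```

Nothing about the formula changes between the two printed lines; the only difference is which acres we chose to look at.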
When we observe a single variable or causal factor explaining almost all the variation in a dataset (whether downed trees, test scores, marriage rates, crime, or disease), it could be that the variable truly is the sole or fundamental determinant of the behavior or outcome, in all times and places.
Or it could be that the true causal process involves the interaction of multiple variables, but the other variables are only observed, in our dataset, across a quite restricted range.