Tim Harford has written an excellent article about the pitfalls of “big data” analysis. It basically boils down to this:
Correlation isn’t causation, no matter how much data you have.
But the bigger your dataset, the more likely you are to find correlations within it.
To illustrate, here is a graph of 100 random walks, each with 20 points (each point of a random walk equals its previous value plus a random step, which I've chosen to be uniformly distributed between -0.5 and 0.5):
I’m actually quite surprised that Excel didn’t crash when I drew this chart, so maybe it doesn’t meet the definition of big data.
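If you'd rather not risk your copy of Excel, the same simulation takes a few lines in Python with NumPy (this is my sketch of it; the seed and variable names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the walks are reproducible
n_series, n_points = 100, 20

# Each walk is the cumulative sum of Uniform(-0.5, 0.5) steps.
steps = rng.uniform(-0.5, 0.5, size=(n_series, n_points))
walks = np.cumsum(steps, axis=1)  # shape (100, 20): one row per walk
```

Plotting `walks.T` with matplotlib reproduces the chart above.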
Anyway, these random walks have absolutely no relationship to each other: each is just the result of adding up 20 random numbers. But if you tested all 4,950 possible pairwise correlations between these 100 series, you would find that a staggering 2,175 (44%!) are statistically significant at the 95% level purely by chance (yes, I did test them all).
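You can run the same exhaustive test yourself; here's a sketch using SciPy's Pearson correlation (the exact count will depend on the random seed, so don't expect to hit 2,175 on the nose):

```python
from itertools import combinations

import numpy as np
from scipy.stats import pearsonr

# Regenerate the 100 random walks of 20 points each.
rng = np.random.default_rng(0)
walks = np.cumsum(rng.uniform(-0.5, 0.5, size=(100, 20)), axis=1)

# Test every pair of series: 100 choose 2 = 4,950 correlations.
pairs = list(combinations(range(100), 2))
significant = 0
for i, j in pairs:
    _, p_value = pearsonr(walks[i], walks[j])
    if p_value < 0.05:  # "significant" at the 95% level
        significant += 1

print(f"{significant} of {len(pairs)} correlations "
      f"({significant / len(pairs):.0%}) are 'significant'")
```

If these were 4,950 independent tests on well-behaved data, you'd expect only about 5% false positives; the far higher rate here comes from the strong autocorrelation within each random walk, which the significance test doesn't account for.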
One solution to this mess is, of course, to have a good theory behind your analysis in the first place, and to test only the correlations that make sense within that theory. This still doesn't guarantee you won't find spurious correlations, but it puts you in a much better place than blindly testing everything possible in your dataset.
So, far from making theory redundant, I think big data makes it even more important. The nice thing about big data is that it potentially allows testing more intricate and nuanced theories than is possible with small datasets.