Jun 24, 2008

“Correlation Is Enough.”

Wired takes a gander at the way scientific discovery is changing due to massive amounts of data acquisition and storage, Google style:

Faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the “beautiful story” phase of a discipline starved of data) is that we don’t know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on.

Now biology is heading in the same direction. The models we were taught in school about “dominant” and “recessive” genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton’s laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility.

In short, the more we learn about biology, the further we find ourselves from a model that can explain it.

There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.

If the words “discover a new species” call to mind Darwin and drawings of finches, you may be stuck in the old way of doing science. Venter can tell you almost nothing about the species he found. He doesn’t know what they look like, how they live, or much of anything else about their morphology. He doesn’t even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species.

If the assumptions here are true (that data no longer needs to be modeled but can just be thrown en masse into storage, with the engines of algorithmic analysis then let loose to find patterns), what does it mean for the way we discover information? Will we become more accurate or less? Will proficiency increase while specificity decreases? (It sounds like it.)

And taking the question to the next, more mundane practical level: how the heck are they going to access all this data on magnetic storage arrays? Solid state devices are, I’m given to believe, faster and more capable of random access, if quite a bit more pricey. But when you’re talking petabytes of data, it could take stoopid-long to scan through the files. I’ve only got 650 GB of HD space at home, and when I search for a file it runs for a good 10 or 15 minutes if the location has never been indexed.
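
For a rough sense of scale, here is a back-of-the-envelope sketch in Python. The numbers are my assumptions, not anything from the Wired piece: a sustained read rate of 100 MB/s per magnetic drive, a decimal petabyte (10^15 bytes), and a hypothetical `scan_time_seconds` helper that pretends the work splits perfectly evenly across drives.

```python
def scan_time_seconds(total_bytes: float, bytes_per_second: float, drives: int = 1) -> float:
    """Time to read every byte once, assuming the data is split evenly
    across `drives` and all of them read in parallel at the same rate."""
    return total_bytes / (bytes_per_second * drives)


PETABYTE = 10**15                # decimal petabyte, in bytes
ASSUMED_DISK_RATE = 100 * 10**6  # assumed 100 MB/s sustained read per drive

single = scan_time_seconds(PETABYTE, ASSUMED_DISK_RATE)
parallel = scan_time_seconds(PETABYTE, ASSUMED_DISK_RATE, drives=1000)

print(f"one drive: {single / 86400:.0f} days")       # one drive: 116 days
print(f"1,000 drives: {parallel / 3600:.1f} hours")  # 1,000 drives: 2.8 hours
```

Under those assumptions, a single drive would grind for months, while a thousand drives reading in parallel would finish in an afternoon. Real systems won't divide the work that cleanly, but it suggests the answer has more to do with spreading the data across many machines than with any one fast disk.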

These are the questions I ask myself on days when I promise less posting. I then go on to post them. Unintentional irony is a mainstay in my life.

Written by Steve Skojec in: Geek Stuff
