Hypothesis vs Data Driven Science

Science progresses in a dualistic fashion. You can either generate a new hypothesis out of existing data and conduct science in a data-driven way, or generate new data for an existing hypothesis and conduct science in a hypothesis-driven way. For instance, when Kepler was looking at the astronomical data sets to come up with his laws of planetary motion, he was doing data-driven science. When Einstein came up with his theory of General Relativity and asked astronomers to verify the theory’s prediction for the bending of starlight by the Sun, he was doing hypothesis-driven science. (The anomalous precession of the perihelion of Mercury's orbit, by contrast, was already in the data, and the theory accounted for it after the fact.)

Similarly, technology can be problem-driven (the counterpart of “hypothesis-driven” in science) or tool-driven (the counterpart of “data-driven” in science). When you start with a problem, you look for the kinds of (existing or not-yet-existing) tools you can throw at the problem, and in what combination. (This is similar to thinking about what kinds of experiments you can do to generate relevant data to support a hypothesis.) Conversely, when you start with a tool, you try to find a use case where you can deploy it. (This is similar to starting off with a data set and digging around to see what kinds of hypotheses you can extract out of it.) Tool-driven technology development is much riskier and more stochastic. It is taboo for most technology companies, since investors do not like random tinkering and prefer funding problems with high potential economic value and entrepreneurs who “know” what they are doing.

Of course, new tools allow you to ask new kinds of questions of existing data sets. Hence, problem-driven technology (by developing new tools) leads to more data-driven science. And this is exactly what is happening now, at a massive scale. With the development of cheap cloud computing (and storage) and deep learning algorithms, scientists are equipped with some very powerful tools to attack old data sets, especially in complex domains like biology.


Higher Levels of Serendipity

One great advantage of data-driven science is that it involves tinkering and “not really knowing what you are doing”. This leads to fewer biases and more serendipitous connections, and thereby to the discovery of more transformative ideas and hitherto unknown interesting patterns.

Hypothesis-driven science has a direction from the beginning. Hence surprises are hard to come by, unless you have exceptionally creative intuition. For instance, the theory of General Relativity was based on one such intuitive leap by Einstein. (There has not been such a great leap since then. So it is extremely rare.) Quantum Mechanics, on the other hand, was literally forced by experimental data. It was so counterintuitive that people refused to believe it. All they could do was turn their intuition off and listen to the data.

Previously, data sets were not huge, so scientists could literally eyeball them. Today this is no longer possible. That is why scientists now need computers, algorithms and statistical tools to help them decipher new patterns.

Governments do not give money to scientists so that they can tinker around and do whatever they want. So a scientist applying for a grant needs to know what he is doing. This forces everyone to be in a hypothesis-driven mode from the beginning and thereby leads to fewer transformative ideas in the long run. (Hat tip to Mehmet Toner for this point.)

Science and technology are polar opposite endeavors. Governments funding science like investors fund technology is a major mistake, and also an important reason why today some of the most exciting science is being done inside closed private companies rather than open academic communities.


Less Democratic Landscape

There is another good reason why the best scientists are leaving academia. You need good quality data to do science within the data-driven paradigm, and since data is so easily monetizable, the largest data sets are being generated by private companies. So it is not surprising that the most cutting-edge research in fields like AI is being done inside companies like Google and Facebook, which also provide the necessary compute power to play around with these data sets.

While hypothesis generation gets better when it is conducted in a decentralized open manner, the natural tendency of data is to be centralized under one roof where it can be harmonized and maintained consistently at a high quality. As they say, “data has gravity”. Once you pass certain critical thresholds, data starts generating strong positive feedback effects and thereby attracts even more data. That is why investors love it. Using smart data strategies, technology companies can build a moat around themselves and render their business models a lot more defensible.

In a typical private company, what data scientists do is throw thousands of different neural networks at some massive internal data sets and simply observe which one gets the job done better. This of course is empiricism in its purest form, no different from blindly screening millions of compounds during a drug development process. As they say, just throw it against a wall and see if it sticks.
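To make the "throw it against a wall" routine concrete, here is a minimal sketch of blind screening. Everything in it is an assumption made for illustration: the synthetic data set, the two-parameter candidates standing in for neural networks, and the helper names `make_data`, `mse` and `screen`. Real screening pipelines are of course far larger, but the logic is the same: no theory picks the candidates, the score does.

```python
import random

def make_data(n, rng):
    """Synthetic data set: y = 3x - 2 plus noise (an assumption for the demo)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x - 2.0 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys

def mse(model, xs, ys):
    """Mean squared error of a (slope, intercept) candidate."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def screen(candidates, xs, ys):
    """Score every candidate and keep whichever one 'sticks' best."""
    return min(candidates, key=lambda m: mse(m, xs, ys))

rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(200, rng)

# Thousands of blindly generated candidates, with no theory guiding the choice.
candidates = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(5000)]
best = screen(candidates, train_x, train_y)

print("best candidate:", best)
print("held-out error:", mse(best, test_x, test_y))
```

Note that the winner is evaluated on held-out data. The screening tells you which candidate works, and nothing about why it works.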

This brings us to a major problem about big-data-driven science.


Lack of Deep Understanding

There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

Chris Anderson - The End of Theory

We cannot understand the complex machine learning models we are building. In fact, we train them the same way one trains a dog. That is why they are called black-box models. For instance, when the stock market experiences a flash crash we blame the algorithms for getting into a stupid loop, but we never really understand why they do so.

Is there any problem with this state of affairs if these models get the job done, make good predictions and (even better) earn us money? Can scientists not adopt the same pragmatic attitude as technologists, focus on results only, content themselves with successful manipulation of nature and leave true understanding aside? Are the data sizes not already too huge for human comprehension anyway? Why do we expect machines to be able to explain their thought processes to us? Perhaps they are the beginnings of the formation of a higher level life form, and we should learn to trust them about the activities they are better at than us?

Perhaps we have been under an illusion all along and our analytical models have never really penetrated that deep into nature anyway?

Closed analytic solutions are nice, but they are applicable only for simple configurations of reality. At best, they are toy models of simple systems. Physicists have known for centuries that the three-body problem or three dimensional Navier Stokes do not afford a closed form analytic solutions. This is why all calculations about the movement of planets in our solar system or turbulence in a fluid are all performed by numerical methods using computers.

Carlos E. Perez - The Delusion of Infinite Precision Numbers

Is it a surprise that as our understanding gets more complete, our equations become harder to solve?

To illustrate this point of view, we can recall that as the equations of physics become more fundamental, they become more difficult to solve. Thus the two-body problem of gravity (that of the motion of a binary star) is simple in Newtonian theory, but unsolvable in an exact manner in Einstein’s Theory. One might imagine that if one day the equations of a totally unified field are written, even the one-body problem will no longer have an exact solution!

Laurent Nottale - The Relativity of All Things (Page 305)

It seems like the entire history of science is a progressive approximation to an immense computational complexity via increasingly sophisticated (but nevertheless quite simplistic) analytical models. This trend is obviously not sustainable. At some point we should perhaps just stop theorizing and let the machines figure out the rest:

In new research accepted for publication in Chaos, they showed that improved predictions of chaotic systems like the Kuramoto-Sivashinsky equation become possible by hybridizing the data-driven, machine-learning approach and traditional model-based prediction. Ott sees this as a more likely avenue for improving weather prediction and similar efforts, since we don’t always have complete high-resolution data or perfect physical models. “What we should do is use the good knowledge that we have where we have it,” he said, “and if we have ignorance we should use the machine learning to fill in the gaps where the ignorance resides.”

Natalie Wolchover - Machine Learning’s ‘Amazing’ Ability to Predict Chaos
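The idea of using theory where we have it and machine learning where ignorance resides can be sketched in a few lines. This is emphatically not the method of the research quoted above. It is a toy under stated assumptions: `true_system` stands in for nature, `physical_model` is a deliberately incomplete theory, and the "machine learning" is a plain nearest-neighbor average of the residuals. All names and numbers are hypothetical.

```python
import math
import random

def true_system(x):
    """Stand-in for nature; unknown in practice (assumed here for the demo)."""
    return 2.0 * x + 0.5 * math.sin(3.0 * x)

def physical_model(x):
    """Imperfect theory: captures the trend, misses the oscillation."""
    return 2.0 * x

rng = random.Random(1)
train_x = [rng.uniform(0.0, 3.0) for _ in range(500)]
# Residuals = noisy observations minus what the theory already explains.
residuals = [true_system(x) + rng.gauss(0.0, 0.02) - physical_model(x)
             for x in train_x]

def learned_correction(x, k=5):
    """Data-driven part: average residual of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(residuals[i] for i in nearest) / k

def hybrid_model(x):
    """Theory where we have it, learned correction where we do not."""
    return physical_model(x) + learned_correction(x)

def rmse(model, xs):
    """Root mean squared error against the (normally hidden) true system."""
    return math.sqrt(sum((model(x) - true_system(x)) ** 2 for x in xs) / len(xs))

test_x = [0.03 * i for i in range(101)]  # evenly spaced points on [0, 3]
print("theory only:", rmse(physical_model, test_x))
print("hybrid     :", rmse(hybrid_model, test_x))
```

The learned part only has to model what the theory gets wrong, which is a much smaller job than modeling the whole system from scratch.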

Statistical approaches like machine learning have often been criticized for being dumb. Noam Chomsky has been especially vocal about this:

You can also collect butterflies and make many observations. If you like butterflies, that's fine; but such work must not be confounded with research, which is concerned to discover explanatory principles.

- Noam Chomsky as quoted in Colorless Green Ideas Learn Furiously

But these criticisms are akin to calling reality itself dumb since what we feed into the statistical models are basically virtualized fragments of reality. Analytical models conjure up abstract epi-phenomena to explain phenomena, while statistical models use phenomena to explain phenomena and turn reality directly onto itself. (The reason why deep learning is so much more effective than its peers among machine learning models is that it is hierarchical, just like reality is.)

This brings us to the old dichotomy between facts and theories.


Facts vs Theories

Long before computer scientists came onto the scene, there were prominent humanists (and historians) fiercely defending fact against theory.

The ultimate goal would be to grasp that everything in the realm of fact is already theory... Let us not seek for something beyond the phenomena - they themselves are the theory.

- Johann Wolfgang von Goethe

Reality possesses a pyramid-like hierarchical structure. It is governed from the top by a few deep high-level laws, and manifested in its utmost complexity at the lowest phenomenological level. This means that there are two strategies you can employ to model phenomena.

  • Seek the simple. Blow your brains out, discover some deep laws and run simulations that can be mapped back to phenomena.

  • Bend the complexity back onto itself. Labor hard to accumulate enough phenomenological data and let the machines do the rote work.

One approach is not inherently superior to the other, and both are hard in their own ways. Deep theories are hard to find, and good quality facts (data) are hard to collect and curate in large quantities. Similarly, a theory-driven (mathematical) simulation is cheap to set up but expensive to run, while a data-driven (computational) simulation (of the same phenomena) is cheap to run but expensive to set up. In other words, while a data-driven simulation is parsimonious in time, a theory-driven simulation is parsimonious in space. (Good computational models satisfy a dual version of Occam’s Razor. They are heavy in size, with millions of parameters, but light to run.)
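The time-versus-space trade-off can be seen in a toy comparison. The sketch below is an illustration under assumptions of my own choosing: a body falling with quadratic drag, the constants `G` and `K`, and the function names are all hypothetical. The table is filled from the simulator purely as a stand-in for real measurements, which would normally be the expensive part to collect.

```python
import bisect

G, K = 9.81, 0.05  # gravity and drag coefficient (illustrative values)

def simulate_velocity(t_end, dt=1e-3):
    """Theory-driven: two constants to store, many integration steps per query."""
    v = 0.0
    for _ in range(int(round(t_end / dt))):
        v += (G - K * v * v) * dt  # explicit Euler step of dv/dt = g - k v^2
    return v

def build_table(t_max, dt=1e-3, every=10):
    """Expensive setup, done once: record the velocity every `every` steps."""
    times, velocities, v = [0.0], [0.0], 0.0
    for step in range(1, int(round(t_max / dt)) + 1):
        v += (G - K * v * v) * dt
        if step % every == 0:
            times.append(step * dt)
            velocities.append(v)
    return times, velocities

def lookup_velocity(t, times, velocities):
    """Data-driven: binary search plus linear interpolation between samples."""
    i = bisect.bisect_right(times, t)
    if i == 0:
        return velocities[0]
    if i == len(times):
        return velocities[-1]
    t0, t1 = times[i - 1], times[i]
    v0, v1 = velocities[i - 1], velocities[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

times, velocities = build_table(10.0)
print("stored samples:", len(times))
print("simulated:", simulate_velocity(3.217))
print("looked up:", lookup_velocity(3.217, times, velocities))
```

The simulator carries two numbers and pays thousands of steps per question. The table carries a thousand numbers and answers each question with a handful of comparisons.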

Some people try to mix the two philosophies, inject our causal models into the machines and enjoy the best of both worlds. I believe that this approach is fundamentally mistaken, even if it proves to be fruitful in the short run. Rather than biasing the machines with our theories, we should just ask them to economize their own thought processes and thereby come up with their own internal causal models and theories. After all, abstraction is just a form of compression, and when we talk about causality we (in practice) mean causality as it fits into the human brain. In the actual universe, everything is completely interlinked with everything else, and causality diagrams are unfathomably complicated. Hence, we should be wary of pre-imposing our theories on machines whose intuitive powers will soon surpass ours.

Remember that, in biological evolution, the development of unconscious (intuitive) thought processes came before the development of conscious (rational) thought processes. It should be no different for the digital evolution.

Side Note: We suffered an AI winter for mistakenly trying to flip this order and asking machines to develop rational capabilities before developing intuitive capabilities. When a scientist comes up with a hypothesis, it is a simple effable distillation of an unconscious intuition which is of an ineffable, complex statistical form. In other words, it is always “statistics first”. Sometimes the progression from the statistical to the causal takes place out in the open among a community of scientists (as happened in the smoking-causes-cancer research), but more often it just takes place inside the mind of a single scientist.


Continuing Role of the Scientist

Mohammed AlQuraishi, a researcher who studies protein folding, wrote an essay exploring a recent development in his field: the creation of a machine-learning model that can predict protein folds far more accurately than human researchers. AlQuraishi found himself lamenting the loss of theory over data, even as he sought to reconcile himself to it. “There’s far less prestige associated with conceptual papers or papers that provide some new analytical insight,” he said, in an interview. As machines make discovery faster, people may come to see theoreticians as extraneous, superfluous, and hopelessly behind the times. Knowledge about a particular area will be less treasured than expertise in the creation of machine-learning models that produce answers on that subject.

Jonathan Zittrain - The Hidden Costs of Automated Thinking

The role of scientists in the data-driven paradigm will obviously be different but not trivial. Today’s world-champions in chess are computer-human hybrids. We should expect the situation for science to be no different. AI is complementary to human intelligence and in some sense only amplifies the already existing IQ differences. After all, a machine-learning model is only as good as the intelligence of its creator.

He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast.

- Leonardo da Vinci

Artificial intelligence (at least in its present form) is like a baby. Either it can be spoon-fed data or it gorges on everything. But, as we know, what makes great minds great is what they choose not to consume. This is where the scientists come in.

Deciding what experiments to conduct and what data sets to use is no trivial task. Choosing which portion of reality to “virtualize” is an important judgment call. Hence all data efforts are inevitably hypothesis-laden and therefore non-trivially involve the scientist.

For 30 years quantitative investing started with a hypothesis, says a quant investor. Investors would test it against historical data and make a judgment as to whether it would continue to be useful. Now the order has been reversed. “We start with the data and look for a hypothesis,” he says.

Humans are not out of the picture entirely. Their role is to pick and choose which data to feed into the machine. “You have to tell the algorithm what data to look at,” says the same investor. “If you apply a machine-learning algorithm to too large a dataset often it tends to revert to a very simple strategy, like momentum.”

The Economist - March of the Machines

True, each data generation effort is hypothesis-laden and each scientist comes with a unique set of biases generating a unique set of judgment calls, but at the level of society, these biases eventually get washed out through (structured) randomization via sociological mechanisms and historical contingencies. In other words, unlike the individual, society as a whole operates in a non-hypothesis-laden fashion, and eventually figures out the right angle. The role (and the responsibility) of the scientist (and the scientific institutions) is to cut the length of this search period as short as possible by simply being smart about it, in a fashion that is not too different from how enzymes speed up chemical reactions by lowering activation energy costs. (A scientist’s biases are actually his strengths since they implicitly contain lessons from eons of evolutionary learning. See the side note below.)

Side Note: There is this huge misunderstanding that evolution progresses via chance alone. Pure randomization is a sign of zero learning. Evolution, on the other hand, learns over time and embeds this knowledge in all complexity levels, ranging all the way from genetic to cultural forms. As the evolutionary entities become more complex, the search becomes smarter and the progress becomes faster. (This is how protein synthesis and folding happen incredibly fast within cells.) Only at the very beginning, in its simplest form, does evolution try out everything blindly. (Physics is so successful because its entities are so stupid and comparatively much easier to model.) In other words, the commonly raised argument against the possibility of evolution achieving so much based on pure chance alone is correct. As mathematician Gregory Chaitin points out, “real evolution is not at all ergodic, since the space of all possible designs is much too immense for exhaustive search”.

Another area where scientists keep playing an important role is in transferring knowledge from one domain to another. Remember that there are two ways of solving hard problems: diving into the vertical (technical) depths and venturing across horizontal (analogical) spaces. Machines are horrible at venturing horizontally precisely because they do not get to the gist of things. (This was the criticism of Noam Chomsky quoted above.)

Deep learning is kind of a turbocharged version of memorization. If you can memorize all that you need to know, that’s fine. But if you need to generalize to unusual circumstances, it’s not very good. Our view is that a lot of the field is selling a single hammer as if everything around it is a nail. People are trying to take deep learning, which is a perfectly fine tool, and use it for everything, which is perfectly inappropriate.

- Gary Marcus as quoted in Warning of an AI Winter


Trends Come and Go

Generally speaking, there is always a greater appetite for digging deeper for data when there is a dearth of ideas. (Extraction becomes more expensive as you dig deeper, as in mining operations.) Hence, the current trend of data-driven science is partially due to the fact that scientists themselves have run out of sensible falsifiable hypotheses. Once the hypothesis space becomes rich again, the pendulum will inevitably swing back. (Of course, who will be doing the exploration is another question. Perhaps it will be the machines, and we will be doing the dirty work of data collection for them.)

As mentioned before, data-driven science operates stochastically in a serendipitous fashion and hypothesis-driven science operates deterministically in a directed fashion. Nature, on the other hand, loves to use both stochasticity and determinism together, since optimal dynamics reside, as usual, somewhere in the middle. (That is why there are tons of natural examples of structured randomness, such as Lévy flights.) Hence we should learn to appreciate the complementarity between data-drivenness and hypothesis-drivenness, and embrace the duality as a whole rather than trying to break it.
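For readers who have not met a Lévy flight before, here is a rough sketch of what such structured randomness looks like. It assumes a Pareto distribution as a simple heavy-tailed stand-in for the step lengths (it does not sample a true Lévy stable law), and the function name `levy_flight` and the exponent `alpha=1.5` are my own illustrative choices.

```python
import math
import random

def levy_flight(n_steps, alpha=1.5, seed=0):
    """2D walk whose step lengths follow a heavy-tailed Pareto law."""
    rng = random.Random(seed)
    x = y = 0.0
    path, lengths = [(x, y)], []
    for _ in range(n_steps):
        length = rng.paretovariate(alpha)        # structured part: power-law lengths (>= 1)
        angle = rng.uniform(0.0, 2.0 * math.pi)  # random part: uniform direction
        x += length * math.cos(angle)
        y += length * math.sin(angle)
        path.append((x, y))
        lengths.append(length)
    return path, lengths

path, lengths = levy_flight(10_000)
lengths.sort()
print("median step :", lengths[len(lengths) // 2])
print("longest step:", lengths[-1])
```

Most steps are short local probes, while a rare few are enormous jumps to entirely new regions. That mix of patient local digging and occasional long leaps is a fair picture of how the data-driven and hypothesis-driven modes complement each other.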


If you liked this post, you will also enjoy the older post Genius vs Wisdom where genius and wisdom are framed respectively as hypothesis-driven and data-driven concepts.