What is Data Science?

Data science is a term that conveniently conveys a collection of methods for dealing with data. However, it is a confusing term in that data science is not necessarily a science (perhaps only rarely so), although it should form a portion of the analytical toolbox of scientists and others who interpret data. As more and more data are collected from an expanding number of sources, careers in data science are being promoted, along with the promise that these new data scientists will rapidly expand the solutions to societal problems. The potential for many new discoveries is real. However, data science is not a shortcut around established mathematics and statistics.

The techniques encompassed by data science fall into four categories: obtaining data, coding and wrangling data into a format that is useful for analysis, interpreting the trends in the data, and presenting the data and trends. These procedures have roots in experimental design, algebra, statistics, and calculus. Data science can be learned independently of these subjects; however, it cannot be practiced well without them. What separates trend analysis from data dredging is an understanding of variation and probability from statistics, together with an a priori hypothesis. A randomly generated set of points can be used as vertices to draw a new picture. However, finding the face of Elvis in black mold in your refrigerator does not make fungus divine.
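To make the data dredging point concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything from a real data set: the function names `pearson` and `dredge`, the sample size of 20, the 200 candidate predictors, and the cutoff of 0.444, which is roughly the two-sided 5% critical value of the correlation coefficient for 20 points. Every series is pure noise, so any "trend" it finds is the statistical equivalent of Elvis in the mold.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def dredge(n_points=20, n_predictors=200, seed=1):
    """Correlate one random outcome against many unrelated random predictors.

    All values are independent Gaussian noise, so no predictor has any
    real relationship with the outcome. Returns the list of correlations.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_points)]
    correlations = []
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_points)]
        correlations.append(pearson(predictor, outcome))
    return correlations


# Assumed cutoff: approximate two-sided 5% critical |r| for 20 points.
CRITICAL_R = 0.444

correlations = dredge()
strongest = max(correlations, key=abs)
hits = [r for r in correlations if abs(r) > CRITICAL_R]

print(f"strongest correlation found in pure noise: {strongest:.2f}")
print(f"predictors passing the 5% cutoff: {len(hits)} of {len(correlations)}")
```

With no hypothesis stated in advance, about one predictor in twenty should clear a 5% cutoff by chance alone, and the strongest of them can look persuasive. Testing a single predictor chosen before looking at the data is what keeps that error rate at its nominal value.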

Science is a process of seeking truth by posing alternative hypotheses to explain a given phenomenon and then attempting to disprove them one at a time. It never involves proving a hypothesis. Repeat that one with me: science never involves proving a hypothesis. Each alternative hypothesis for the phenomenon should be testable and falsifiable. If you are not testing falsifiable hypotheses, you are not doing science. In seeking to falsify hypotheses, data science can be a powerful analytical toolbox that complements more traditional statistics very well. However, I am not aware of any hypotheses within data science per se being tested. This is why data science is not actually a science. Buying a microscope does not make you a scientist either, but many scientists do make use of them. Are there phenomena inherent to data that could be tested? It is unclear to me how such phenomena could be separated from the subject of the data, but if there are, and they form the subject of a research program, I will then be forced to admit that data science is a science. In the meantime, it is a very useful set of skills that I wish I had had earlier in life, say in my undergraduate years.

