Big Data. Everybody knows what it is. Except that nobody actually does.

Big Data is a big thing indeed. The way companies and research institutes approach data analysis has changed compared with a few years ago, and that change is reshaping industry, marketing and clinical research, just to name a few.

The very first need of companies is to switch from batch systems to algorithmic systems, that is, from jobs run at regular intervals to real-time updates. If some years ago a marketing strategy consisted of analysing the spreadsheet of purchases once a week, today this is done every minute or even more frequently. Moreover, the number of purchases or, more generally, the number of rows in the hypothetical database table has reached the millions, if not more.
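To make the batch versus real-time distinction concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the `RunningStats` class, the `batch_mean` function and the purchase amounts are made up for the example and do not come from any particular system. The batch job recomputes the average from the whole table every time it runs, while the real-time version updates a small summary once per incoming purchase.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Count, total and mean of purchase amounts, maintained incrementally."""
    count: int = 0
    total: float = 0.0

    def update(self, amount: float) -> None:
        # Constant-time work per incoming purchase; no rescan of old rows.
        self.count += 1
        self.total += amount

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def batch_mean(amounts):
    """Batch style: recompute from the full table each time the job runs."""
    amounts = list(amounts)
    return sum(amounts) / len(amounts) if amounts else 0.0


# Hypothetical purchase amounts, invented for this example.
purchases = [12.0, 8.0, 10.0, 30.0]

stats = RunningStats()
for amount in purchases:
    stats.update(amount)  # the summary is up to date after every purchase

# Both approaches agree on the answer; they differ in when the work is done.
assert stats.mean == batch_mean(purchases)
```

The point of the design is that the incremental version never needs the full table in memory, which is what makes per-minute (or per-event) updates affordable once the table has millions of rows.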

Predictive analytics needs to be performed at a much finer resolution than in the weekly-spreadsheet scenario. What average-sized companies mainly need is to develop their own algorithmic capabilities, on custom infrastructure, working with data that are specific to their business. Ignoring these requirements and approaching the problem in the traditional way means being locked out of this very clear wave of growth.

According to a recent CDW report, the most challenging tasks in the Big Data revolution are, from most to least challenging:

  1. Combining different kinds of data from different sources
  2. Managing volumes of data effectively
  3. Interoperability between technologies
  4. Defining the data that needs to be collected

They all seem trivial, and yet these are the hardest problems everybody is dealing with. Everything changes in the realm of Big Data analytics.

A common approach, from industry to research and from selling books to clinical experiments, consists of detecting patterns through data analysis. This must be fast and done within an experimental context. There is not much time for theories which, quite often, fail because of dimensionality, variability of the data, volatility of conditions and uncertainty in whatever model has been chosen.

If chasing the best imaginable solution used to be seen as a nice academic problem worth dealing with, today it can leave the project on standby, on your boss's desk if not in his or her trash bin.
Perfectionists never deliver. That’s the problem.

There are, however, certain traps that most big data analysts on the market easily fall prey to. Here is a short, non-exhaustive list that might help, to some extent, reduce the chances of failure.

Point 1

As with all trends, a lot of people want to join the revolution, sometimes with very little idea of what type of expertise it requires. One example I want to stress is confusing data science with statistics or, worse, with machine learning and artificial intelligence. Sometimes it really seems that many people cannot tell those terms apart.

Point 2

Big data offers more opportunities for analysis than traditional data. This can turn out to be an issue, because it makes the business objective less clear. Buying a toothpaste can, in theory, be an easy task. But this is not necessarily the case when there are 150 different brands on the shelf, with different qualities, designed for 100 different types of teeth, flavors, mouths, and the like. The number of questions that people would like to ask when they have big data at their disposal can be enormous. Sometimes many of these questions are just ridiculous, regardless of their feasibility.

Point 3

Ignorance leads to failure, which leads to wasted resources and money. Big data is usually associated with big infrastructure, big scientists, big profits and big (by which I mean complex) algorithms. Not true. There is no one-to-one relation between these terms. What a big data project needs first is design: design of the objectives to be achieved, design of the algorithms to be considered (which should be simple enough to run in decent time), and design of the infrastructure that will run those algorithms. Ignoring this very simple scheme usually leads to poor (too complex) algorithms, which in turn waste resources. New infrastructure never reduces computational complexity; extra hardware only pays off when the cost of the algorithm grows linearly or polynomially with the size of the problem. Remember that.
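A back-of-the-envelope sketch of that last claim, in Python. Everything here is an assumption made for illustration: the operation budget, the 100x hardware speed-up and the three cost functions are invented, and `largest_n` is a hypothetical helper, not part of any library. It simply asks how large a problem fits in a fixed budget of operations before and after buying faster hardware.

```python
def largest_n(cost, budget):
    """Largest n >= 1 with cost(n) <= budget (0 if even n = 1 is too costly).

    Assumes cost is increasing in n. Doubles n to bracket the answer,
    then narrows the bracket with a binary search.
    """
    if cost(1) > budget:
        return 0
    lo = 1
    while cost(lo * 2) <= budget:
        lo *= 2
    hi = lo * 2  # invariant: cost(lo) <= budget < cost(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if cost(mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


# Illustrative cost models: number of operations for a problem of size n.
costs = {
    "linear": lambda n: n,
    "quadratic": lambda n: n ** 2,
    "exponential": lambda n: 2 ** n,
}

budget = 10 ** 6   # assumed number of operations affordable today
speedup = 100      # assumed gain from new infrastructure

before = {name: largest_n(cost, budget) for name, cost in costs.items()}
after = {name: largest_n(cost, budget * speedup) for name, cost in costs.items()}

for name in costs:
    print(f"{name:12s} n before = {before[name]:>11,d}   n after = {after[name]:>11,d}")
```

Under these assumptions, hardware that is 100 times faster handles a linear problem 100 times larger and a quadratic problem 10 times larger, but the exponential problem only grows from size 19 to size 26. The algorithm, not the cluster, decides what is feasible.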

Point 4

Working with statisticians does not necessarily mean that they can help in the big data realm. Usually the opposite is true. These people might know a lot about distributions and hypothesis testing. (Un)fortunately, Big data analytics is rarely about that. It’s important not to confuse big data with statistics. An immediate issue that comes from mixing statisticians with data scientists is lack of communication. While one talks about p-values and multiple testing correction, the other talks about approximations, Monte Carlo, and the need to rethink the problem because the hardware is exploding.
Statistics is a good starting point, but all the rest is done with learning from data, visualization skills, coding capabilities, algorithm design, knowledge of the infrastructure and knowledge of software frameworks like Hadoop and Spark.

Did you really think that knowing (non-)parametric statistics was all it takes?

About Piggy

I am Piggy and I spend my life reading about math and of course eating. I love science and I support my flatmate who provides me problems to solve and, well, food.

2 Responses to Big Data. Everybody knows what it is. Except that nobody actually does.

  1. JB says:

    s/shampoo/toothpaste/ ;)
