The minimum required by a data scientist



September 27, 2016

by Francesco Gadaleta

Produced by: worldofpiggy.com


In this episode I want to point out the minimum required of a Data Scientist in terms of knowledge and technical skills. I will also explain why I think the job of the data scientist could disappear sooner than we think, and what data scientists should do in order to survive.

So you are a data scientist, right? Data scientists have the sexiest job in the US and Europe, and soon in Asia too, if not already. But how many real Data Scientists are out there? There are a lot of statisticians who have rebranded themselves as Data Scientists, as well as many applied mathematicians and a lot of bioinformaticians.

In this episode I want to point out the minimum required or… well expected from a Data Scientist in terms of knowledge and technical skills.

Technical skills

Very briefly, a data scientist should master concepts like clustering, decision trees, regression, neural networks, principal component analysis, singular value decomposition, naive Bayes classifiers, etc. If you would like to hear an episode dedicated to a specific machine learning method, feel free to leave your request in the comments at worldofpiggy.com or on iTunes.

This knowledge can be automated very easily, which makes Data Scientist the most vulnerable job, not the sexiest at all. That’s why there are some skills that a Data Scientist must have if he/she doesn’t want to become a useless resource in any company that makes use of predictive analytics.

Knowledge of algorithm complexity is mandatory. As data increases in size, a quadratic algorithm becomes too slow to be feasible, so linear or log-linear algorithms should be considered instead. This usually means keeping things simple: fancy algorithms just cannot be applied at that scale. But at least the problem can be approached.

For example, some standard complexity classes:

  1. P: the complexity class of decision problems that can be solved on a deterministic Turing machine in polynomial time.
  2. NP: the complexity class of decision problems that can be solved on a non-deterministic Turing machine in polynomial time.
  3. ZPP: the complexity class of decision problems that can be solved with zero error on a probabilistic Turing machine in expected polynomial time.
  4. RP: the complexity class of decision problems that can be solved with 1-sided error on a probabilistic Turing machine in polynomial time.
  5. BPP: the complexity class of decision problems that can be solved with 2-sided error on a probabilistic Turing machine in polynomial time.
  6. BQP: the complexity class of decision problems that can be solved with 2-sided error on a quantum Turing machine in polynomial time.
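
To make the quadratic versus log-linear point concrete, here is a minimal Python sketch. The task (checking a list for duplicate values) and all function names are illustrative assumptions, not something discussed in the episode. Both functions give the same answer, but the first compares every pair of values while the second sorts once and scans neighbours.

```python
from typing import Sequence


def has_duplicates_quadratic(values: Sequence[int]) -> bool:
    """Compare every pair of values: O(n^2) comparisons in the worst case."""
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            if values[i] == values[j]:
                return True
    return False


def has_duplicates_loglinear(values: Sequence[int]) -> bool:
    """Sort first (O(n log n)), then compare each value with its neighbour (O(n))."""
    ordered = sorted(values)
    return any(a == b for a, b in zip(ordered, ordered[1:]))


def pair_comparisons(n: int) -> int:
    """Worst-case number of comparisons made by the quadratic version."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    sample = [3, 1, 4, 1, 5]
    print(has_duplicates_quadratic(sample), has_duplicates_loglinear(sample))
    # With one million values the quadratic version may need this many comparisons:
    print(pair_comparisons(1_000_000))  # 499999500000
```

With a few hundred values the difference is invisible. With a million values the quadratic version may need roughly half a trillion comparisons, which is the point at which a simpler algorithm with better complexity becomes the only practical option.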

Coding

Programming skills in at least Python, R, and UNIX shell scripting are essential. This is a much-debated point: some say that data scientists need not be great developers, others say that they should be. I’m going to explain what I think and, more importantly, why I think that data scientists should be quite advanced programmers and software engineers.

First of all

Analytics applied to small datasets is called statistics (essentially what we saw some years ago, with cohorts of a few hundred observations and/or survival analysis). Real-time, streaming and big data analytics require more than pure statistics, and it is when things get big that optimization and elegance in coding really make the difference. Hence I think that coding is a great asset for the Data Scientist of the future.
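
As a rough illustration of where coding starts to matter, consider computing a mean and a variance. On a small cohort you load everything into memory and apply the textbook formula. On a stream you cannot keep the data, so you need a one-pass update. The sketch below uses Welford's online algorithm; the class name `RunningStats` and the sample values are assumptions made for this example, not something from the episode.

```python
import statistics
from typing import Iterable


class RunningStats:
    """One-pass mean and sample variance (Welford's online algorithm)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance (n - 1 denominator); 0.0 with fewer than two values."""
        if self.count < 2:
            return 0.0
        return self._m2 / (self.count - 1)


def consume(stream: Iterable[float]) -> RunningStats:
    """Process a stream one value at a time, never storing the values."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
    return stats


if __name__ == "__main__":
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    streamed = consume(iter(data))
    # The batch result needs the whole list in memory; the streamed one does not.
    print(streamed.mean, statistics.mean(data))
    print(streamed.variance, statistics.variance(data))
```

Memory use stays constant no matter how many values arrive, which is the kind of engineering concern that does not show up when the whole dataset fits in a spreadsheet.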

Secondly

There is inflation in data science, just as there was in academia. Everybody has a PhD today, when a master’s was more than enough 10 years ago. Data scientists will be expected to know more and more as things get automated. If you were great at random forests, now that random forests are applied automatically you should also offer data collection and data cleaning skills, and when all of these are automated too, infrastructure allocation skills as well. Whoever offers more wins the battle.

In a previous episode we mentioned that the future of data science will not be played around deep learning or any other fancy technology, but around data collection. Knowing what to collect is extremely important, yet most self-proclaimed data scientists decide to collect as much as they can, because - you know what? - we can deal with big data. Well, no! Collecting the right data not only avoids allocating resources that might be useless, but also helps simple algorithms perform way better than the fancy stuff that few people know about.

I would like to launch a provocation here. Data Scientist is the sexiest job today, but not forever. Soon Data Scientists will be completely automated.

Would you like a tip?

Listen to this episode again and focus on the human aspect of the Data Scientist. This will help you keep your job for a while, at least until you decide to retire.