“Hey, why not use Machine Learning?”

I often get this suggestion from managers or other colleagues:

Hey, why not use machine learning?

The first time I heard this suggestion, I replied right away, “For what? Why?”, which obviously upset a lot of people. Gradually, I realized that many people who may not fully understand my work offer ideas based on what they hear or see elsewhere.

So naturally, since machine learning and AI are so hot, they wonder, “Why can’t we use machine learning algorithms for our DNA sequencing work?”

First of all, I am not using machine learning in any of my work. My current toolbox consists purely of statistical models for estimation and detection.

Second, I am not planning to replace the existing models with machine learning models in the near future either, because my work is not a good fit for machine learning. But that is just what I think.

Ironically, in this culture, I am in the awkward position of having to defend myself. I have to prove that:

Statistical methods work better!

To begin with, what is the difference between statistics and machine learning?

There have been a lot of comparisons between the two approaches, like the famous paper “Statistical Modeling: The Two Cultures”. Basically, statistical modeling is more about “inference”: it tries to understand and model the underlying data-generating process. Machine learning is more about prediction: it treats the data-generating process as unknown and uses existing observations to predict future events.

A summary of the major differences between statistics and machine learning is listed in the following table, excerpted from “Statistics for Machine Learning”.

[Table image: machine learning vs. statistics]

Now, to the question of why I much prefer statistical models. First of all, in our work the sequence generation process is well known and can be modeled well. Though some weird cases show up, the deviations can still be understood quite well. That paves the way for statistical modeling.
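
As an illustration only (this is not the model used in the work described here), the idea of modeling a known generation process can be sketched with a toy example. It assumes a hypothetical setting where each read at a site is wrong independently with a known error rate, so the number of mismatching reads follows a binomial distribution. The function names, the 1% error rate, and the significance threshold are all assumptions made for the sketch.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_real_signal(mismatches: int, depth: int,
                   error_rate: float = 0.01, alpha: float = 1e-3) -> bool:
    """Flag a site when the mismatch count is unlikely under the error model.

    error_rate and alpha are hypothetical values chosen for illustration.
    """
    p_value = binomial_tail(mismatches, depth, error_rate)
    return p_value < alpha


# One mismatch in 100 reads is easily explained by sequencing error;
# ten mismatches in 100 reads is not.
print(is_real_signal(1, 100))
print(is_real_signal(10, 100))
```

Because the error process is written down explicitly, every decision comes with a probability that can be inspected, which is the kind of consistency the statistical approach offers.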

While machine learning may achieve the same level of accuracy with much less modeling effort, it cannot guarantee consistency, and it is hard to deal with deviations.

Though I don’t want to go to the extreme of saying that machine learning is definitely not suitable, I don’t believe it will beat my current models.

However, statistical models are not almighty. One headache is their lack of flexibility. Every data point that cannot be modeled has to be studied and handled again. That consumes almost 90% of my work time. Thus, I am considering bringing in machine learning algorithms for those cases.

All of the above are the logical reasons. The truth, which I will never mention, is that our data organization is a total mess…