# The Mathematics Behind Naive Bayes Classifiers

More often than not, we try too hard when training machine learning models and reach straight for state-of-the-art deep learning. Training such complex models on small data sets is very likely to lead to overfitting; deep learning was never intended for a data set of 100 input points. Keeping Occam’s razor in mind, in this article we will take a step back and appreciate the power of simpler statistical models. Don’t let the apparent simplicity fool you: the craft of coming up with statistical models to solve machine learning problems given a small data set is usually what separates good data scientists from the rest. Almost anyone can achieve high accuracy scores given a big enough data set.

## The Theory Behind the Naive Bayes Classifier

The Naive Bayes classifier is a probabilistic classifier based on Bayes’ Theorem, with the assumption that each feature makes an independent and equal contribution to the outcome. To illustrate the independence assumption, suppose we roll a die *n* times and want the probability of getting *n* 3’s in a row. That probability is (1/6)ⁿ, because one roll does not affect the probability of any subsequent roll. For the equal contribution assumption, consider an example from natural language processing. Let’s say that you are classifying SMS messages as spam or not spam; each word in our vector of words constitutes a feature. Given equal contribution, each word has the same “weight” in determining whether an SMS message is classified as spam or not spam.
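The dice example can be checked with a few lines of Python. This is only a minimal sketch: the function name `prob_all_threes` is made up for illustration, and the die is assumed to be fair.

```python
from fractions import Fraction


def prob_all_threes(n: int) -> Fraction:
    """Probability of rolling a 3 on each of n independent rolls of a fair die."""
    p_three = Fraction(1, 6)  # probability of a 3 on a single roll
    # Independence lets us multiply the per-roll probabilities: (1/6)^n.
    return p_three ** n


print(prob_all_threes(3))
```

Exact fractions are used here instead of floats so that the result can be compared directly with (1/6)ⁿ.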

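Bayes’ Theorem is mathematically formulated as follows, where *y* is the class and *X* = (*x₁*, …, *xₙ*) is the vector of features:

$$
P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)}
$$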

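In simple terms, our aim is to find the probability that an SMS is spam or not, given that it contains a set of words (the input data). The open form of the equation above is:

$$
P(y \mid x_1, \dots, x_n) = \frac{P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_n \mid y)\, P(y)}{P(x_1, \dots, x_n)}
$$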

What we did above is known as the joint probability model: we multiply the probability that word *x₁* appears given that the SMS is spam by the corresponding probability for word *x₂*, and so on. This, of course, is possible thanks to the independence assumption we made earlier. When calculating the probability, the denominator is of little importance; we can regard it as a constant, since it does not depend on *y*.
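To make the product of per-word probabilities concrete, here is a minimal sketch in Python. All the numbers in `prior` and `likelihood` are hypothetical values chosen by hand for illustration; they are not estimated from any real data set.

```python
from math import prod

# Hypothetical, hand-picked probabilities (assumptions for illustration only).
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.30, "prize": 0.20, "meeting": 0.01},
    "ham": {"free": 0.05, "prize": 0.01, "meeting": 0.20},
}


def unnormalized_posterior(words, label):
    """Numerator of Bayes' theorem: P(y) times the product of P(x_i | y)."""
    return prior[label] * prod(likelihood[label][w] for w in words)


def posterior(words):
    """Divide each numerator by the evidence, which is the same for every class."""
    scores = {label: unnormalized_posterior(words, label) for label in prior}
    evidence = sum(scores.values())  # P(X) by the law of total probability
    return {label: score / evidence for label, score in scores.items()}


message = ["free", "prize"]
scores = {label: unnormalized_posterior(message, label) for label in prior}
probs = posterior(message)
print(scores)
print(probs)
```

Because the evidence is the same for every class, dividing by it rescales the scores without changing which class comes out on top.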

The classifier model can be expressed as follows:

$$
\hat{y} = \arg\max_{y \in \{0, 1\}} P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

Here we are simply looking for the class in {0, 1} that maximizes the probability of that class multiplied by the probabilities of the features given that class.
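The decision rule can be sketched in plain Python. This is a toy illustration, not a production implementation: the training messages are invented, and the function names are hypothetical. It also makes two choices the article does not discuss. It sums log probabilities instead of multiplying raw probabilities, which gives the same argmax while avoiding numerical underflow, and it uses add-one (Laplace) smoothing so that a word unseen in one class does not force that class's score to zero.

```python
from collections import Counter
from math import log

# Hypothetical toy training set: (words in the message, label); 1 = spam, 0 = not spam.
training = [
    (["free", "prize", "now"], 1),
    (["free", "cash"], 1),
    (["meeting", "at", "noon"], 0),
    (["lunch", "at", "noon"], 0),
    (["free", "lunch"], 0),
]

vocabulary = {w for words, _ in training for w in words}
class_counts = Counter(label for _, label in training)
word_counts = {label: Counter() for label in class_counts}
for words, label in training:
    word_counts[label].update(words)


def log_prior(label):
    """log P(y), estimated as the fraction of training messages with this label."""
    return log(class_counts[label] / len(training))


def log_likelihood(word, label):
    """log P(x_i | y) with add-one smoothing (an assumption, not from the article)."""
    total = sum(word_counts[label].values())
    return log((word_counts[label][word] + 1) / (total + len(vocabulary)))


def predict(words):
    """Return the class y that maximizes log P(y) + sum_i log P(x_i | y)."""
    known = [w for w in words if w in vocabulary]  # ignore out-of-vocabulary words
    return max(
        class_counts,
        key=lambda y: log_prior(y) + sum(log_likelihood(w, y) for w in known),
    )


print(predict(["free", "prize"]))
print(predict(["meeting", "at", "noon"]))
```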

In the upcoming article, we will take a real-life example from NLP and apply Naive Bayes using the scikit-learn library.

## Conclusion

When confronted with any machine learning problem, it’s probably a good approach to try simpler algorithms first and move on to more complex ones depending on your initial results. Don’t forget that even the most complex deep learning libraries have their roots in traditional statistical and machine learning models.