Introduction
Classification is one of the central problems in machine learning. Examples include recognizing dogs and cats in pictures, or classifying a sentence as "good" or "bad". To solve this kind of problem with machine learning, we need a good dataset, an adequate model, and the right cost function. In this post we focus on the cost function. More precisely, we derive the cost function for classification tasks.
The Kullback-Leibler divergence and the cross entropy
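To derive the cost function for classification tasks, we first have to talk about the Kullback-Leibler (KL) divergence, which is given by
$$D_{KL}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right].$$
We can write this as
$$D_{KL}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log P(x) - \log Q(x)\right],$$
but we know that, for a discrete random variable and any function $g$,
$$\mathbb{E}_{x \sim P}\left[g(x)\right] = \sum_x P(x)\, g(x),$$
then we have
$$D_{KL}(P \| Q) = \sum_x P(x)\left[\log P(x) - \log Q(x)\right].$$
The KL divergence tells us how different the distributions $P$ and $Q$ (over the same random variable $x$) are. Knowing that $D_{KL}$ cannot be negative, we have a minimum when $P$ and $Q$ are the same distribution. To see that we just make $P(x) = Q(x)$ for all $x$ and we then have:
$$D_{KL}(P \| Q) = \sum_x P(x)\left[\log P(x) - \log P(x)\right] = 0.$$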
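As a quick numerical illustration of the two properties just stated (non-negativity, and a value of zero when the distributions coincide), here is a minimal sketch for discrete distributions. The function name `kl_divergence` and the two three-outcome distributions are made up for this example, and the code assumes that `q` is strictly positive wherever `p` is.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x)) for discrete distributions.

    p and q are lists of probabilities over the same outcomes. Terms with
    P(x) = 0 are skipped, following the convention 0 * log 0 = 0.
    Assumes q[x] > 0 wherever p[x] > 0.
    """
    return sum(px * (math.log(px) - math.log(qx))
               for px, qx in zip(p, q) if px > 0)

# Two hypothetical distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

kl_pq = kl_divergence(p, q)  # strictly positive, because P and Q differ
kl_pp = kl_divergence(p, p)  # zero, because every term is log P(x) - log P(x)
print(kl_pq, kl_pp)
```

Note that the KL divergence is not symmetric: `kl_divergence(q, p)` generally gives a different number than `kl_divergence(p, q)`.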
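Another quantity that is more convenient to work with is the cross entropy between the distributions $P$ and $Q$, given by
$$H(P, Q) = H(P) + D_{KL}(P \| Q),$$
where $H(P)$ is the Shannon entropy
$$H(P) = -\mathbb{E}_{x \sim P}\left[\log P(x)\right].$$
Using the form of $D_{KL}(P \| Q)$ in the cross entropy formula we have:
$$H(P, Q) = -\mathbb{E}_{x \sim P}\left[\log P(x)\right] + \mathbb{E}_{x \sim P}\left[\log P(x) - \log Q(x)\right].$$
Using the linearity of the expectation we have:
$$H(P, Q) = -\mathbb{E}_{x \sim P}\left[\log P(x)\right] + \mathbb{E}_{x \sim P}\left[\log P(x)\right] - \mathbb{E}_{x \sim P}\left[\log Q(x)\right].$$
Finally:
$$H(P, Q) = -\mathbb{E}_{x \sim P}\left[\log Q(x)\right].$$
Observing the form of $D_{KL}(P \| Q)$ and $H(P, Q)$, we see that minimizing $D_{KL}(P \| Q)$ with respect to $Q$ is the same as minimizing $H(P, Q)$ with respect to $Q$: the two differ only by $H(P)$, which does not depend on $Q$. From now on we will concentrate on the term
$$\mathbb{E}_{x \sim P}\left[\log Q(x)\right]$$
because maximizing it also means minimizing both the KL divergence and the cross entropy.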
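The relation between entropy, KL divergence and cross entropy can be checked numerically. The sketch below uses made-up discrete distributions and hypothetical helper names; it compares the cross entropy computed directly as the negative expected log of Q under P with the sum of the Shannon entropy and the KL divergence, and confirms that the same candidate Q minimizes both quantities.

```python
import math

def entropy(p):
    """Shannon entropy H(P) = -sum_x P(x) log P(x)."""
    return -sum(px * math.log(px) for px in p if px > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x))."""
    return sum(px * (math.log(px) - math.log(qx))
               for px, qx in zip(p, q) if px > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x), the negative expected log of Q under P."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical reference distribution P and three candidate distributions Q.
p = [0.5, 0.3, 0.2]
candidates = {
    "q_far": [0.1, 0.1, 0.8],
    "q_near": [0.45, 0.35, 0.2],
    "q_equal": [0.5, 0.3, 0.2],
}

for name, q in candidates.items():
    direct = cross_entropy(p, q)
    decomposed = entropy(p) + kl_divergence(p, q)
    print(f"{name}: H(P,Q)={direct:.6f}  H(P)+D_KL={decomposed:.6f}")

# H(P) is the same for every candidate, so both criteria pick the same Q.
best_by_ce = min(candidates, key=lambda n: cross_entropy(p, candidates[n]))
best_by_kl = min(candidates, key=lambda n: kl_divergence(p, candidates[n]))
print(best_by_ce, best_by_kl)
```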
The maximum likelihood estimator
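To define the maximum likelihood estimator we first assume that the distribution $Q$ is actually a family of distributions $Q(x; \theta)$ parameterized by $\theta$. The problem is then reduced to finding the $\theta$ that maximizes the quantity $\mathbb{E}_{x \sim P}\left[\log Q(x; \theta)\right]$. More formally we have
$$\theta^{*} = \arg\max_{\theta}\; \mathbb{E}_{x \sim P}\left[\log Q(x; \theta)\right].$$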
This optimization problem says that we must find the parameters $\theta$ that maximize the negative of the cross entropy. But there is a deeper problem here. To do well at this maximization we need some prior knowledge of the probability distribution $P$, so that we can make a good guess for the family of distributions $Q$. For example, if we knew that $P$ behaves like a Poisson distribution, a good guess would clearly be a parameterized Poisson distribution.
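To make the Poisson example concrete, here is a minimal sketch that picks the rate of a parameterized Poisson distribution by maximizing the average log-likelihood over a sample. The sample values, the search grid and the function names are all made up for illustration, and a plain grid search stands in for whatever optimizer one would really use.

```python
import math

def poisson_log_pmf(k, lam):
    """log of the Poisson probability of count k with rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def mean_log_likelihood(samples, lam):
    """Sample estimate of the expected log-probability of the data under the model."""
    return sum(poisson_log_pmf(k, lam) for k in samples) / len(samples)

# Hypothetical observed counts, assumed to come from a Poisson-like process.
samples = [2, 4, 3, 5, 1, 3, 4, 2, 3, 3]

# Candidate rates 0.01, 0.02, ..., 10.00 (an arbitrary grid for this sketch).
grid = [i / 100 for i in range(1, 1001)]
theta_star = max(grid, key=lambda lam: mean_log_likelihood(samples, lam))

sample_mean = sum(samples) / len(samples)
print(theta_star, sample_mean)
```

For a Poisson family the maximizer is known in closed form to be the sample mean, so the grid search should land on (or right next to) `sample_mean`; this gives an easy sanity check on the procedure.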
Now all we have to do is rewrite this problem in the machine learning context. Suppose we have a training dataset that we want to use to train our model $f(x; \theta)$. Let's call the distribution generated by this dataset $\hat{p}_{data}$. In the same way, the distribution generated by the model is denoted $p_{model}$. Suppose also that our training dataset is a collection of pairs $(x_i, y_i)$, where $x_i$ is the features and $y_i$ is the corresponding label. Because there is a conditional relation between $x_i$ and $y_i$, we write the maximum likelihood estimator in the conditional form:
$$\theta^{*} = \arg\max_{\theta}\; \mathbb{E}_{(x, y) \sim \hat{p}_{data}}\left[\log p_{model}(y \mid x; \theta)\right].$$
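Under the distribution generated by the dataset, the expectation in the conditional estimator is simply an average over the training pairs. The sketch below illustrates this with a hypothetical one-parameter model, where the probability of label 1 is a sigmoid of `theta * x`. The toy pairs, the grid and the model itself are assumptions made for this example only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_p_model(y, x, theta):
    """log p_model(y | x; theta) for the hypothetical model p(y=1 | x) = sigmoid(theta * x)."""
    p1 = sigmoid(theta * x)
    return math.log(p1) if y == 1 else math.log(1.0 - p1)

def expected_log_likelihood(pairs, theta):
    """Expectation under the empirical distribution: an average over the training pairs."""
    return sum(log_p_model(y, x, theta) for x, y in pairs) / len(pairs)

# Hypothetical training pairs (x_i, y_i); two of them are deliberately "noisy"
# so that the best theta is finite.
pairs = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)]

# Candidate parameters 0.0, 0.1, ..., 5.0 (an arbitrary grid for this sketch).
grid = [i / 10 for i in range(0, 51)]
theta_star = max(grid, key=lambda t: expected_log_likelihood(pairs, t))
print(theta_star, expected_log_likelihood(pairs, theta_star))
```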
Using the maximum likelihood estimator to obtain the cost function for classification tasks
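To get the cost function for classification tasks we first have to make a few assumptions. For simplicity, we will consider that the problem consists of classifying an object (an image, a time series, a collection of features) into one of two available classes. Second, it is reasonable to assume that the true distribution, let's call it $p_{data}$, will be something like a Bernoulli distribution. Because of that, it is reasonable to choose $p_{model}$ to be a parameterized Bernoulli distribution. Suppose now that the two available classes are $0$ and $1$. We know that the set of features $x_i$ can be in one of these two classes. Replacing the expectation over the training data by a sum over the training examples (the two differ only by a constant factor, which does not change the maximizer), we write the log likelihood as follows:
$$\theta^{*} = \arg\max_{\theta} \left[\sum_{i:\, y_i = 1} \log p_{model}(y = 1 \mid x_i; \theta) + \sum_{i:\, y_i = 0} \log p_{model}(y = 0 \mid x_i; \theta)\right].$$
From the Bernoulli distribution we know that:
$$p_{model}(y = 1 \mid x_i; \theta) = \phi_i, \qquad p_{model}(y = 0 \mid x_i; \theta) = 1 - \phi_i.$$
Replacing these relations in the maximum log likelihood we have:
$$\theta^{*} = \arg\max_{\theta} \left[\sum_{i:\, y_i = 1} \log \phi_i + \sum_{i:\, y_i = 0} \log (1 - \phi_i)\right].$$
Knowing that $\phi_i$ is the output of the neural network $f(x_i; \theta)$, we have:
$$\theta^{*} = \arg\max_{\theta} \left[\sum_{i:\, y_i = 1} \log f(x_i; \theta) + \sum_{i:\, y_i = 0} \log \left(1 - f(x_i; \theta)\right)\right].$$
Because $y_i$ is either $0$ or $1$, summing over the subsets $\{i : y_i = 1\}$ and $\{i : y_i = 0\}$ is the same as summing over all possible values of $i$ with the weights $y_i$ and $1 - y_i$, so we have:
$$\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{N} \left[y_i \log f(x_i; \theta) + (1 - y_i) \log \left(1 - f(x_i; \theta)\right)\right].$$
Finding the maximum value of this expression is the same as finding the minimum value of the cross entropy:
$$J(\theta) = -\sum_{i=1}^{N} \left[y_i \log f(x_i; \theta) + (1 - y_i) \log \left(1 - f(x_i; \theta)\right)\right].$$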
This is the well-known formula of the cost function for classification tasks. The data science community often calls it "binary cross entropy", but this is just the cross entropy between the data distribution and the Bernoulli distribution, in the same way that the mean squared error is (up to constants) the cross entropy between the data distribution and the normal distribution. Sometimes it is more convenient to divide the cross entropy by the size $N$ of the training dataset. In this case we have:
$$J(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \left[y_i \log f(x_i; \theta) + (1 - y_i) \log \left(1 - f(x_i; \theta)\right)\right].$$
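A minimal sketch of this averaged cost in plain Python might look as follows. The function names, the example outputs and the clipping constant `eps` are assumptions made for this illustration: clipping is a common practical guard against taking the log of zero, not part of the derivation. The second function evaluates the log likelihood as two sums over the label subsets, to show that it agrees with the weighted form.

```python
import math

def binary_cross_entropy(outputs, labels, eps=1e-12):
    """Average of -[y log f + (1 - y) log(1 - f)] over the dataset.

    outputs are model probabilities f(x_i; theta) in [0, 1]; labels are 0 or 1.
    Outputs are clipped to [eps, 1 - eps] so the logarithms stay finite.
    """
    total = 0.0
    for f, y in zip(outputs, labels):
        f = min(max(f, eps), 1.0 - eps)
        total += y * math.log(f) + (1 - y) * math.log(1.0 - f)
    return -total / len(labels)

def log_likelihood_by_subsets(outputs, labels):
    """Log likelihood written as one sum over y_i = 1 and one over y_i = 0."""
    positives = sum(math.log(f) for f, y in zip(outputs, labels) if y == 1)
    negatives = sum(math.log(1.0 - f) for f, y in zip(outputs, labels) if y == 0)
    return positives + negatives

# Hypothetical labels and three sets of hypothetical model outputs.
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]      # mostly right
uninformative = [0.5, 0.5, 0.5, 0.5]  # always undecided
wrong = [0.1, 0.9, 0.2, 0.8]          # mostly wrong

for name, outputs in [("confident", confident),
                      ("uninformative", uninformative),
                      ("wrong", wrong)]:
    print(name, binary_cross_entropy(outputs, labels))
```

The cost grows as the predictions move away from the labels, and an always-undecided model that outputs 0.5 everywhere gets a cost of log 2.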