Introduction

Classification is one of the core problems in machine learning. Typical examples are telling dogs from cats in pictures or labeling a sentence as "good" or "bad". To solve this kind of problem with machine learning, we need a good dataset, an adequate model, and the right cost function. In this post we focus on the cost function. More precisely, we derive the cost function for classification tasks.

The Kullback-Leibler divergence and the cross entropy

To derive the cost function for classification tasks, we first have to talk about the Kullback-Leibler (KL) divergence, which is given by

D_{KL}(P||Q) = \mathrm{E}_{x\sim P}\left[\log{\frac{P(x)}{Q(x)}}\right].

We can write this as

D_{KL}(P||Q) = \mathrm{E}_{x\sim P}\left[\log{P(x)} - \log{Q(x)}\right],

and, since the expectation of a function of a discrete random variable is

\mathrm{E}_{x\sim P}\left[f(x)\right] = \sum_{x}P(x)f(x),

we have

D_{KL}(P||Q) = \sum_{x}\left[P(x)\log{P(x)} - P(x)\log{Q(x)}\right].

The KL divergence measures how different the distributions $P$ and $Q$ (over the same random variable) are. Because $D_{KL}$ cannot be negative, it reaches its minimum value of zero when $P$ and $Q$ are the same distribution. To see this, we set $Q(x) = P(x)$ for all $x$, which gives:

D_{KL}(P||P) = \sum_{x}\left[P(x)\log{P(x)} - P(x)\log{P(x)}\right]

D_{KL}(P||P) = 0.
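As a quick numerical check of these two properties, here is a minimal sketch in Python with NumPy. The function name `kl_divergence` and the two example distributions are our own choices for illustration, and the sketch assumes discrete distributions with $Q(x) > 0$ wherever $P(x) > 0$.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_x P(x) * (log P(x) - log Q(x)) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Terms with P(x) = 0 contribute nothing (convention: 0 * log 0 = 0).
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

# Two hypothetical distributions over the same three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

kl_pq = kl_divergence(p, q)  # positive, because P and Q differ
kl_pp = kl_divergence(p, p)  # zero, because the two arguments coincide
print(kl_pq, kl_pp)
```

The first value is small but strictly positive, while the second is zero, as the derivation predicts.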

Another quantity that is more convenient to work with is the cross entropy between the distributions $P$ and $Q$, given by

H(P, Q) = H(P) + D_{KL}(P||Q),

where $H(P)$ is the Shannon entropy

H(P) = -\mathrm{E}_{x\sim P}\left[\log{P(x)}\right].

Using the form of $H(P)$ in the cross entropy formula we have:

H(P, Q) = -\mathrm{E}_{x\sim P}\left[\log{P(x)}\right] + \mathrm{E}_{x\sim P}\left[\log{P(x)}-\log{Q(x)}\right].

Using the linearity of expectation we have:

H(P, Q) = -\mathrm{E}_{x\sim P}\left[\log{P(x)}\right] + \mathrm{E}_{x\sim P}\left[\log{P(x)}\right] - \mathrm{E}_{x\sim P}\left[\log{Q(x)}\right].

Finally:

H(P, Q) = - \mathrm{E}_{x\sim P}\left[\log{Q(x)}\right].

Comparing the forms of $D_{KL}(P||Q)$ and $H(P,Q)$, we see that they differ only by $H(P)$, which does not depend on $Q$. Minimizing $H(P,Q)$ with respect to $Q$ is therefore the same as minimizing $D_{KL}(P||Q)$ with respect to $Q$. From now on we will concentrate on the term

\mathrm{E}_{x\sim P}\left[\log{Q(x)}\right],

because maximizing it also means minimizing both the KL divergence and the cross entropy.
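The relation between the three quantities can be checked numerically. The sketch below is only an illustration: the function names and the candidate distributions are hypothetical, and all probabilities are assumed to be strictly positive so that the logarithms are finite.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P) = -sum_x P(x) log P(x)."""
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    """Cross entropy H(P, Q) = -sum_x P(x) log Q(x)."""
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_x P(x) * (log P(x) - log Q(x))."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

p = np.array([0.5, 0.3, 0.2])
candidates = [
    np.array([0.4, 0.4, 0.2]),
    np.array([0.6, 0.2, 0.2]),
    np.array([0.5, 0.3, 0.2]),  # identical to P
]

# H(P, Q) - D_KL(P||Q) should equal H(P) for every candidate Q.
gaps = [cross_entropy(p, q) - kl_divergence(p, q) for q in candidates]

# Because the gap is constant in Q, both quantities pick the same best Q.
best_by_ce = int(np.argmin([cross_entropy(p, q) for q in candidates]))
best_by_kl = int(np.argmin([kl_divergence(p, q) for q in candidates]))
print(gaps, entropy(p), best_by_ce, best_by_kl)
```

Both criteria select the candidate that is identical to $P$, since the two quantities differ only by the constant $H(P)$.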

The maximum likelihood estimator

To define the maximum likelihood estimator, we first assume that the distribution $Q$ is actually a family of distributions parameterized by $\theta$. The problem is then reduced to finding the $\theta$ that maximizes the quantity $\mathrm{E}_{x\sim P}\left[\log{Q(x)}\right]$. More formally, we have

\theta^{*} = \arg\max_{\theta}\mathrm{E}_{x\sim P}\left[\log{Q(x; \theta)}\right].

This optimization problem says that we must find the parameters $\theta$ that maximize the negative of the cross entropy. But there is a deeper problem here. To maximize this function well, we need some prior knowledge of the probability distribution $P$, so that we can make a good guess for the family $Q$. For example, if we knew that $P$ behaves like a Poisson distribution, a parameterized Poisson distribution would clearly be a good guess.
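To make the Poisson example concrete, here is a minimal sketch that estimates $\theta$ by a simple grid search. Everything specific in it is an assumption made for illustration: the "unknown" $P$ is simulated as a Poisson distribution with rate 3, the expectation is replaced by an average over the samples, and the grid of candidate rates is arbitrary.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for draws from the unknown distribution P (assumed Poisson here).
samples = rng.poisson(lam=3.0, size=5000)

# log(k!) for each sample; it does not depend on the parameter.
log_factorial = np.array([math.lgamma(int(k) + 1) for k in samples])

def mean_log_likelihood(lam):
    """Sample estimate of E_{x~P}[log Q(x; lam)] for a Poisson family Q."""
    return float(np.mean(samples * np.log(lam) - lam - log_factorial))

# Candidate values of theta (the Poisson rate), spaced 0.01 apart.
grid = np.linspace(0.5, 6.0, 551)
scores = [mean_log_likelihood(lam) for lam in grid]
lam_star = float(grid[int(np.argmax(scores))])

print(lam_star, samples.mean())
```

For the Poisson family the maximizer is known in closed form (it is the sample mean), so the grid search should land within one grid step of `samples.mean()`. In practice a gradient-based optimizer would replace the grid.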

Now all we have to do is rewrite this problem in the machine learning context. Suppose we have a training dataset that we want to use to train our model $f(x; \theta)$. Let's call the distribution generated by this dataset $P_{train}$. In the same way, the distribution generated by the model is denoted $P_{model}$. Suppose also that our training dataset is a collection of pairs $(x, y)$, where $x$ are the features and $y$ is the corresponding label. Because there is a conditional relation between $x$ and $y$, we write the maximum likelihood estimator in the conditional form:

\theta^{*} = \arg\max_{\theta}\sum_{x}P_{train}(y|x)\log{P_{model}(y|x; \theta)}.

Using the maximum likelihood estimator to obtain the cost function for classification tasks

To get the cost function for classification tasks, we first have to make a few assumptions. First, for simplicity, we consider the problem of classifying an object (an image, a time series, a collection of features) into one of two available classes. Second, it is reasonable to assume that the true distribution, let's call it $P$, is something like a Bernoulli distribution, so it is fine to choose $P_{model}$ to be a parameterized Bernoulli distribution. Suppose now that the two available classes are $A$ and $B$. Each set of features $x$ belongs to one of these two classes. Thus we write the log likelihood as follows:

\theta^{*} = \arg\max_{\theta}\left[ \sum_{x \in A}P_{train}(y|x)\log{P_{model}(y|x; \theta)} + \sum_{x \in B}P_{train}(y|x)\log{P_{model}(y|x; \theta)}\right].

From the Bernoulli distribution we know that:

P_{train}(y|x \in A) = q

P_{train}(y|x \in B) = 1 - q

P_{model}(y|x \in A) = \hat{q}

P_{model}(y|x \in B) = 1 - \hat{q}.

Substituting these relations into the maximum log likelihood we have:

\theta^{*} = \arg\max_{\theta}\left[ \sum_{x \in A}q\log{\hat{q}} + \sum_{x \in B}(1 - q)\log{(1 - \hat{q})}\right].

Knowing that $\hat{q}$ is the output of the neural network $f(x; \theta)$, we have:

\theta^{*} = \arg\max_{\theta}\left[ \sum_{x \in A}q\log{f(x; \theta)} + \sum_{x \in B}(1 - q)\log{(1 - f(x; \theta))}\right].

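If we take $q = 1$ for $x \in A$ and $q = 0$ for $x \in B$, the first term vanishes on $B$ and the second on $A$. In this case summing over each subset is the same as summing over all possible values of $x$, so we have: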

\theta^{*} = \arg\max_{\theta}\left[ \sum_{x}q\log{f(x; \theta)} + (1 - q)\log{(1 - f(x; \theta))}\right].

Finding the maximum value of this expression is the same as finding the minimum value of the cross entropy:

H(P_{train}, P_{model}) = -\left[ \sum_{x}q\log{f(x; \theta)} + (1 - q)\log{(1 - f(x; \theta))}\right].

This is the well-known formula of the cost function for classification tasks. The data science community often calls it "binary cross entropy", but it is just the cross entropy between the data distribution and the Bernoulli distribution, in the same way that the mean squared error is the cross entropy between the data distribution and the normal distribution. Sometimes it is more convenient to divide the cross entropy by the size $n$ of the training dataset. In this case we have:

\textbf{L} = \frac{1}{n} H(P_{train}, P_{model}) = -\frac{1}{n} \left[ \sum_{x}q\log{f(x; \theta)} + (1 - q)\log{(1 - f(x; \theta))}\right].
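The averaged cost function can be sketched in a few lines of Python with NumPy. This is an illustration under stated assumptions, not a reference implementation: we take $q$ to be a 0/1 label (1 for class $A$, 0 for class $B$), the predictions stand in for the model outputs $f(x; \theta)$, and the `eps` clipping is our own addition to avoid evaluating $\log 0$.

```python
import numpy as np

def binary_cross_entropy(q, q_hat, eps=1e-12):
    """L = -(1/n) * sum_x [q log(q_hat) + (1 - q) log(1 - q_hat)]."""
    q = np.asarray(q, dtype=float)
    # Clip predictions away from 0 and 1 so that the logarithms stay finite.
    q_hat = np.clip(np.asarray(q_hat, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(q * np.log(q_hat) + (1.0 - q) * np.log(1.0 - q_hat)))

# Hypothetical labels: 1 for class A, 0 for class B.
labels = np.array([1, 0, 1, 1, 0])
# Hypothetical model outputs f(x; theta), read as the probability of class A.
good_preds = np.array([0.9, 0.1, 0.8, 0.7, 0.2])
bad_preds = 1.0 - good_preds

loss_good = binary_cross_entropy(labels, good_preds)
loss_bad = binary_cross_entropy(labels, bad_preds)
print(loss_good, loss_bad)
```

Predictions that agree with the labels give a small loss, while the flipped predictions give a much larger one, which is the behavior we want from a cost function to be minimized with respect to $\theta$.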

Deriving the Cost Function for Classification Tasks
