← Back to Blog
Machine LearningInformation TheoryOptimization

Deriving the Cost Function for Classification Tasks

From the Kullback-Leibler divergence to the binary cross entropy

Aug 11, 2022·15 min read

Introduction

Classification tasks are one of the main problems in machine learning. Examples of this kind of task are: recognize dogs and cats in pictures, classify a sentence in "good" or "bad", etc. To solve this kind of problem using machine learning, we need to have a good dataset, choose an adequate machine learning model and then choose the right cost function. In this post we will focus on the cost function. More precisely, we will derive the cost function for classification tasks.

The Kullback-Leibler divergence and the cross entropy

To derive the cost function for classification tasks, we first have to talk about the Kullback-Leibler (KL) divergence, which is given by

DKL(PQ)=ExP[logP(x)Q(x)].D_{KL}(P||Q) = \mathrm{E}_{x\sim P}\left[\log{\frac{P(x)}{Q(x)}}\right].

We can write this as

DKL(PQ)=ExP[logP(x)logQ(x)],D_{KL}(P||Q) = \mathrm{E}_{x\sim P}\left[\log{P(x)} - \log{Q(x)}\right],

but we know that

ExP[f(x)]=xP(x)f(x),\mathrm{E}_{x\sim P}\left[f(x)\right] = \sum_{x}P(x)f(x),

then we have

DKL(PQ)=x[P(x)logP(x)P(x)logQ(x)].D_{KL}(P||Q) = \sum_{x}\left[P(x)\log{P(x)} - P(x)\log{Q(x)}\right].

The KL divergence tells us how different the distributions PP and QQ (over the same random variable) are. Knowing that DKLD_{KL} cannot be negative, we have a minimum when PP and QQ are the same distributions. To see that we just make Q(x)=P(x)Q(x) = P(x) for all xx and we then have:

DKL(PP)=x[P(x)logP(x)P(x)logP(x)]D_{KL}(P||P) = \sum_{x}\left[P(x)\log{P(x)} - P(x)\log{P(x)}\right] DKL(PQ)=0.D_{KL}(P||Q) = 0.

Another quantity that is more convenient to work with is the cross entropy between the distributions PP and QQ, given by

H(P,Q)=H(P)+DKL(PQ),H(P, Q) = H(P) + D_{KL}(P||Q),

where H(P)H(P) is the Shannon entropy

H(P)=ExP[logP(x)].H(P) = -\mathrm{E}_{x\sim P}\left[\log{P(x)}\right].

Using the form of H(P)H(P) in the cross entropy formula we have:

H(P,Q)=ExP[logP(x)]+ExP[logP(x)logQ(x)].H(P, Q) = -\mathrm{E}_{x\sim P}\left[\log{P(x)}\right] + \mathrm{E}_{x\sim P}\left[\log{P(x)}-\log{Q(x)}\right].

Using the linearity of the expectation function we have:

H(P,Q)=ExP[logP(x)]+ExP[logP(x)]ExP[logQ(x)].H(P, Q) = -\mathrm{E}_{x\sim P}\left[\log{P(x)}\right] + \mathrm{E}_{x\sim P}\left[\log{P(x)}\right] - \mathrm{E}_{x\sim P}\left[\log{Q(x)}\right].

Finally:

H(P,Q)=ExP[logQ(x)].H(P, Q) = - \mathrm{E}_{x\sim P}\left[\log{Q(x)}\right].

Observing the form of DKL(PQ)D_{KL}(P||Q) and H(P,Q)H(P,Q) we see that minimizing H(P,Q)H(P,Q) with respect to QQ is the same as minimizing DKL(PQ)D_{KL}(P||Q) with respect to QQ. From now we will concentrate ourselves on the term

ExP[logQ(x)],\mathrm{E}_{x\sim P}\left[\log{Q(x)}\right],

because maximizing it also means minimizing both the KL divergence and the cross entropy.

The maximum likelihood estimator

To define the maximum likelihood estimator we first admit that the distribution QQ is actually a familly of distributions parameterized by the parameter θ\theta. The problem is the reduced to find θ\theta that maximizes the quantity ExP[logQ(x)]\mathrm{E}_{x\sim P}\left[\log{Q(x)}\right]. More formally we have

θ=argmaxθExP[logQ(x;θ)].\theta^{*} = \arg\max_{\theta}\mathrm{E}_{x\sim P}\left[\log{Q(x; \theta)}\right].

This optimization problem says that we must find the parameters θ\theta that maximize the negative of the cross entropy. But there is a deeper problem here. To have good performance on maximizing this function we need to have a prior knowledge of the probability distribution PP so we can make a good guess for the distribution QQ. For example, if we knew that the distribution PP behaves like a Poisson distribuition, a good guess would be clearly a parameterized Poisson distribution.

Now all we have to do is rewrite this problem in the machine learning context. Suppose we have a train dataset that we want to use to train our model f(x;θ)f(x; \theta). Let's call the distribution generated by this dataset PtrainP_{train}. In the same way the distribution generated by the model is denoted as PmodelP_{model}. Suppose also that our train dataset is a collection of pairs (x,y)(x, y), where xx is the features and yy is the corresponding label. Because there is a conditional relation between xx and yy, we write the maximum likelihood estimator in the conditional form:

θ=argmaxθxPdata(yx)logPmodel(yx;θ).\theta^{*} = \arg\max_{\theta}\sum_{x}P_{data}(y|x)\log{P_{model}(y|x; \theta)}.

Using the maximum likelihood estimator to obtain the cost function for classification tasks

To get the cost function for classification tasks first we have to admit a few things. For simplicity we will consider that this problem consists of classifying an object (an image, a time series, a collection of features) in one of two available classes. For the second thing, it is pretty reasonable to admit that the true distribution, let's call it PP, will be something like a Bernoulli distribution. Because of that it is ok to choose PmodelP_{model} to be a parameterized Bernoulli distribution. Suppose now that the two available classes are AA and BB. We know that the set of features xx can be in one of these two classes. Thus we write the log likelihood as follows:

θ=argmaxθ[xAPtrain(yx)logPmodel(yx;θ)+xBPtrain(yx)logPmodel(yx;θ)].\theta^{*} = \arg\max_{\theta}\left[ \sum_{x \in A}P_{train}(y|x)\log{P_{model}(y|x; \theta)} + \sum_{x \in B}P_{train}(y|x)\log{P_{model}(y|x; \theta)}\right].

From the Bernoulli distribution we know that:

Ptrain(yxA)=qP_{train}(y|x \in A) = q Ptrain(yxB)=1qP_{train}(y|x \in B) = 1 - q Pmodel(yxA)=q^P_{model}(y|x \in A) = \hat{q} Pmodel(yxB)=1q^.P_{model}(y|x \in B) = 1 - \hat{q}.

Replacing these relations in the maximumn log likelihood we have:

θ=argmaxθ[xAqlogq^+xB(1q)log(1q^)].\theta^{*} = \arg\max_{\theta}\left[ \sum_{x \in A}q\log{\hat{q}} + \sum_{x \in B}(1 - q)\log{(1 - \hat{q})}\right].

Knowing that q^\hat{q} is the output of the neural network f(x;θ)f(x; \theta), we have:

θ=argmaxθ[xAqlogf(x;θ)+xB(1q)log(1f(x;θ))].\theta^{*} = \arg\max_{\theta}\left[ \sum_{x \in A}q\log{f(x; \theta)} + \sum_{x \in B}(1 - q)\log{(1 - f(x; \theta))}\right].

Because in this case summing over the subset is the same as summing over all possible values of xx, we have:

θ=argmaxθ[xqlogf(x;θ)+(1q)log(1f(x;θ))].\theta^{*} = \arg\max_{\theta}\left[ \sum_{x}q\log{f(x; \theta)} + (1 - q)\log{(1 - f(x; \theta))}\right].

Finding the maximum value of this expression is the same as finding the minimum value of the cross entropy:

H(Ptrain,Pmodel)=[xqlogf(x;θ)+(1q)log(1f(x;θ))].H(P_{train}, P_{model}) = -\left[ \sum_{x}q\log{f(x; \theta)} + (1 - q)\log{(1 - f(x; \theta))}\right].

This is the well known formula of the cost function for classification tasks. The data science community often calls it "binary cross entropy", but this is just the cross entropy between the data distribution and the Bernoulli distribution, in the same way that the mean squared error is the cross entropy between the data distribution and the normal distribution. Sometimes is more convenient to divide the cross entropy by the size of the train dataset. In this case we have:

L=1nH(Ptrain,Pmodel)=1n[xqlogf(x;θ)+(1q)log(1f(x;θ))].\textbf{L} = \frac{1}{n} H(P_{train}, P_{model}) = -\frac{1}{n} \left[ \sum_{x}q\log{f(x; \theta)} + (1 - q)\log{(1 - f(x; \theta))}\right].