Softmax Classification (Multinomial Logistic Regression)

Softmax loss is actually a bit more common than SVM loss in the context of Deep Learning.

With the Multiclass SVM loss, the classifier spits out 10 (or, in general, C = #classes) numbers as “scores” for each class (for one training example), and the Multiclass SVM only cares that the “true class” score is higher than the others by some margin Δ. There is no further interpretation assigned to these scores.

With Softmax, we endow these calculated scores with additional meaning. In particular, we’re going to use those scores to compute a probability distribution over our classes (for one training example). We use the softmax function to normalize our scores:

$$ P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}, \qquad s = f(x_i; W) $$

So we:

  • take all our scores,
  • exponentiate them so that they become positive,
  • renormalize them by the sum of those exponentiated scores.
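To make those three steps concrete, here is a minimal NumPy sketch (the max-subtraction is a standard numerical-stability trick and doesn’t change the result; the example scores are made up):

```python
import numpy as np

def softmax(scores):
    """Turn a vector of raw class scores into a probability distribution."""
    shifted = scores - np.max(scores)   # shift so the exponentials can't overflow
    exps = np.exp(shifted)              # exponentiate: every value becomes positive
    return exps / np.sum(exps)          # renormalize: values now sum to 1

probs = softmax(np.array([2.0, -1.0, 0.5]))   # made-up scores for three classes
print(probs, probs.sum())                     # each probability in [0, 1], summing to 1
```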

Now, after we send the original scores through the softmax function, we end up with this probability distribution over our classes. Each probability is between 0 and 1, and the probabilities sum to 1. So there’s that computed probability distribution implied by our scores, and we want to compare it with the target “true” probability distribution. If we know this image is a cat, then the “true” probability distribution would put all of the probability mass on cat [P(cat) = 1], and every other class probability would be 0.

So what we want to do is encourage the computed probability distribution to match the target probability distribution as closely as possible.

There are several ways to frame this comparison, such as K-L Divergence or Maximum Likelihood Estimation (MLE). But at the end of the day, we want the probability of the true class to be as close to 1 as possible.
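For example, since the target distribution p is one-hot, the K-L divergence between p and our computed distribution q collapses to a single term, which is exactly the quantity the next section turns into a loss:

$$ D_{KL}(p \,\|\, q) = \sum_k p(k) \log \frac{p(k)}{q(k)} = -\log q(y_i), \qquad \text{since } p(k) = 1 \text{ only at the true class } k = y_i $$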

Softmax Score Normalization and Loss Function

Q: How do we derive this softmax loss function L_i?

  • So we ideally want the true class probability to be close to one.
  • Log is a monotonically increasing function.
  • In practice it’s easier to maximize log than it is to maximize the raw probability, so we stick with log. [Src]
  • So if we maximize log(P(correct class)), that means we want that quantity to be high.
  • But loss functions measure “badness”, not “goodness”, so we multiply by -1 and minimize the result instead.
  • So maximizing the log likelihood for the classifier is exactly the same as minimizing the negative log likelihood (the loss) for the model, written out below.
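Putting those bullets together, the softmax loss for a single example is the negative log of the normalized probability assigned to the correct class:

$$ L_i = -\log P(Y = y_i \mid X = x_i) = -\log\!\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right) $$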

Cat Example (Again)

So we take our scores, exponentiate them (so they’re positive), then normalize them (so they sum to 1) to get our softmax probabilities. Finally, our loss is the negative log of the softmax probability of the true class.
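Here is a short numeric sketch of that pipeline; the scores and the class order (cat, car, frog, with cat as the true class) are made up for illustration:

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])     # made-up raw scores for [cat, car, frog]
true_class = 0                          # "cat" is the true label

exps = np.exp(scores - np.max(scores))  # exponentiate (shifted for numerical stability)
probs = exps / exps.sum()               # normalize so the probabilities sum to 1
loss = -np.log(probs[true_class])       # negative log of the true class's probability

print(probs)   # ~[0.13, 0.87, 0.00]
print(loss)    # ~2.04 for these made-up scores
```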

Q/A - Softmax Loss

For the car/cat/frog example…

Q: Usually at initialization W is small, so all scores s ≈ 0. What is the loss?

A: The loss is -log(1/C) = log(C). This can be used as a debugging tool (a sanity check): if your loss is not roughly log(C) at the first iteration, then something has gone wrong.
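A quick sketch of that sanity check, assuming for illustration a 3-class problem (like car/cat/frog) and a 10-class problem (like CIFAR-10):

```python
import numpy as np

# With near-zero scores, softmax assigns each of the C classes probability 1/C,
# so the expected loss at initialization is -log(1/C) = log(C).
for C in (3, 10):
    print(C, -np.log(1.0 / C))   # -> ~1.10 for C=3, ~2.30 for C=10
```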

Q: What is the min/max possible loss?

A: The min loss is 0 and the max loss is infinity. The max corresponds to putting zero probability mass on the correct class, which never actually occurs; it is only approached in the limit. (Likewise, a loss of exactly 0 would require probability 1 on the correct class, which is also only reached in the limit.)

Q: With softmax, what happens if we jiggle the scores just a little bit?

A: The loss changes. Even if you are already giving the correct class a much higher score than the rest, softmax will keep pushing the correct-class score higher and higher (towards infinity) and the incorrect scores lower and lower, so it keeps piling more and more of the probability mass onto the correct class.
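A small sketch of that behavior with made-up scores, where the correct class (index 0) already dominates: the loss is never exactly zero, a tiny jiggle of the scores still changes it, and pushing the correct score even higher keeps shrinking it:

```python
import numpy as np

def softmax_loss(scores, true_class):
    exps = np.exp(scores - np.max(scores))          # stable softmax
    return -np.log(exps[true_class] / exps.sum())   # negative log prob of the true class

print(softmax_loss(np.array([10.0, -2.0, 3.0]), 0))  # already "confident", yet loss > 0 (~9e-4)
print(softmax_loss(np.array([10.0, -1.9, 3.1]), 0))  # jiggled scores -> the loss changes slightly
print(softmax_loss(np.array([20.0, -2.0, 3.0]), 0))  # higher correct score -> loss keeps shrinking (~4e-8)
```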
