Multiclass SVM Loss (Deep Dive)
http://cs231n.github.io/linear-classify/#svm
There are several ways to define the details of the loss function. As a first example we will develop a commonly used loss called the Multiclass Support Vector Machine (SVM) loss. The SVM loss is set up so that the SVM “wants” the correct class for each image to have a score higher than the incorrect classes by some fixed margin Δ.
The fixed margin Δ is a hyperparameter.
Computing the loss for one training example:
Q: Why add +1 to the SVM loss function?
A: The SVM loss does not care about the actual magnitudes of the scores; all it cares about is that the correct class’s score is greater than the other scores by some margin. The exact value of that margin is in some sense arbitrary, because the weights can scale the score differences up or down, so setting Δ = +1 is a safe default that works well in practice.
The SVM loss function wants the score of the correct class yi to be larger than the incorrect class scores by at least Δ (delta). If this is not the case, we accumulate loss.
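The per-example loss described above can be sketched in numpy as follows. This is a minimal sketch: the function name is mine, and the score values are the familiar cat/car/frog numbers from the cs231n example.

```python
import numpy as np

def svm_loss_one_example(scores, y, delta=1.0):
    """Multiclass SVM (hinge) loss for a single example.

    scores: 1-D array of class scores; y: index of the correct class.
    """
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0  # the correct class contributes no loss to the sum
    return margins.sum()

# cat = 3.2 (correct class), car = 5.1, frog = -1.7
scores = np.array([3.2, 5.1, -1.7])
print(svm_loss_one_example(scores, y=0))
# max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1) = 2.9 + 0 = 2.9
```

Note that the frog score is so far below the cat score that its margin term is clipped to zero by the max: once the margin is satisfied, the loss does not care how much it is satisfied by.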
Averaging the losses for all training examples for the Total Loss:
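Averaging the per-example losses over the whole training set can be done in one vectorized pass; here is a hedged sketch (the function name and shapes are my assumptions, not from the original notes):

```python
import numpy as np

def total_svm_loss(W, X, y, delta=1.0):
    """Mean multiclass hinge loss over a batch.

    X: (N, D) data, W: (D, C) weights, y: (N,) correct class indices.
    """
    n = X.shape[0]
    scores = X.dot(W)                     # (N, C) class scores
    correct = scores[np.arange(n), y]     # score of the true class, per example
    margins = np.maximum(0, scores - correct[:, None] + delta)
    margins[np.arange(n), y] = 0          # skip j == y_i
    return margins.sum(axis=1).mean()     # average the per-example losses
```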
Hinge Loss
A last piece of terminology we’ll mention before we finish with this section: the threshold-at-zero function max(0, −) is often called the hinge loss.
You’ll sometimes hear about people instead using the squared hinge loss SVM (or L2-SVM), which uses the form max(0, −)² and penalizes violated margins more strongly (quadratically instead of linearly). The unsquared version is more standard, but on some datasets the squared hinge loss can work better. This can be determined during cross-validation.
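To make the distinction concrete, here is a small sketch (helper names are mine) showing how the same margin violation is penalized linearly by the hinge loss and quadratically by the squared hinge loss:

```python
import numpy as np

def hinge(scores, y, delta=1.0):
    m = np.maximum(0, scores - scores[y] + delta)
    m[y] = 0
    return m.sum()            # linear penalty on violated margins

def squared_hinge(scores, y, delta=1.0):
    m = np.maximum(0, scores - scores[y] + delta)
    m[y] = 0
    return (m ** 2).sum()     # quadratic penalty on violated margins

scores = np.array([1.0, 4.0, -2.0])  # class 1 violates the margin by 4
print(hinge(scores, 0))          # 4.0
print(squared_hinge(scores, 0))  # 16.0
```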
Q/A - Multiclass SVM Loss
For the car/cat/frog example…
Q: What happens if we change the car scores just a little bit?
A: The loss should not change if the scores are jiggled a little bit, as long as the correct class’s score is already higher than the other scores by at least the margin of 1.
Q: What is the min/max possible loss?
A: The min loss is 0. The max loss is unbounded (infinite). If you think back to the hinge loss plot, you can see that as the correct class’s score goes infinitely negative, the loss grows without bound.
Q: If all of your scores are so small that they are approximately 0, what kind of loss would you expect?
A: You would expect a loss of approximately C − 1, where C is the number of classes. If you look at the equation for the Multiclass SVM loss, each incorrect class contributes max(0, 0 − 0 + 1) = 1, and summing this over the C − 1 incorrect classes gives a loss of approximately C − 1. This is actually useful for debugging: during the first iteration of training, the weight matrix is initialized with small random values, resulting in a score vector of small, roughly uniform values. Thus, if the loss at the first iteration is not close to C − 1, it could be an indicator of a bug in your code.
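This sanity check can be sketched directly; the initialization scale and shapes below are assumptions for illustration:

```python
import numpy as np

# With near-zero scores, each of the C - 1 incorrect classes contributes
# roughly max(0, 0 - 0 + 1) = 1, so the initial loss should be about C - 1.
np.random.seed(0)
C, N, D = 10, 5, 20
W = 0.0001 * np.random.randn(D, C)   # small random initialization
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)

scores = X.dot(W)                    # all scores are tiny
correct = scores[np.arange(N), y]
margins = np.maximum(0, scores - correct[:, None] + 1.0)
margins[np.arange(N), y] = 0
loss = margins.sum(axis=1).mean()
print(loss)  # close to 9.0 (= C - 1) for C = 10
```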
Q: What happens if you sum over all classes (including the correct class j = y_i)?
A: The total loss increases by 1, because the loss term for the correct class equals 1.
If you include j = y_i (the score of the correct class), you are adding to the total loss a term of max(0, s_yi − s_yi + 1) = 1.
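A quick numerical check of this, using the cat/car/frog scores again (variable names are mine):

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])
y = 0  # cat is the correct class
margins = np.maximum(0, scores - scores[y] + 1.0)
loss_excluding = margins.sum() - margins[y]  # standard loss: skip j == y_i
loss_including = margins.sum()               # sum over all classes instead
print(loss_including - loss_excluding)       # max(0, s_y - s_y + 1) = 1.0
```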
Q: What if we used a mean instead of a sum?
A: Nothing would change overall, because taking the mean just rescales the entire loss function by a constant.
Q: What if we squared the loss function? Would it remain the same problem or become a different classification problem?
A: It would be different. Think back to the graph of the hinge loss: it is a straight line because the formula is linear. If you squared the function, the graph would become a curve. Squaring the loss can be used in situations where the penalty for mistakes should grow more than linearly. With a squared loss, things that are very, very bad are now quadratically bad, whereas with a hinge loss the difference between being a lot wrong and a little less wrong is the same as the difference between being a bit wrong and a little less wrong.
Q: If we find a weight matrix W which achieves zero loss, we can scale it by any factor and still achieve zero loss. How do we choose which version of W to use if all of them achieve 0 loss?
A: So far, we have only told our classifier to find a W that fits the training data. But in practice we don’t care that much about fitting the training data; the whole point is to use the training data to find a classifier that performs well on test data. So how do we find a weight matrix W that generalizes well to the test data?
Enter REGULARIZATION:
Regularization is a way to encourage the model to pick a simpler W, where the notion of “simple” depends on the task and the model. This follows “Occam’s Razor”: given several competing hypotheses, you should generally prefer the simpler one, because the simpler explanation is more likely to generalize well in the future.
The way we operationalize this intuition in machine learning is through an explicit regularization penalty R(W), added to the loss and weighted by a hyperparameter λ.
As we discussed when covering hyperparameter tuning and cross-validation in the last lecture, this regularization strength λ will be one of the more important hyperparameters to tune when training models in practice.
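Putting the pieces together, the full objective is the data loss plus λ times a penalty R(W). Here is a minimal sketch using the common L2 penalty (the function name is mine, and the choice of L2 is one option among several):

```python
import numpy as np

def full_loss(W, X, y, lam, delta=1.0):
    """Mean multiclass hinge loss plus an L2 regularization penalty lam * R(W)."""
    n = X.shape[0]
    scores = X.dot(W)
    correct = scores[np.arange(n), y]
    margins = np.maximum(0, scores - correct[:, None] + delta)
    margins[np.arange(n), y] = 0
    data_loss = margins.sum(axis=1).mean()
    reg_loss = lam * np.sum(W * W)  # L2 penalty: prefers smaller, more diffuse weights
    return data_loss + reg_loss
```

With this term in place, scaling W up to inflate the margins now costs something, which is one way the regularizer breaks the tie between the many zero-data-loss solutions; λ itself is typically chosen by cross-validation.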
Q: In the image above, what’s the connection between λ and forcing the blue squiggly curve to better fit the data as a green line?
A: In this case, the model may have access to polynomials of very high degree, but through the regularization term you can encourage the model to prefer polynomials of lower degree if they fit the data adequately. It’s a soft penalty that says: if you want to use a more complex model, you must overcome this specified penalty.
Another example of regularization in action is finding a proper tradeoff between the number of features you use (model complexity) and accuracy.