Where we are
- Scores: The vector of class scores s = Wx, the weights applied to the input data.
- Li : The loss on a single example after one forward pass, penalizing incorrect classifications. Can be the SVM loss or the softmax loss.
- SVM Loss: Prefers that the true class score be greater than all other scores (by a margin): Li = Σ_{j≠yi} max(0, sj − syi + 1)
- Softmax Loss: The negative log probability of the true class after the scores are exponentiated and normalized: Li = −log( e^{syi} / Σj e^{sj} )
- L : The total loss, the average of the Li plus a regularization term that coerces the model towards a simpler solution: L = (1/N) Σi Li + λ R(W)
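The two per-example losses above can be sketched in a few lines of NumPy. This is a minimal illustration for a single example (the toy `W`, `x`, and label are made up for the demo, not from the notes):

```python
import numpy as np

def svm_loss(scores, y):
    # Multiclass SVM (hinge) loss: L_i = sum_{j != y} max(0, s_j - s_y + 1)
    margins = np.maximum(0.0, scores - scores[y] + 1.0)
    margins[y] = 0.0            # the true class contributes no margin term
    return margins.sum()

def softmax_loss(scores, y):
    # Softmax (cross-entropy) loss: L_i = -log( e^{s_y} / sum_j e^{s_j} )
    shifted = scores - scores.max()   # shift for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[y]

# Toy example: scores s = Wx for one input
W = np.array([[0.2, -0.5],
              [0.1,  2.0]])
x = np.array([1.0, 2.0])
scores = W @ x                  # [-0.8, 4.1]
print(svm_loss(scores, y=1))    # true score wins by > 1, so hinge loss is 0
print(softmax_loss(scores, y=1))
```

Note the max-shift in `softmax_loss`: exponentiating large scores overflows, and subtracting the maximum score leaves the loss unchanged.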
We want to minimize L with gradient descent, which requires the gradient of L with respect to the weights.
At every timestep we evaluate the gradient of the losses using either the:
- Numerical Gradient
- Slow :(, approximate :(, easy to write :)
- Analytical Gradient
- Fast :), exact :), error-prone :(
- In practice: derive the analytical gradient, then verify it against the numerical gradient (a gradient check).
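A gradient check can be sketched with a centered-difference approximation. Here it verifies a hand-derived analytic gradient on a simple function (the function f(x) = Σ x² and its gradient 2x are an illustrative stand-in for a real loss):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # Centered difference: df/dx_i ≈ (f(x + h*e_i) - f(x - h*e_i)) / (2h)
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = f(x)
        x.flat[i] = old - h
        f_minus = f(x)
        x.flat[i] = old            # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2.0 * h)
    return grad

# Check the analytic gradient of f(x) = sum(x^2), which is 2x
x = np.array([1.0, -2.0, 3.0])
num = numerical_gradient(lambda v: np.sum(v ** 2), x)
ana = 2.0 * x
rel_error = np.abs(num - ana).max() / np.abs(ana).max()
print(rel_error)   # tiny if the analytic gradient is correct
```

This loops over every parameter and calls f twice per parameter, which is exactly why the numerical gradient is slow: fine for spot-checking, far too expensive for training.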
So how do we compute the analytical gradient for arbitrarily complex functions?