Stochastic Gradient Descent (Vanilla)

Computing the loss over a training set with a very large number of examples N (e.g., 100 million examples) would be very expensive, because we would have to sum/average over every single example to find the true gradient.

Thus we employ stochastic gradient descent, which samples small sets of training examples called minibatches at every iteration. We then use this minibatch to compute an estimate of the full sum, i.e., the true gradient.

This is considered stochastic because you can view the minibatch gradient as a Monte Carlo estimate of an expectation whose true value is the full gradient.
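As a rough sketch of that view (notation assumed here: $L_i$ is the per-example loss, $W$ the parameters, $B$ a minibatch drawn uniformly at random), the minibatch gradient is an unbiased estimate of the full gradient:

$$\nabla_W L(W) = \frac{1}{N}\sum_{i=1}^{N} \nabla_W L_i(W) \;\approx\; \frac{1}{|B|}\sum_{i \in B} \nabla_W L_i(W)$$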

While true: sample a random minibatch of data, evaluate the loss/gradient on the minibatch, then update your parameters based on this estimate of the gradient.
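A minimal sketch of that loop in Python, assuming a hypothetical `loss_and_gradient` function that evaluates the loss and its gradient on a batch of data:

```python
import numpy as np

def sgd(weights, data, labels, loss_and_gradient,
        batch_size=256, learning_rate=1e-3, num_iterations=1000):
    num_examples = data.shape[0]
    for _ in range(num_iterations):
        # Sample a random minibatch of training examples.
        idx = np.random.choice(num_examples, batch_size, replace=False)
        data_batch, label_batch = data[idx], labels[idx]

        # Evaluate loss and gradient on the minibatch only;
        # this is a noisy estimate of the full-dataset gradient.
        loss, grad = loss_and_gradient(weights, data_batch, label_batch)

        # Vanilla update rule: step against the gradient.
        weights -= learning_rate * grad
    return weights
```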

This is vanilla SGD, but we can employ slightly fancier update rules, which we will look at in the future, that can improve the optimization process.

results matching ""

    No results matching ""