Vectorized Operations
Gradients for Vectorized Code
Now that we’ve done all of these examples in the scalar case, we’re going to look at what happens when we have vectors.
Suppose our variables x, y, and z are now vectors instead of single numbers. The entire flow stays exactly the same, except that our gradients are now Jacobian matrices: matrices containing the derivative of each element of the output (for example z) with respect to each element of the input (for example x).
f(x) = max(0, x) (elementwise)
Input and output are both 4096-dimensional vectors.
The Jacobian matrix is the matrix of partial derivatives: each row contains the partial derivatives of one dimension of the output with respect to each dimension of the input.
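Concretely, for this node the Jacobian entry in row i, column j is the derivative of output i with respect to input j (standard definition):

```latex
% Jacobian of f : R^{4096} -> R^{4096}
J \in \mathbb{R}^{4096 \times 4096},
\qquad
J_{ij} = \frac{\partial f_i}{\partial x_j}
```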
Q: What is the size of the Jacobian matrix?
A: 4096 x 4096. In practice it’s going to be even larger, because we work with mini-batches of, for example, 100 inputs at the same time; we put all of these through the node at once to be more efficient. This scales the Jacobian by a factor of 100, to 409,600 x 409,600. That is huge and completely impractical to work with.
So in practice, we don’t need to compute this huge Jacobian most of the time.
Q: What does this Jacobian Matrix look like?
A: We’re taking an elementwise maximum, so think about which dimensions of the input affect which dimensions of the output, and what structure that gives the Jacobian. Because the operation is elementwise, each element of the input only affects the corresponding element of the output, and so the Jacobian matrix is just a diagonal matrix.
So in practice we don’t have to write out and formulate this entire Jacobian. We just know the effect of x on the output, and we can use those values directly as we’re computing the gradient.
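As a sketch of why this matters in code (function and variable names are mine, not from the lecture), the backward pass for an elementwise ReLU never forms the 4096 x 4096 Jacobian; because the Jacobian is diagonal, multiplying by it collapses to an elementwise mask on the upstream gradient:

```python
import numpy as np

def relu_forward(x):
    # Elementwise f(x) = max(0, x); cache x for the backward pass.
    return np.maximum(0, x), x

def relu_backward(dout, cache):
    # The Jacobian is diagonal with entries 1 where x > 0 and 0 elsewhere,
    # so multiplying by it reduces to an elementwise 0/1 mask.
    x = cache
    return dout * (x > 0)

x = np.random.randn(4096)        # 4096-dimensional input vector
out, cache = relu_forward(x)
dout = np.random.randn(4096)     # upstream gradient, same shape as the output
dx = relu_backward(dout, cache)  # downstream gradient, also 4096-dimensional
```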
A Vectorized Example
A concrete vectorized example of a computational graph.
f(x, W) = ||W·x||^2, the squared L2 norm of W·x.
x is n-dimensional and W is n x n.
Writing out the computational graph and filling in values for these inputs: W*x is computed first, and its result is then passed into an L2 node (the squared L2 norm).
We can see that W is a [2x2] matrix and x is a 2-dimensional vector. If we label the output of the multiplication node as the intermediate variable q, we have q = W*x.
Written out elementwise (as in the image above), q = W·x = (W_{1,1}x_1 + … + W_{1,n}x_n, …, W_{n,1}x_1 + … + W_{n,n}x_n).
We have f(q) = ||q||^2 = q_1^2 + … + q_n^2.
Let’s do backprop through this graph
Gradient with respect to the L2 node: the derivative of q_i^2 is 2q_i, so the gradient of f with respect to q, flowing backward out of the L2 node, is just 2q.
The gradient with respect to a vector is always the same size as that vector, and each element of the gradient tells us how much that element of the variable affects the final output of the function.
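Written out for this L2 node (a standard chain-rule step, notation mine):

```latex
f(q) = \sum_{k=1}^{n} q_k^2
\;\Rightarrow\;
\frac{\partial f}{\partial q_i} = 2 q_i
\;\Rightarrow\;
\nabla_q f = 2q, \quad \text{which has the same shape as } q.
```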
Gradient with respect to multiplication node.
https://youtu.be/d14TUNcbn1k?list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv&t=2782
The gradient of q_i with respect to x is given by the formula shown in the video: since q_k = W_{k,1}x_1 + … + W_{k,n}x_n, we have ∂q_k/∂x_i = W_{k,i}. Chaining this with the upstream gradient 2q gives ∂f/∂x_i = Σ_k 2q_k W_{k,i}, i.e. the gradient of f with respect to x is 2·W^T·q. (The gradient with respect to W works out analogously to 2·q·x^T.)
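A minimal numpy sketch of this whole example (variable names and the specific numbers are illustrative, not necessarily the lecture’s):

```python
import numpy as np

# Example inputs: W is 2x2, x is 2-dimensional.
W = np.array([[0.1, 0.5],
              [-0.3, 0.8]])
x = np.array([0.2, 0.4])

# Forward pass
q = W.dot(x)          # intermediate node: q = W x
f = np.sum(q ** 2)    # output node: f = ||q||^2

# Backward pass
grad_q = 2.0 * q              # gradient of f with respect to q
grad_x = W.T.dot(grad_q)      # df/dx = 2 W^T q (chain rule through q = W x)
grad_W = np.outer(grad_q, x)  # df/dW = 2 q x^T

print("q =", q, "f =", f)
print("grad_x =", grad_x)
print("grad_W =", grad_W)
```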