Backpropagation

How does backpropagation work?

Backprop: A simple example

In this example, we want to find the gradients of the output of the function f(x, y, z) = (x + y) * z with respect to each of its input variables.

The first step is to take our function and represent it as a computational graph as shown on the top right. Then we perform a forward pass through the network/computational graph.

Here we want to give every intermediate variable a name. In this example we will call the sum q = (x + y) and the multiplication f = q * z. We will also compute the gradients/partial derivatives for each intermediate variable (dq/dx = 1, dq/dy = 1, df/dq = z, df/dz = q) as shown below.

So what we want to find are the gradients of f with respect to x, y, and z.

What backprop is, is a recursive application of the chain rule: we start at the end of the computational graph, then work backwards and compute all our gradients along the way.

In the above image, we are calculating the gradient of f with respect to f, df/df = 1.

In this image, let’s analyze the gradients that have been calculated for each variable.

  • df/df:
    • As we talked about previously, df/df = 1
  • df/dq:
    • As shown by the equation for df/dq in the blue box, df/dq = z, which is shown to have a value of z=-4. Thus df/dq = -4
  • df/dz:
    • Again, as shown by the equation for df/dz in the blue box, df/dz = q, which is shown to have a value of q = 3. Thus df/dz = 3
  • df/dx:
    • Here, we apply the chain rule to calculate the total effect of x on f. This can be analogized as the total effect of q on f multiplied by the total effect of x on q.
    • This behavior can be observed by looking at the computational graph and seeing the effect of variable x on q, which affects f.
    • Since df/dx = df/dq * dq/dx, as df/dq = -4 and dq/dx = 1, the value of df/dx = -4
  • df/dy:
    • Just as we calculated df/dx using the chain rule, we do the same here.
    • df/dy = df/dq * dq/dy = -4 * 1. Thus df/dy = -4
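
To make these numbers concrete, here is a minimal sketch of this forward and backward pass in Python. The individual inputs x = -2 and y = 5 are assumptions (the notes only give q = 3 and z = -4); the gradients come out to the values derived above.

```python
# Forward and backward pass for f(x, y, z) = (x + y) * z.
# x = -2 and y = 5 are assumed example inputs; the notes only state q = x + y = 3 and z = -4.
x, y, z = -2.0, 5.0, -4.0

# Forward pass: compute the intermediate q and the output f
q = x + y                  # q = 3
f = q * z                  # f = -12

# Backward pass: recursively apply the chain rule from the output back to the inputs
df_df = 1.0                # gradient of f with respect to itself
df_dq = z * df_df          # df/dq = z      -> -4
df_dz = q * df_df          # df/dz = q      ->  3
df_dx = df_dq * 1.0        # dq/dx = 1, so df/dx = df/dq * dq/dx -> -4
df_dy = df_dq * 1.0        # dq/dy = 1, so df/dy = df/dq * dq/dy -> -4

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```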

Backprop in a single neuron

So each node gets some local inputs coming in, and the output it produces is passed directly on to the next node.

We also have local gradients that we compute: the gradients of the node's immediate output with respect to its incoming inputs (i.e., the partial derivatives of the local output with respect to the local inputs).

The beauty of backprop in a neural network is that each neuron only sees and knows these local inputs and outputs.

So what happens during backprop is that we start from the back of the graph (the output layer) and work our way to the beginning (the input layer). When we reach each node, we already know the upstream gradient coming back: the gradient of the final loss with respect to the immediate output of that node.

So by the time we've reached this node in backprop, we have already computed the gradient of our final loss L with respect to this node's output z. This is formalized as dL/dz.

So now, what we want to find next are the gradients just before this node: the gradients of our final loss L with respect to the inputs x and y. For x, the gradient of L with respect to x is the gradient of L with respect to the output z, multiplied by the gradient of z with respect to x. Formalized, dL/dx = dL/dz * dz/dx.

Q: Does this approach only work with instantaneous values, the current values of the function? Or will this work with general function expressions as well? (At each node, can we pass in function expressions, or do we only have to work with the immediate values of that function?)

A: It works given the current values of the function that we plug in, but we can also write an expression for this gradient still in terms of variables. dL/dz is some expression, dz/dx is some expression, and thus dL/dx is also an expression. The numerical value comes from the fact that we plug in the current numbers at the time of computation in order to get the value of the gradient with respect to x. The process is exactly the same for finding the gradient with respect to y.

The main thing to take away from this is that at each node we just need the local gradient of that node, which we compute, and during backprop we receive numerical values of the gradients coming from upstream. We just take the upstream gradient, multiply it by the local gradient, and that is what we send back to the connected nodes behind us - without having to care about anything else besides these immediate surroundings.
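
To make the "local gradient times upstream gradient" idea concrete, here is a minimal sketch of a single multiply gate as it might appear in a toy autograd setup. The class and method names (MultiplyGate, forward, backward) are illustrative assumptions, not a specific library's API.

```python
# A single gate only sees its local inputs and output.
class MultiplyGate:
    def forward(self, x, y):
        # Cache the local inputs; they are needed for the local gradients later.
        self.x, self.y = x, y
        return x * y

    def backward(self, dL_dz):
        # Chain rule: upstream gradient times local gradient.
        dL_dx = dL_dz * self.y   # dz/dx = y
        dL_dy = dL_dz * self.x   # dz/dy = x
        return dL_dx, dL_dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)          # forward pass: z = -12
dL_dx, dL_dy = gate.backward(1.0)    # backward pass with upstream gradient dL/dz = 1
print(z, dL_dx, dL_dy)               # -12.0 -4.0 3.0
```

During backprop, the gate never needs to know where dL/dz came from or where dL/dx and dL/dy go next.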

Backprop: A more complex example

In this example we are doing backprop on a computational graph that represents the function f(w,x) = 1/(1 + e^-(w0x0 + w1x1 + w2)) (the function shown in the picture). In this graph, we can see the computed values for each operation of the function given input values for w and x.

Representing the function as a computational graph

Computing Gradients for Each Step of the Computational Graph

Here we have written down at the bottom the expressions for some of the derivatives of the functions used in this computational graph.

When we start with backprop, we start at the very end of the graph. The gradient of the output with respect to the output is again just 1: dZ/dZ = 1.

Moving backwards, we ask: what's the gradient of the output with respect to the input just before the 1/x node? If we refer back to the single-neuron example and call the input to this node x, then the gradient we want is dZ/dx. Our local gradient is df/dx, our upstream gradient is dZ/df, and dZ/df * df/dx = dZ/dx, which is what we're trying to find. In this case, we know the upstream gradient is 1, and the local gradient for f(x) = 1/x follows the expression df/dx = -1/x^2. Thus the total gradient before this node is (-1/x^2)(1.00). Plugging in 1.37 for x, we get dZ/dx = -0.53.

Next we can effectively skip over the +1 node. Since that node's function is just a constant plus x, our table tells us that df/dx = 1. Calculating dZ/dx = dZ/df * df/dx, we get dZ/dx = (1)(-0.53), which is again -0.53. Thus the gradient just before the +1 node is the same as the gradient just after it.

Moving on to the exp node, we know that the upstream gradient is -0.53. The expression for the local gradient of e^x is e^x. Thus the backprop gradient for this node is dZ/dx = (e^-1.00)*(-0.53) = -0.20.

For the -1 node, the upstream gradient is -0.20 and the local gradient is -1. Thus the backprop gradient for this node is (-1)(-0.20), so dZ/dx = 0.20.

At this first branching addition node we now branch off into two separate paths: one leads to w2 and the other leads to another addition node. We know our upstream gradient is 0.20, and the local gradient for addition is still 1. Thus, for each branch, the backprop gradient is dZ/dx = (0.20)*(1) = 0.20.

As we move on to the second branching addition node, we do this again. Our upstream gradient here is 0.20, our local gradients are still 1. Thus our two backprop gradients are 0.20.

For our final step, we calculate the gradients for the multiplication node leading up to w0 and x0. Our upstream gradient is 0.20, our local gradient for w0 is the value of x0, and our local gradient for x0 is the value of w0. The gradient dZ/dx0 = (2)(0.2) = 0.4 and the gradient dZ/dw0 = (-1)(0.2) = -0.2.

Using the same procedure at the other multiplication node, the local gradient for x1 is the value of w1 and the local gradient for w1 is the value of x1, so dZ/dx1 = (-3)(0.2) = -0.6 and dZ/dw1 = (-2)(0.2) = -0.4.
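
Putting the whole backward pass together, here is a minimal sketch of the same computation in Python. The inputs w0 = 2, x0 = -1, w1 = -3, x1 = -2 come from the worked example; w2 = -3 is inferred so that the intermediate values (e^-1.00, 1/1.37) match the numbers above.

```python
import math

# Inputs from the worked example; w2 = -3 is inferred from the intermediate values.
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass, one node at a time
a = w0 * x0              # -2.0
b = w1 * x1              #  6.0
s1 = a + b               #  4.0  (second branching addition node)
s2 = s1 + w2             #  1.0  (first branching addition node)
d = -s2                  # -1.0  (the -1 node)
e = math.exp(d)          #  0.37 (the exp node)
f = e + 1.0              #  1.37 (the +1 node)
z = 1.0 / f              #  0.73 (the 1/x node)

# Backward pass: at every node, upstream gradient times local gradient
dz = 1.0
df = dz * (-1.0 / f**2)           # -0.53
de = df * 1.0                     # -0.53 (the +1 node passes the gradient through)
dd = de * math.exp(d)             # -0.20
ds2 = dd * -1.0                   #  0.20
ds1, dw2 = ds2 * 1.0, ds2 * 1.0   #  0.20, 0.20 (addition routes the gradient to both branches)
da, db = ds1 * 1.0, ds1 * 1.0     #  0.20, 0.20
dw0, dx0 = da * x0, da * w0       # -0.20, 0.40
dw1, dx1 = db * x1, db * w1       # -0.40, -0.60
```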

Grouping Computational Nodes into Function Blocks in a Computational Graph

When we're creating these computational graphs, we can define the computational nodes at any granularity that we want. In the above case we broke the function down into the simplest operations that we could. However, we could group certain nodes together into more complex nodes, as long as we can write down/derive the local gradient for that node. As an example, we could group some of the nodes in the example above to represent a sigmoid function, as shown below.

We can compute the gradient of the sigmoid function σ(x) = 1/(1 + e^-x) and get a nice expression for it, dσ/dx = (1 - σ(x)) * σ(x), so we can represent all of the computational nodes in this block as just one sigmoid node.
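
As a sketch, a grouped sigmoid node could then look like the following; the names are illustrative, and the backward pass uses the compact local gradient (1 - σ(x)) * σ(x) instead of stepping through the four elementary nodes.

```python
import math

# Illustrative grouped sigmoid node (not a specific library's API).
class SigmoidGate:
    def forward(self, x):
        self.s = 1.0 / (1.0 + math.exp(-x))   # sigma(x)
        return self.s

    def backward(self, upstream):
        # Compact local gradient: dsigma/dx = (1 - sigma(x)) * sigma(x)
        return upstream * (1.0 - self.s) * self.s

gate = SigmoidGate()
out = gate.forward(1.0)      # ~0.73, the same value produced by the four-node block above
grad = gate.backward(1.0)    # ~0.20, the same gradient that flows into that block
```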

Patterns in Backward Flow

https://youtu.be/d14TUNcbn1k?t=1881

Q: What is a max gate?

A: A gradient router - only the branch whose input had the highest value during the forward pass receives the upstream gradient during backprop. All other branches into this gate get a gradient of 0.

Q: What is a multiplication gate?

A: A gradient switcher - the gradient for one branch is the upstream gradient scaled by the value of the other branch.
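
A minimal sketch of both behaviors, with made-up helper functions purely for illustration:

```python
# Max gate: routes the upstream gradient to whichever input was larger in the forward pass.
def max_backward(x, y, upstream):
    return (upstream, 0.0) if x > y else (0.0, upstream)

# Multiply gate: each input's gradient is the upstream gradient scaled by the other input.
def mul_backward(x, y, upstream):
    return (upstream * y, upstream * x)

print(max_backward(2.0, 5.0, 0.3))   # (0.0, 0.3): only the larger input gets the gradient
print(mul_backward(3.0, -4.0, 1.0))  # (-4.0, 3.0): gradients are switched/scaled
```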

When one node is connected to multiple nodes, the gradients add up at this node. At these branches, we use the multivariate chain rule: we take the different gradients flowing back from the upstream nodes and add them all together. This sum is the total upstream gradient that we use to calculate the gradient for this current node.

Q: So we haven't yet even updated our weights? We've only computed gradients.

A: So now that we've calculated the gradients, we can take a step in the direction of the negative gradient in order to update our weight parameters. We take the entire framework of what we learned during optimization, and then we apply what we learned here about calculating gradients.
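
For example, a single vanilla gradient descent update using the gradients from the worked example above might look like this sketch (the learning rate is an arbitrary illustrative value, and w2 = -3 is the inferred weight from earlier):

```python
# One vanilla gradient descent step using the backprop gradients computed above.
learning_rate = 0.01                 # illustrative value, not from the notes
w0, w1, w2 = 2.0, -3.0, -3.0         # current weights
dw0, dw1, dw2 = -0.20, -0.40, 0.20   # gradients of the output with respect to each weight

# Step in the direction of the negative gradient to decrease the output.
w0 -= learning_rate * dw0
w1 -= learning_rate * dw1
w2 -= learning_rate * dw2
print(w0, w1, w2)                    # 2.002 -2.996 -3.002
```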

df/dx = Σ_i (df/dq_i * dq_i/dx)

So this is saying that if x is connected to multiple different elements (in this case, the different intermediate variables q_i), then the chain rule takes the effect of each of these intermediate variables on our final output f, compounds it with the local effect of our variable x on that intermediate variable, and sums the contributions.
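
As a small sketch of this gradient accumulation: if x feeds two branches, say q1 = 2x and q2 = x^2 (functions made up purely for illustration), and f = q1 + q2, then df/dx is the sum of the contributions from each branch.

```python
# Gradient accumulation when x feeds multiple branches.
# Toy example (made up for illustration): q1 = 2*x, q2 = x**2, f = q1 + q2.
x = 3.0
q1, q2 = 2.0 * x, x ** 2
f = q1 + q2

# Backward pass: f is a sum, so df/dq1 = df/dq2 = 1.
df_dq1, df_dq2 = 1.0, 1.0

# Each branch contributes upstream gradient * local gradient, and the
# contributions to x are added together.
df_dx = df_dq1 * 2.0 + df_dq2 * (2.0 * x)   # 2 + 6 = 8
print(df_dx)                                 # matches d/dx (2x + x^2) = 2 + 2x = 8 at x = 3
```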
