Why & What is a GRU (Simplified)?
In an RNN (recurrent neural network), backpropagation runs into the vanishing gradient problem: it is quite difficult for errors associated with later timesteps to affect the computations that happen earlier in the sequence. A GRU (gated recurrent unit) modifies the RNN hidden layer so that it is much better at capturing long-range connections, and it helps a lot with the vanishing gradient problem.
Let’s understand it better with an example!
Let’s learn the GRU unit through the example sentence “The cat, which already ate, was full.”, on which we will do sequence modeling. With an RNN it can be difficult to get the network to realize that it needs to memorize whether it noticed a singular or a plural noun, so that later in the sequence it can generate either “was” or “were” depending on whether the subject was singular or plural.
Because of the vanishing gradient problem, it is quite difficult for a plain RNN to remember whether the subject of the sentence was singular or plural for long enough to generate the right verb later in the sequence.
In a GRU we use a variable C, the memory cell, and at time t the memory cell has some value C<ᵗ>. NOTE: the GRU unit will actually output an activation value a<ᵗ> which is equal to C<ᵗ>, i.e. a<ᵗ> = C<ᵗ>.
At every time step, we consider overwriting the memory cell with a new value: a candidate for replacing C<ᵗ>, which we call C̃<ᵗ> (C tilde).
We compute this candidate using tanh of w꜀ (the parameter matrix) applied to the previous value of the memory cell (which is also the previous activation value) together with the current input x<ᵗ>, plus a bias.
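Written out in the notation above (writing the bias as b꜀, and using [C<ᵗ⁻¹>, x<ᵗ>] for the previous memory-cell value stacked with the current input), the candidate is:

C̃<ᵗ> = tanh( w꜀ [ C<ᵗ⁻¹>, x<ᵗ> ] + b꜀ )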
The key idea of the GRU is that it has gates, represented as gamma (Γ); Γu stands for the update gate.
We compute Γu using the sigmoid function, therefore the value of Γu will be between 0 and 1. So far we have come up with a candidate C̃<ᵗ> that we are thinking of using to update C<ᵗ>, and the gate Γu will decide whether or not we actually perform the update.
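The gate has its own parameters; calling them wᵤ and bᵤ by analogy with w꜀ and b꜀, it is computed as:

Γu = σ( wᵤ [ C<ᵗ⁻¹>, x<ᵗ> ] + bᵤ )

where σ is the sigmoid function, which squashes its input to a value between 0 and 1.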
Now let’s see our example “The cat, which already ate, was full.” Here the memory cell C is going to be set to zero or one depending on whether the word you are considering, the subject of the sentence, is singular or plural.
Since “cat” is singular, let’s say the memory cell is set to 1 at the word “cat” (and if it were plural, maybe 0), with the update gate Γu equal to 1 at that step so the new value actually gets written.
The GRU will then memorize the value of C<ᵗ> all the way until the word “was”, where it is still equal to 1. That tells the network the subject was singular, so it chooses “was”.
The job of the gate Γu is to decide when to update this value. For example, when you see the phrase “The cat”, you know you are talking about a new concept, the subject of the sentence (“cat”), so that would be a good time to update this bit. Then, maybe once you are done using it (“the cat … was full”), you know you don’t need to memorize it anymore and can just forget it.
GRU EQUATION
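Combining the candidate and the gate defined above, the memory cell is updated at every time step as:

C<ᵗ> = Γu * C̃<ᵗ> + (1 − Γu) * C<ᵗ⁻¹>
a<ᵗ> = C<ᵗ>

When Γu = 1 the cell is overwritten with the candidate C̃<ᵗ>, and when Γu = 0 the old value is carried forward unchanged (the products are element-wise).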
To make the unit more effective, the full GRU uses another gate, Γr; ‘r’ here stands for relevance. It tells you how relevant C<ᵗ⁻¹> is for computing the next candidate C̃<ᵗ>.
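With the relevance gate, we add one more sigmoid gate (with its own parameters, written here as wᵣ and bᵣ) and the candidate equation uses Γr * C<ᵗ⁻¹> in place of C<ᵗ⁻¹>; the update of C<ᵗ> stays exactly the same as above:

Γr = σ( wᵣ [ C<ᵗ⁻¹>, x<ᵗ> ] + bᵣ )
C̃<ᵗ> = tanh( w꜀ [ Γr * C<ᵗ⁻¹>, x<ᵗ> ] + b꜀ )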
Because it is quite easy for the sigmoid to drive the gate to zero, the update gate can be essentially zero, very close to zero, up to numerical round-off. When that is the case, the update equation simply sets C<ᵗ> = C<ᵗ⁻¹>, so the GRU is very good at maintaining the value of the cell across many timesteps. And because Γu can be so close to zero, the GRU does not suffer much from the vanishing gradient problem.
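To see how all of this fits together, here is a minimal sketch of a single full-GRU step in plain NumPy. The function name gru_step, the way the parameters are packed, and the tiny sizes are just illustrative assumptions, not a reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, params):
    """One full-GRU time step: returns the new memory cell c<t> (= activation a<t>)."""
    Wc, bc, Wu, bu, Wr, br = params
    concat = np.concatenate([c_prev, x_t])              # [C<t-1>, x<t>]
    gamma_r = sigmoid(Wr @ concat + br)                  # relevance gate
    gamma_u = sigmoid(Wu @ concat + bu)                  # update gate
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x_t]) + bc)  # candidate
    # Where gamma_u is close to 0, the old cell value is carried forward unchanged,
    # which is exactly how the GRU maintains information over long ranges.
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_t

# Tiny usage example with made-up sizes: memory cell size 4, input size 3.
n_c, n_x = 4, 3
rng = np.random.default_rng(0)
params = (rng.standard_normal((n_c, n_c + n_x)), np.zeros(n_c),   # Wc, bc
          rng.standard_normal((n_c, n_c + n_x)), np.zeros(n_c),   # Wu, bu
          rng.standard_normal((n_c, n_c + n_x)), np.zeros(n_c))   # Wr, br
c = np.zeros(n_c)
for x in rng.standard_normal((5, n_x)):                 # run over a short sequence
    c = gru_step(c, x, params)
```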