
Gated RNN-based Architectures: Language Generation Becomes Easier

What we will see in this article:
  • Intro to RNN.
  • RNN limitations.
  • The LSTM discovery.
The area of NLG has seen a greater deal of experimentation than any other area in NLP, whether for Machine Translation, Speech Recognition or Utterance Generation for Dialogue Systems. The spark stayed hidden until Mikolov blew it up in 2010 with his INTERSPEECH paper on the use of RNNs for Language Modelling to improve speech recognition. His subsequent contributions on word embeddings then made DEEP architectures permanent actors of the state of the art (SOTA) for Language Processing Applications in 2013.

Subsequently, people have proposed several configurations of the architecture and its functionality as per the need. RNN now comes up every now and then with a new design and flavour to attract and astonish the customers (oops, programmers actually). For example, RNN did not seem to perform well for language modelling/generation due to its single-layer (hidden as usual) architecture (no flavour added). It was actually suffering from a great disease, the vanishing gradient, which limits its power to capture long-distance relations. Doctors (researchers actually) with years of experience proposed a cure by adding three more pills (gates actually) to the architecture. Thus RNN with this newly added flavour became LSTM, a great Sequence Modeller with added gates to forget, keep and output the latent state inside the architecture.

Source: Understanding LSTM Networks
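To make the forget/keep/output "pills" concrete, here is a minimal NumPy sketch of a single LSTM cell's forward step, following the standard formulation described in Understanding LSTM Networks. The weight shapes and names are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM forward step (illustrative sketch).

    x_t    : input vector at time t, shape (input_dim,)
    h_prev : previous hidden state,  shape (hidden_dim,)
    c_prev : previous cell state,    shape (hidden_dim,)
    W, b   : dicts of weights (hidden_dim, input_dim + hidden_dim) and biases (hidden_dim,)
    """
    z = np.concatenate([h_prev, x_t])        # shared input to every gate
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate: what to erase from c_prev
    i = sigmoid(W["i"] @ z + b["i"])          # input gate: what new info to keep
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate cell content
    c = f * c_prev + i * c_tilde              # updated cell state
    o = sigmoid(W["o"] @ z + b["o"])          # output gate: what to expose
    h = o * np.tanh(c)                        # new hidden state
    return h, c

# Tiny usage example with random (hypothetical) weights.
input_dim, hidden_dim = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(0, 0.1, (hidden_dim, input_dim + hidden_dim)) for k in "fico"}
b = {k: np.zeros(hidden_dim) for k in "fico"}
h = c = np.zeros(hidden_dim)
for x_t in rng.normal(0, 1, (5, input_dim)):  # a 5-step toy sequence
    h, c = lstm_step(x_t, h, c, W, b)
print("final hidden state:", h)
```

The three sigmoid gates (f, i, o) are exactly the "pills" mentioned above: they decide what to forget, what to keep, and what to output at every step.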

Let's look at the problem (disease) of the vanishing gradient a bit closer. It requires a real mathematical understanding of the sigmoid function (a common activation function), derivatives (calculus, not biology), the gradient and its back-propagation. Look at the following examples:

  • We, travelling to India, ........... have seen many historical places.
  • He, travelling to India, ........... has seen many historical places.
If you look at the two examples above, both contain words in a long-distance relation. In Example 1, We and have are dependent. Similarly, in Example 2, He is related to has. In both examples, the dependent words are far apart (long-distance). RNN is incapable of learning these long-distance dependencies because of the decay of information through time (called the vanishing gradient). Why does the gradient lose its power through time? The answer is: with long sequences of data, the RNN hidden layer goes through a long series of updates (back-propagation of the gradient), and after enough steps the gradient approaches zero and no further learning is possible. This is because the activation function (i.e. the sigmoid) squishes a large input space into a small output range between 0 and 1: large changes in the input cause only small changes in the output, so its derivative is small, and multiplying many such small derivatives together drives the gradient towards zero. LSTM was introduced in 1997 and later refined by many researchers.

Source: Understanding LSTM Networks
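To see the squashing effect described above concretely, here is a minimal NumPy sketch (my own illustration, not from the references) that multiplies the sigmoid's derivative, which never exceeds 0.25, once per time step, the same kind of product that back-propagation through time builds up. The recurrent weight matrices are deliberately ignored to keep the illustration simple.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value is 0.25, at x = 0

# Hypothetical pre-activations a hidden unit might see over 50 time steps.
rng = np.random.default_rng(0)
pre_activations = rng.normal(0.0, 1.0, size=50)

# Back-propagation through time multiplies one such factor per step.
grad = 1.0
for t, a in enumerate(pre_activations, start=1):
    grad *= sigmoid_grad(a)
    if t in (5, 10, 25, 50):
        print(f"after {t:2d} steps: gradient factor ~ {grad:.3e}")
```

After a few dozen steps the product is vanishingly small, which is exactly why the plain RNN stops learning relations between distant words.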

How has the LSTM sorted out the problem of the vanishing gradient? It is answered by a series of illustrative steps. In simple words, a distant relation in a sequence, no matter how distant it is, has to be learned and relearned in order to model the context more accurately. Clueless? Look at this example:

  • Soni went to her room and slept. But, Rahul kept writing ..his...

Try to observe a case of gender dependencies. her is dependent on Soni. But in the next statement the subject has changed to Rahul, so the gender-detection machinery of the LSTM should also be updated. Before updating the detector, the cell-state vector has to be refreshed (forgotten) conditioned on the given input. This entire process was lacking in the simple RNN, limiting its capability to effectively establish relations in a sequence. LSTM is made to own this behaviour naturally.
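Here is a toy numerical sketch of that forget-then-update behaviour (my own illustration, with made-up numbers): suppose one cell-state dimension stores a "subject is feminine" feature. When the new subject Rahul arrives, a forget-gate value near 0 erases the old value before the input gate writes in the new one.

```python
import numpy as np

c_prev = np.array([0.9])    # cell state: "subject is feminine" learned from Soni
f = np.array([0.05])        # forget gate fires low when a new subject appears
i = np.array([0.95])        # input gate lets the new information in
c_tilde = np.array([-0.8])  # candidate content: "subject is masculine" (Rahul)

c_new = f * c_prev + i * c_tilde
print(c_new)                # ~ [-0.715]: the old gender feature is mostly forgotten
```

A plain RNN has no such gate, so old and new evidence get blended together and the dependency on the current subject is lost.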

References

  1. A deep dive into the world of gated Recurrent Neural Networks: LSTM and GRU blog
  2. Understanding LSTM Networks blog
  3. Understanding Gated Recurrent Neural Networks blog
  4. The Vanishing Gradient Problem blog
  5. Vanishing Gradients in Recurrent Neural Networks blog
  6. How to Develop a Word-Level Neural Language Model and Use it to Generate Text blog
  7. What is the role of “forget gates” in Long Short-Term Memory (LSTM)? Quora
  8. Recurrent neural network based language model INTERSPEECH 2010
  9. Efficient Estimation of Word Representations in Vector Space arXiv