Seq 2 Seq Models
- Google Translate
- Auto Reply
- Auto Suggestion
Seminal paper — Ilya Sutskever
Machine Translation ==> Seq to Seq Models
The above statement means: find the output sequence y with the highest probability given the input sequence x, i.e. y* = argmax_y P(y | x). This conditional probability is very complex and there is no closed-form solution to it.
In the training pairs (xi, yi), xi is a sentence in the source language and yi is its translation in the target language.
For machine translation, training minimizes the cross-entropy loss, which is equivalent to maximizing the log-likelihood of the target sequence.
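A tiny numeric sketch of why the two objectives are the same (the probability values are made up, not from a real model):

```python
import numpy as np

# Toy decoder output: a distribution over a 4-word vocabulary
# at one time step (values are illustrative only).
probs = np.array([0.1, 0.6, 0.2, 0.1])
target = 1  # index of the correct next word

# Cross-entropy with a one-hot target reduces to the negative
# log-probability of the correct word ...
one_hot = np.eye(4)[target]
cross_entropy = -np.sum(one_hot * np.log(probs))

# ... so minimizing cross-entropy maximizes the log-likelihood.
log_likelihood = np.log(probs[target])
print(np.isclose(cross_entropy, -log_likelihood))  # True
```

Summed over all time steps and all training pairs, this gives the sequence-level training objective.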
The Seq2Seq model was used by Google Translate as the core machine-translation algorithm for a while. In Gmail it was used for auto-reply.
Open Google Chrome and select image search. Type in any description and hit Enter; images matching the description will be displayed.
Google is good at text search, so by using encoder-decoder models it has effectively converted image search into a text search.
Encoder Decoder Block Diagram
Cho et al.
The main change in this paper is that the context vector is passed as an input to every decoder LSTM cell.
A standard LSTM cell takes only one fresh input and an input from the previous time step.
In this case we have three inputs
- Fresh input
- Input from the previous time step
- Context vector
Hence a modified LSTM cell was developed that can take three inputs.
The problem with this approach was that the custom LSTM cell was not as well optimized as the standard one, leading to lower adoption in industry.
This architecture also did not produce exceptional results when compared to Ilya Sutskever's approach.
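A minimal sketch of the three-input idea, using a plain tanh RNN cell in NumPy instead of a full LSTM, with made-up dimensions; the key point is that the same encoder context vector is concatenated into the cell's input at every decoder step:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_ctx, d_h = 4, 6, 8  # made-up input/context/state sizes
W = rng.normal(size=(d_h, d_in + d_ctx + d_h)) * 0.1
b = np.zeros(d_h)

def decoder_step(x_t, h_prev, context):
    """One decoder step with three inputs: the fresh input x_t,
    the previous state h_prev, and the fixed encoder context."""
    z = np.concatenate([x_t, context, h_prev])
    return np.tanh(W @ z + b)

context = rng.normal(size=d_ctx)  # encoder summary, reused at every step
h = np.zeros(d_h)
for t in range(3):                # unroll a few decoder steps
    x_t = rng.normal(size=d_in)
    h = decoder_step(x_t, h, context)

print(h.shape)  # (8,)
```

A real LSTM variant would gate these three inputs rather than just concatenate them, but the wiring is the same.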
Image ==> Caption (Karpathy et al.)
The encoder is a CNN with the decoder being an LSTM/GRU
The final non-linear layer of the CNN (the softmax or tanh layer normally used as the last layer) is removed. It is okay to drop this layer since we are not doing any classification here.
The output of the last remaining layer of the CNN is the context vector, which is passed as input to the decoder. It serves as the input at time step t-1 to the first LSTM cell.
The final vector is the essence of the image. This is like image context vector.
This vector encapsulates the whole information that is in the image as learnt by the CNN model.
Any CNN model can be used; the more complex the CNN, the better the results will be. But this means using more compute power and more training data.
The sequence begins with a predefined START token (or all zeros) and ends with a predefined EOS token. We stop consuming the input when we see EOS, and we stop generating output when the model emits EOS.
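A sketch of the generation loop under these conventions. The `decoder_step` function and the token ids here are placeholders, not a trained model; a real decoder would run an LSTM/GRU cell and take an argmax (or sample) over the vocabulary at each step:

```python
START, EOS = 0, 1  # assumed ids for the special START/EOS tokens
MAX_LEN = 10       # safety cap on generated length

def decoder_step(token, state):
    """Placeholder for one decoder step: returns the next token
    and the next state. This toy version emits EOS after a few
    steps so the loop terminates."""
    next_state = state + 1
    next_token = EOS if next_state >= 3 else token + 2
    return next_token, next_state

def generate(context):
    """Greedy generation: start from START, stop at EOS."""
    state = context          # decoder state initialized from the encoder
    token, output = START, []
    for _ in range(MAX_LEN):
        token, state = decoder_step(token, state)
        if token == EOS:     # model signals end of sequence
            break
        output.append(token)
    return output

print(generate(0))  # [2, 4]
```

The MAX_LEN cap matters in practice: without it, a decoder that never emits EOS would generate forever.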
2. Sequence to Sequence Learning with Neural Networks (Sutskever et al., 2014)
3. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Cho et al., 2014)
4. Deep Visual-Semantic Alignments for Generating Image Descriptions (Karpathy and Fei-Fei, 2015)