Encoder Decoder Models

Janardhanan a r

4 min readMar 5, 2021

Seq 2 Seq Models

Applications

Google Translate
Auto Reply
Auto Suggestion

Seminal paper — ILya Sutskever

Machine Translation ==> Seq to Seq Models

Image Captioning

Mathematically

The above statement means find the y which has the highest probability given the input x. This is a very complex probability and there is no closed form solution to this.

In the xi and yi pairs, xi can be in one language and yi will be in another language.

For Machine Translation the loss used can be minimizing cross-entropy or maximizing log likelihood.

minimize cross-entropy or maximize log likelihood

Practical Application

Seq2Seq Model used for machine translation by Google translate as core algorithm for a while

In gmail it was used for auto-reply.

Image-Text captioning

Open Google-chrome and select image search. Type in any description and hit enter. Images/Pictures satisfying the description will be displayed.

Google is good at text search, hence by using the encode-decoder models it has converted image search into a text search.

Encoder Decoder Block Diagram

Choe et all

The main change in this paper is that the context vector is passed as input to all the decoder LSTM cells

LSTM’s can take one fresh input and an input from the previous time step

In this case we have three inputs

Fresh input
Input from the previous time step
Context vector

Hence a new LSTM was developed that can take three inputs.

The problem with this approach was that the LSTM was not optimized leading to lesser adoption in the industry

This architecture did not produce exceptional results when compared to Ilya Sutseker’s approach.

Image ==> Caption Karapthy et all

The encoder is a CNN with the decoder being an LSTM/GRU

The last layer of a CNN basically a non-linear layer is not added. A softmax or tanh layer is not present as the last layer. It is okay not to have this layer since we are not doing any classification.

The o/p of the last layer of the CNN will be the context vector which will be passed as input to the decoder. This will be the input t-1 to the first LSTM.

The final vector is the essence of the image. This is like image context vector.

This vector encapsulates the whole information that is in the image as learnt by the CNN model.

Any CNN model can be used the more complex the CNN better will be the results. But this would mean using more compute power and more training data.

The sequence begins with the predefined word START or all zeroes and again ends with predefined word EOS. We stop consuming the input when we get EOS and stop generating output when the model generates the output EOS.

References

www.appliedaicourse.com

2. Sequence to Sequence Learning with Neural Networks

Ilya Sutskever, Oriol Vinyals, Quoc V. Le

Sequence to Sequence Learning with Neural Networks

Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks…

arxiv.org

3. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio

https://arxiv.org/abs/1406.1078

4. Deep Visual-Semantic Alignments for Generating Image Descriptions

Andrej Karpathy, Li Fei-Fei

Deep Visual-Semantic Alignments for Generating Image Descriptions

We present a model that generates natural language descriptions of images and their regions. Our approach leverages…