The Illustrated Transformer – Jay Alammar

The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word.
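As a rough sketch of what that layer works with: each word's embedding is multiplied by three learned weight matrices to produce a query, a key, and a value vector for that word. The dimensions and random matrices below are illustrative stand-ins, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 512, 64               # embedding size and query/key/value size (illustrative)
x = rng.normal(size=(4, d_model))    # embeddings for a 4-word input sentence

# Learned projection matrices (random stand-ins here, purely for shape).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# Each word's embedding becomes a query, a key, and a value vector.
Q, K, V = x @ W_q, x @ W_k, x @ W_v
print(Q.shape, K.shape, V.shape)     # (4, 64) each
```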

Don’t be fooled by me throwing around the word “self-attention” like it’s a concept everyone should be familiar with.

Clearly the word at its own position will have the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to the current word.
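A minimal sketch of that scoring step, with random query/key/value vectors standing in for the learned ones: each word's query is dotted with every key, the scores are scaled by sqrt(d_k) and softmaxed, and the resulting weights mix the value vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_k = 4, 64                 # sentence length and vector size (illustrative)

# Random stand-ins for the query, key, and value vectors of a 4-word sentence.
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

# Score every position against every other, scale, then softmax each row.
scores = Q @ K.T / np.sqrt(d_k)                                # (4, 4)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each output is a weighted average of the value vectors; a word usually
# weights its own position most, but can also attend to other relevant words.
output = weights @ V                                           # (4, 64)
print(weights[0].round(3), output.shape)
```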

These positional encoding vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence.
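One concrete version of that pattern is the fixed sinusoidal encoding from the original Transformer paper: sines on the even dimensions and cosines on the odd ones, at geometrically spaced frequencies. The sketch below follows that formulation.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, as in the original Transformer paper."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)   # one frequency per dimension pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = positional_encoding(seq_len=10, d_model=512)
print(pe.shape)   # (10, 512); these vectors get added to the word embeddings
```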

The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.

The first probability distribution has the highest probability at the cell associated with the word “i”. The second probability distribution has the highest probability at the cell associated with the word “am”.
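A small sketch of that greedy selection, using a toy five-word vocabulary and made-up logits chosen so the first two steps produce “i” and “am” as above; nothing here comes from a real model.

```python
import numpy as np

# Toy vocabulary and made-up decoder logits for two time steps.
vocab = ["a", "am", "i", "thanks", "student"]
logits = np.array([
    [0.1, 0.2, 3.5, 0.3, 0.1],   # time step 1: "i" scores highest
    [0.2, 4.0, 0.1, 0.3, 0.2],   # time step 2: "am" scores highest
])

# Softmax turns each row of logits into a probability distribution over the vocabulary.
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Greedy decoding: at each time step, emit the word whose cell has the highest probability.
for step, p in enumerate(probs, start=1):
    print(f"step {step}: {vocab[int(np.argmax(p))]} (p={p.max():.2f})")
```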

Another way to do it would be to hold on to, say, the top two words, then in the next step run the model twice: once assuming the first output position was the word ‘I’, and another time assuming it was the word ‘a’. Whichever version produces less error, considering both positions #1 and #2, is kept.
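That strategy is beam search. The toy sketch below keeps a beam of two hypotheses and scores them with a stub probability table rather than a real decoder, so the vocabulary and numbers are invented purely for illustration.

```python
import math

# Stub "decoder": given the words produced so far, return made-up probabilities
# for the next word. A real system would run the Transformer decoder here.
def next_word_probs(prefix):
    table = {
        (): {"i": 0.55, "a": 0.40, "am": 0.05},
        ("i",): {"am": 0.8, "a": 0.1, "student": 0.1},
        ("a",): {"student": 0.6, "am": 0.2, "i": 0.2},
    }
    return table.get(tuple(prefix), {"<end>": 1.0})

def beam_search(beam_size=2, max_len=3):
    beams = [(0.0, [])]        # each hypothesis: (log probability, words so far)
    for _ in range(max_len):
        candidates = []
        for logp, words in beams:
            for word, p in next_word_probs(words).items():
                candidates.append((logp + math.log(p), words + [word]))
        # Keep only the `beam_size` most probable partial outputs.
        beams = sorted(candidates, reverse=True)[:beam_size]
    return beams

for logp, words in beam_search():
    print(f"{' '.join(words):20s} log-prob = {logp:.2f}")
```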
