Attention is all you need.

What is Sequence modeling?

Consider early language models such as the LSTM and GRU.


Each hidden state carries information from the inputs and outputs of all previous steps.

$$ h_{t+1} = f\left(h_1,\dots, h_t\right) $$
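The recurrence above can be sketched in code. This is a minimal toy example (the dimensions, weights, and `step` function are illustrative assumptions, not the exact LSTM/GRU update): each new state can only be computed after the previous one, so the loop is strictly sequential.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                  # hidden size (arbitrary toy value)
W_h = rng.normal(size=(d, d)) * 0.1    # recurrent weights (illustrative)
W_x = rng.normal(size=(d, d)) * 0.1    # input weights (illustrative)

def step(h_prev, x_t):
    # One recurrent update: the new state depends on the previous state,
    # which in turn depends on all earlier inputs.
    return np.tanh(W_h @ h_prev + W_x @ x_t)

xs = rng.normal(size=(5, d))           # a sequence of 5 input vectors
h = np.zeros(d)
for x_t in xs:
    h = step(h, x_t)                   # strictly sequential: step t+1 waits for step t
```

Note that the loop body cannot be parallelized across time steps, which is exactly the bottleneck the Transformer removes.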

This recurrence has the same structure as a recursive function: $h_{t+1}$ cannot be computed before $h_t$, so the work is strictly sequential in the sequence length, and information from early steps must survive every intermediate state.

For this reason it is hard to model long sentences, and processing long sequences is slow.

For this reason, the Transformer does not feed each step's output back in as the next step's input; instead it relates all positions to each other directly with attention.
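A minimal sketch of scaled dot-product attention shows how this works, assuming toy shapes (5 tokens of dimension 8). Every position attends to every other position through one matrix product, so the whole sequence is processed in parallel, with no recurrence.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n, n) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V                                   # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))    # 5 tokens, dimension 8 (arbitrary toy values)
out = attention(X, X, X)       # self-attention: Q = K = V come from the same sequence
```

All five output vectors are computed at once; there is no loop over time steps.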

Architecture of the Transformer


Before that, let's talk about what attention is.

What is attention?

Attention keeps the context vector from losing information about the outputs of earlier states.


In an RNN encoder–decoder, the context vector is produced only at the end of the sequence, so information from early steps gets lost, for example through vanishing gradients.

Attention can solve this problem.
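A minimal sketch makes this concrete, assuming toy shapes (6 encoder states of dimension 4 and dot-product alignment scores). Instead of using only the last encoder state, the decoder builds a context vector as a softmax-weighted sum over all encoder states, so early-step information is not lost.

```python
import numpy as np

rng = np.random.default_rng(1)
enc_states = rng.normal(size=(6, 4))   # 6 encoder hidden states, dim 4 (toy values)
dec_state = rng.normal(size=4)         # current decoder hidden state (toy values)

scores = enc_states @ dec_state        # dot-product alignment scores, one per encoder state
weights = np.exp(scores - scores.max())
weights /= weights.sum()               # softmax: the weights sum to 1
context = weights @ enc_states         # weighted sum over *all* encoder states
```

Every encoder state contributes to `context` in proportion to its weight, rather than only the final state being kept.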
