Attention is all you need.
Early recurrent language models such as LSTMs and GRUs process a sequence one step at a time.


Each hidden state carries information from all of the previous inputs and outputs:
$$ \begin{align} h_{t+1} = l_{t+1}\left(h_1, \dots, h_t\right) \end{align} $$
Like a recursive function, each step depends on the result of the previous one, so the sequence must be processed strictly in order and cannot be parallelized. This makes long sequences slow to handle, and information from early steps is easily lost along the way.
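The recurrent step above can be sketched in a few lines. This is a minimal illustration (the names, sizes, and weights here are made up for the example, not from any particular library): each hidden state is computed from the previous one, so the loop over time steps is strictly sequential.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                # hidden size (chosen for illustration)
W_h = rng.normal(size=(d, d)) * 0.1  # recurrent weights
W_x = rng.normal(size=(d, d)) * 0.1  # input weights

def rnn_step(h_prev, x_t):
    # h_t = tanh(W_h h_{t-1} + W_x x_t): the new state mixes the old state
    # with the current input, so information from step 1 must survive
    # every intermediate tanh to reach step t.
    return np.tanh(W_h @ h_prev + W_x @ x_t)

xs = rng.normal(size=(10, d))  # a sequence of 10 input vectors
h = np.zeros(d)
for x_t in xs:                 # strictly sequential: each step waits for the last
    h = rnn_step(h, x_t)
print(h.shape)                 # the final hidden state, here shape (4,)
```

The `for` loop is the bottleneck: step 10 cannot start until step 9 finishes, no matter how much hardware is available.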
This is why the Transformer does not feed each step's output back in as the next step's input.

Before going further, let's talk about what attention is.
Attention keeps the context vector from losing information about the outputs of earlier states.

A plain RNN encoder builds its context vector only at the end of the sequence, so information about earlier tokens is lost along the way (for example, through vanishing gradients).
Attention solves this problem by letting the model look back at every hidden state instead of only the last one.
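A minimal sketch of this idea is scaled dot-product attention over all encoder states (the function and variable names below are illustrative, not a specific library's API): the context vector becomes a weighted sum of every hidden state, so no single final state has to carry the whole sequence.

```python
import numpy as np

def attention(query, keys, values):
    # Score every encoder state against the query, normalize the scores
    # with a softmax, and return the weighted sum of the values: the
    # context vector draws on *every* time step, not just the final one.
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())  # stable softmax
    weights /= weights.sum()
    return weights @ values, weights

rng = np.random.default_rng(0)
T, d = 6, 4                          # sequence length, hidden size (assumed)
encoder_states = rng.normal(size=(T, d))
query = rng.normal(size=(d,))        # e.g. a decoder hidden state
context, weights = attention(query, encoder_states, encoder_states)
print(context.shape)                 # the context vector has the hidden size, here (4,)
```

Because the weights form a probability distribution over all `T` positions, an early token can still contribute directly to the context vector, rather than having to survive every recurrent step in between.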
