Attention is All You Need

Introduction

Why Not Use RNN?

Mainstream approaches for text processing include RNNs and encoder-decoder architectures (when structured information is abundant). However, RNNs (including LSTM and GRU) face significant limitations despite their ability to process sequential data.

Key Limitations of RNNs

High Computational Cost: RNNs suffer from heavy computational demands, especially on long sequences. Their sequential nature forces them to process tokens one by one (e.g., word by word in a sentence), making parallelization impossible. For a sequence of length $t$, an RNN requires $t$ sequential steps to compute the hidden states, where each $h_t = f(h_{t-1}, x_t)$ depends on the previous hidden state $h_{t-1}$ and the current input $x_t$. This sequential processing leads to a time complexity of $O(t)$, making RNNs impractical for long sequences (see the sketch below).

Information Loss in Long Sequences: Historical information is compressed into a fixed-size hidden state $h_t$. As sequences grow longer, early-stage information tends to degrade or vanish due to the limited capacity of the hidden state, which makes RNNs ineffective at capturing long-range dependencies.

Gradient Issues: Long sequences exacerbate the vanishing/exploding gradient problem, hindering stable training. While LSTM and GRU mitigate this to some extent, they still struggle with extremely long contexts.

The Rise of Transformer ...
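As a rough sketch of the contrast above (not from the original post; the NumPy setup, dimensions, and weight names are illustrative assumptions), the loop below shows why the recurrence $h_t = f(h_{t-1}, x_t)$ cannot be parallelized across time steps, while self-attention computes all positions in a single matrix product:

```python
import numpy as np

# Toy dimensions (illustrative only, not from the post).
seq_len, d = 6, 4
x = np.random.randn(seq_len, d)            # input sequence x_1..x_t
W_h, W_x = np.random.randn(d, d), np.random.randn(d, d)

# RNN: each h_t depends on h_{t-1}, so the loop is inherently sequential.
h = np.zeros(d)
hidden_states = []
for t in range(seq_len):                    # O(t) sequential steps
    h = np.tanh(W_h @ h + W_x @ x[t])       # h_t = f(h_{t-1}, x_t)
    hidden_states.append(h)

# Self-attention: every position attends to every other position in one
# matrix product, so all time steps are processed in parallel.
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d)               # pairwise similarity scores
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
attended = weights @ V                      # all positions computed at once
```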

April 12, 2025 · 2 min