Self-Attention Explained: Implementing Scaled Dot-Product Attention with PyTorch
LMs are trained to predict the next word based on the context of the previous words. To make accurate predictions, however, LMs need to understand the relationships between the words in a sentence. This is the objective of the attention mechanism: it helps the LM focus on the words most relevant to the current context when making predictions. In this post, we’ll implement scaled dot-product attention in a simple way.

Back in the day, RNNs were the standard for sequence-to-sequence tasks, but everything changed when attention mechanisms came into the picture. Then the groundbreaking paper “Attention Is All You Need” shook things up even more, showing that RNNs weren’t necessary at all—attention alone could handle it. Since then, attention has been the backbone of modern Transformer-based architectures.
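As a preview, here is a minimal sketch of the scaled dot-product attention we’ll be building, following the formula from the paper, softmax(QKᵀ / √d_k)V. The function name, tensor shapes, and optional mask handling are my own choices for illustration, not a definitive implementation:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value, mask=None):
    # query, key, value: tensors of shape (batch, seq_len, d_k)
    d_k = query.size(-1)
    # Similarity scores between every query and every key, scaled by sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 are excluded from attention
        scores = scores.masked_fill(mask == 0, float("-inf"))
    # Normalize scores into attention weights that sum to 1 over the keys
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values
    return torch.matmul(weights, value), weights

# Toy usage: one sequence of 5 tokens with embedding dimension 16
q = k = v = torch.randn(1, 5, 16)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([1, 5, 16]) torch.Size([1, 5, 5])
```

The attention weights form a 5×5 matrix here: one row per query token, telling us how much that token attends to each of the other tokens in the sequence.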