Self-Attention Explained: Implementing Scaled Dot-Product Attention with PyTorch

Language models (LMs) are trained to predict the next word from the context of the previous words. To make accurate predictions, however, an LM needs to understand the relationships between the words in a sentence. That is the job of the attention mechanism: it helps the LM focus on the words most relevant to the current context. In this post, we'll implement scaled dot-product attention in a simple way.

Back in the day, RNNs were the standard for sequence-to-sequence tasks, but everything changed when attention mechanisms came into the picture. Then the groundbreaking paper "Attention Is All You Need" shook things up even more, showing that RNNs weren't necessary at all: attention alone could handle it. Since then, attention has been the backbone of modern architectures like the Transformer.
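To make the idea concrete before the full walkthrough, here is a minimal sketch of scaled dot-product attention in PyTorch. The function name `scaled_dot_product_attention` and the optional `mask` argument are illustrative choices of mine, not necessarily the exact code the post builds up to:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value, mask=None):
    # query, key, value: (batch, seq_len, d_k)
    # Computes softmax(Q K^T / sqrt(d_k)) V
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / d_k**0.5
    if mask is not None:
        # Positions where mask == 0 are hidden from attention
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)   # attention weights over the sequence
    return torch.matmul(weights, value), weights
```

Scaling by the square root of the key dimension keeps the dot products from growing too large, which would otherwise push the softmax into regions with vanishing gradients.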

Multi-Head Attention Explained: Implementing Masked Multi-Head Attention from Scratch in PyTorch

In our previous article, we built self-attention from scratch using PyTorch. If you haven't checked that out yet, I highly recommend giving it a read before this one! Now, let's take things a step further and implement Multi-Head Attention from scratch. This post focuses more on the implementation than on the theory, so I assume you're already familiar with how self-attention works. Let's get started!
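As a rough preview of where we're headed, here is a minimal sketch of masked multi-head attention built on top of that self-attention idea. The class name `MultiHeadAttention` and its parameters (`d_model`, `num_heads`) are assumptions made for illustration, not the post's exact implementation:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One projection per role; the output is later split into `num_heads` slices
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch, seq_len, _ = x.shape

        def split(t):
            # (batch, seq_len, d_model) -> (batch, heads, seq_len, d_head)
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / self.d_head**0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        out = weights @ v                                  # (batch, heads, seq_len, d_head)
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.w_o(out)                               # mix the heads back together
```

Passing a causal mask such as `torch.tril(torch.ones(seq_len, seq_len))` is what makes this "masked" attention: each position can only attend to itself and earlier positions.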

LSTM illustration

Implementing an LSTM from scratch in PyTorch, step by step.

LSTM from Scratch

In this post, we will implement a simple next-word-predictor LSTM from scratch using PyTorch.

A Gentle Introduction to LSTM

Long Short-Term Memory networks, usually just called "LSTMs", are a special kind of RNN capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997). As LSTMs are a type of recurrent neural network, they too have a hidden state, but they also maintain a separate memory component called the cell state.
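To ground those two states, here is a minimal sketch of a single LSTM step, written as a hypothetical `LSTMCell` purely for illustration (the post builds its own version). It shows how the cell state `c_t` carries long-term memory while the hidden state `h_t` is the per-step output:

```python
import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One linear layer computes all four gates at once from [x_t, h_{t-1}]
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        self.hidden_size = hidden_size

    def forward(self, x_t, state):
        h_prev, c_prev = state
        z = self.gates(torch.cat([x_t, h_prev], dim=-1))
        i, f, g, o = z.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # input, forget, output gates
        g = torch.tanh(g)                                               # candidate cell update
        c_t = f * c_prev + i * g        # cell state: long-term memory
        h_t = o * torch.tanh(c_t)       # hidden state: output at this time step
        return h_t, c_t
```

The forget gate decides how much of the old cell state to keep, the input gate decides how much of the new candidate to write, and the output gate decides how much of the cell state is exposed through the hidden state.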