Building a transformer from scratch with PyTorch
Since the original "Attention Is All You Need" paper [Vaswani et al., 2017], the transformer architecture has been the foundation of modern language models. To understand what these models are doing, it helps to know how they work internally. As an exercise in building my own understanding, I will build a small, 5,242,000-parameter encoder-decoder transformer following the original paper and train it on a small dataset [OPUS Books English-Spanish translation].
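To get a feel for where a parameter count of this order comes from, here is a rough sketch that counts the learnable parameters of a generic encoder-decoder transformer in plain Python. It is not the exact model built in this post. The function names and the hyperparameters in the usage line are hypothetical, and the sketch assumes biases on every linear layer, no weight sharing between embeddings and the output projection, parameter-free sinusoidal positional encodings, and no extra final layer norm.

```python
def attention_params(d_model: int) -> int:
    # Q, K, V and output projections: each a d_model x d_model weight plus a bias.
    return 4 * (d_model * d_model + d_model)


def feed_forward_params(d_model: int, d_ff: int) -> int:
    # Two linear layers: d_model -> d_ff and d_ff -> d_model, both with biases.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)


def layer_norm_params(d_model: int) -> int:
    # One scale and one shift vector.
    return 2 * d_model


def encoder_layer_params(d_model: int, d_ff: int) -> int:
    # Self-attention + feed-forward, each followed by a layer norm.
    return (attention_params(d_model)
            + feed_forward_params(d_model, d_ff)
            + 2 * layer_norm_params(d_model))


def decoder_layer_params(d_model: int, d_ff: int) -> int:
    # Masked self-attention + cross-attention + feed-forward, three layer norms.
    return (2 * attention_params(d_model)
            + feed_forward_params(d_model, d_ff)
            + 3 * layer_norm_params(d_model))


def transformer_params(d_model: int, d_ff: int, n_layers: int,
                       src_vocab: int, tgt_vocab: int) -> int:
    # Assumption: separate source/target embeddings and an untied output layer.
    embeddings = (src_vocab + tgt_vocab) * d_model
    output_projection = d_model * tgt_vocab + tgt_vocab
    layers = n_layers * (encoder_layer_params(d_model, d_ff)
                         + decoder_layer_params(d_model, d_ff))
    return embeddings + layers + output_projection


# Hypothetical hyperparameters, chosen only for illustration.
print(transformer_params(d_model=128, d_ff=512, n_layers=3,
                         src_vocab=8000, tgt_vocab=8000))
```

With small vocabularies and a narrow model, the embedding and output layers can account for a large share of the total, so the vocabulary size matters as much as the layer count when targeting a few million parameters.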