Dec 23, 2021
I think "wrong" is a harsh characterization. You're correct that it's not possible to implement *causal* models with this code as presented. But there are quite a few transformer-based models that don't use causal attention (e.g. ViT, DETR, BERT, ...).
Excluding masked attention allowed for a much cleaner implementation, and supporting it would be a small change (sketched below). This is just a tutorial/walkthrough -- not intended to be production-level code.
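To make that concrete, here's a minimal sketch of how a causal mask could be bolted onto plain scaled dot-product attention. The function and variable names here are illustrative, not the ones used in the post:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, mask=None):
    # q, k, v: [seq_len, d_k]; mask: [seq_len, seq_len] of 0 / -inf
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = scores + mask  # -inf entries become 0 weight after softmax
    return softmax(scores) @ v

# Causal mask: position i may only attend to positions <= i
seq_len = 4
causal_mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
```

So it's only an extra argument and an additive mask, but threading that through every layer clutters a walkthrough whose point is the unmasked attention math.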