Dec 23, 2021
Sorry for the late response here. Yes, I intentionally left off the masked attention for the sake of simplicity. In fact, *every* decoder layer would need to include masked attention -- not just the first one.
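For illustration, here's a minimal PyTorch sketch of what that looks like -- the class name, dimensions, and layer count are placeholders I'm assuming for the example, not something from the original post. The point is just that the same causal mask goes into the masked self-attention of *every* stacked decoder layer:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, cross-attention, feed-forward."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, memory, causal_mask):
        # Masked self-attention: the causal mask stops position i from
        # attending to positions > i. This happens in *every* layer.
        a, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + a)
        # Cross-attention over the encoder output (no causal mask here).
        a, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + a)
        return self.norm3(x + self.ff(x))

# Every one of the stacked decoder layers receives the same causal mask.
seq_len = 10
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
layers = nn.ModuleList(DecoderLayer() for _ in range(6))
x = torch.randn(2, seq_len, 512)   # target-side embeddings
memory = torch.randn(2, 12, 512)   # encoder output
for layer in layers:
    x = layer(x, memory, mask)
```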