I am trying to train huggingface's implementation of the GPT2 model from scratch (meaning I am using their architecture but not the pre-trained weights), but looking at the code here https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_gpt2.py, I don't see an implementation of a causal mask.
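For context, this is roughly how I'm setting the model up — just the default GPT2Config with randomly initialized weights (the hyperparameters below are the library defaults, not my actual ones):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Default architecture hyperparameters (n_layer=12, n_head=12, n_embd=768, vocab_size=50257)
config = GPT2Config()

# Building the model directly from the config gives randomly initialized weights,
# unlike GPT2LMHeadModel.from_pretrained("gpt2"), which would load the pre-trained ones.
model = GPT2LMHeadModel(config)
```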
I could write an ugly for loop and feed my training sequences to the network one token at a time, which would not be efficient. I could also chop up each of my examples token by token, pad them, and feed them as a batch (roughly as in the sketch below), which is probably faster but doesn't feel super satisfying.
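For that second option, this is the kind of thing I have in mind (just a sketch; the token ids and pad id are made up, not from my actual data):

```python
import torch

# One hypothetical tokenized training sequence
token_ids = [464, 3290, 8308, 625, 262, 13990]
pad_id = 0  # placeholder; GPT-2 has no pad token by default, so one would have to be chosen

prefixes, masks, targets = [], [], []
for i in range(1, len(token_ids)):
    prefixes.append(token_ids[:i] + [pad_id] * (len(token_ids) - i))  # prefix, right-padded to full length
    masks.append([1] * i + [0] * (len(token_ids) - i))                # mask out the padding
    targets.append(token_ids[i])                                      # next token each prefix should predict

input_ids = torch.tensor(prefixes)    # (num_prefixes, seq_len)
attention_mask = torch.tensor(masks)  # (num_prefixes, seq_len)
labels = torch.tensor(targets)        # (num_prefixes,)
```

So every example gets blown up into one row per prefix, which is exactly why it feels wasteful compared to a proper causal mask over the whole sequence at once.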
Have any of you worked closely with huggingface's transformers before? Do you know if there is an implementation of a causal mask that I missed, or another way to do what I am describing?
PS: Yes, I have already read huggingface's blog post on training from scratch, but it's mostly incomplete and the relevant parts about training are left out.