
I am trying to train Hugging Face's implementation of the GPT-2 model from scratch (meaning I am using their architecture but not the pre-trained weights), but looking at the code here https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_gpt2.py I can't find an implementation of a causal mask.
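
For context, this is roughly how I am building the model from a fresh config rather than loading pre-trained weights (a minimal sketch; the hyperparameter values below are just placeholders, not a recommendation):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Build the GPT-2 architecture from a fresh config so the weights are
# randomly initialised instead of loaded from a pre-trained checkpoint.
# These hyperparameter values are placeholders.
config = GPT2Config(
    vocab_size=50257,
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)
model = GPT2LMHeadModel(config)
```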

I could write an ugly for loop and feed the network my training sequences one token at a time, which would be quite inefficient. I could also chop up each of my examples token by token, pad them, and feed them as a batch, which is probably faster but doesn't feel very satisfying.
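
To make the question concrete, what I was expecting to find somewhere in the attention code is a lower-triangular mask along these lines (a rough PyTorch sketch of the general idea, not code taken from the library):

```python
import torch

seq_len = 8

# Lower-triangular mask: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# Applied to the raw attention scores before the softmax:
scores = torch.randn(seq_len, seq_len)                     # dummy scores
scores = scores.masked_fill(~causal_mask, float("-inf"))   # hide future tokens
attn_weights = torch.softmax(scores, dim=-1)
```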

Have any of you worked closely with Hugging Face's transformers before? Do you know if there is an implementation of the causal mask that I missed, or another way to do what I am describing?

PS: Yes, I have already read Hugging Face's blog post on training from scratch, but it is mostly incomplete and the relevant parts concerning training are left out.

