I am trying to train huggingface's implementation of the GPT2 model from scratch (meaning I am using their architecture but not the pre-trained weights), but looking at the code here https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_gpt2.py, I don't see an implementation of a causal mask.
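For context, this is roughly how I'm setting the model up — just the default GPT2Config with randomly initialized weights (the hyperparameters below are the library defaults, not my actual ones):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Default architecture hyperparameters (n_layer=12, n_head=12, n_embd=768, vocab_size=50257)
config = GPT2Config()

# Building the model directly from the config gives randomly initialized weights,
# unlike GPT2LMHeadModel.from_pretrained("gpt2"), which would load the pre-trained ones.
model = GPT2LMHeadModel(config)
```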
I could write an ugly for loop and feed my training sequences to the network one token at a time, which would not be efficient. I could also chop up each of my examples token by token, pad them, and feed them as a batch (roughly as in the sketch below), which is probably faster but doesn't feel super satisfying.
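For that second option, this is the kind of thing I have in mind (just a sketch; the token ids and pad id are made up, not from my actual data):

```python
import torch

# One hypothetical tokenized training sequence
token_ids = [464, 3290, 8308, 625, 262, 13990]
pad_id = 0  # placeholder; GPT-2 has no pad token by default, so one would have to be chosen

prefixes, masks, targets = [], [], []
for i in range(1, len(token_ids)):
    prefixes.append(token_ids[:i] + [pad_id] * (len(token_ids) - i))  # prefix, right-padded to full length
    masks.append([1] * i + [0] * (len(token_ids) - i))                # mask out the padding
    targets.append(token_ids[i])                                      # next token each prefix should predict

input_ids = torch.tensor(prefixes)    # (num_prefixes, seq_len)
attention_mask = torch.tensor(masks)  # (num_prefixes, seq_len)
labels = torch.tensor(targets)        # (num_prefixes,)
```

So every example gets blown up into one row per prefix, which is exactly why it feels wasteful compared to a proper causal mask over the whole sequence at once.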
Have any of you worked closely with huggingface's transformers before? Do you know if there is an implementation of a causal mask that I missed, or another way to do what I am describing?
PS: Yes, I have already read huggingface's blog post on training from scratch, but it's mostly incomplete and the relevant parts about training are left out.