
Trying to train GPT-2 on a very large text, in order to generate text from a specific domain.
Working with TensorFlow 2.

For example, let's say I have all of Harry Potter books :)
And I want to train the GPT-2 on them, so I could later generate text from the Harry Potter domain.

from tensorflow.keras.utils import get_file
from transformers import GPT2Tokenizer, TFGPT2Model

text = '...'
# Length of text: 474429 characters
# 84 unique characters

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = TFGPT2Model.from_pretrained('gpt2-medium')

encoded_input = tokenizer(text, return_tensors='tf') # ERROR
output = model(encoded_input)

input_ids = tokenizer.encode('severus snape', return_tensors='tf')
greedy_output = model.generate(input_ids, max_length=50)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

ERROR: Token indices sequence length is longer than the specified maximum sequence length for this model (149887 > 1024). Running this sequence through the model will result in indexing errors

So how would I make it work?
How do I feed the model a large new text to train on?

EDIT:
When trying to concatenate the batches, the tokenizer works, but the model doesn't:

from textwrap import wrap
import tensorflow as tf

text_batches = wrap(text, 1000)

encoded_input = None

for tb in text_batches:
    current = tokenizer(tb, return_tensors='tf')

    if encoded_input is None:
        encoded_input = current
    else:
        encoded_input['input_ids']      = tf.concat([encoded_input['input_ids'], current['input_ids']], axis=-1)
        encoded_input['attention_mask'] = tf.concat([encoded_input['attention_mask'], current['attention_mask']], axis=-1)

output = model(encoded_input) # ERROR

ERROR: InvalidArgumentError: indices[0,1024] = 1024 is not in [0, 1024) [Op:ResourceGather]

What am I missing?

Sahar Millis

1 Answer


Your problem is not related to training on a different domain. Rather, you're simply providing an input (apparently 149,887 tokens) that's longer than the maximum sequence length the model can support (1024). You have three options:

  1. Manually truncate your input strings to the max length of tokens.

  2. Set the max_length parameter in the call to your tokenizer, e.g. tokenizer(text, max_length=1024, ...), together with truncation=True so the tokenizer actually cuts the sequence there (see the sketch after this list). Be sure to read all the available options for the Tokenizer class.

  3. Revisit why you need a text string of 149K tokens. Is this the whole body of the text? Should you instead use sentences?
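
For illustration, here is a minimal sketch of what options 1 and 2 could look like (the 1024 comes from GPT-2's maximum sequence length, and truncation=True is what makes the tokenizer actually cut the sequence; variable names follow the question's code):

# Option 2: let the tokenizer truncate for you.
encoded_input = tokenizer(text,
                          max_length=1024,   # GPT-2's maximum sequence length
                          truncation=True,   # drop everything past max_length
                          return_tensors='tf')
output = model(encoded_input)                # stays within the 1024-token limit

# Option 1: truncate the token ids by hand instead.
input_ids = tokenizer.encode(text)[:1024]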

stackoverflowuser2010
  • Thanks for your help. 1. Tried it - did not work; added it to the question. 2. It can't be set to more than 1024 when loading a pretrained model. 3. How would you do it with sentences? I wish I knew how :) – Sahar Millis Sep 16 '20 at 03:13
  • If the error says `1024 is not in [0, 1024)`, then you have to be able to read simple math notation. `1024)` means up to but not including 1024, so try 1023. – stackoverflowuser2010 Sep 16 '20 at 04:30
  • If you want to parse big texts into sentences, then you have three options: (a) use a regular expression (I wouldn't choose this approach); (b) use NLTK sentence parser; or (c) use Spacy sentence parser. All three options are discussed in one StackOverflow post: https://stackoverflow.com/questions/4576077/how-can-i-split-a-text-into-sentences – stackoverflowuser2010 Sep 16 '20 at 04:34
  • Appreciate the input, but every pretrained model has its own tokens, so a different tokenizer is not the way to go, especially when using embeddings. Maybe I need to figure out how to train the model in 1024-token batches? – Sahar Millis Sep 16 '20 at 20:39
  • I didn't say to use a different tokenizer. The problem is that you need to limit the number of tokens per example to 1024 (or whatever the limit is for a given model); one way to do that is sketched below. – stackoverflowuser2010 Sep 16 '20 at 22:05
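
Following up on the last two comments, here is a minimal sketch (not from the original post) of one way to feed the corpus to the model in 1024-token blocks for fine-tuning. It assumes text already holds the raw corpus; the block size is GPT-2's limit, while the batch size, learning rate, and single pass over the data are illustrative. TFGPT2LMHeadModel is used instead of TFGPT2Model because a language-modelling head is needed for training and generation. (If sentence-level examples were preferred, nltk.tokenize.sent_tokenize could produce them instead of fixed blocks, as suggested in the comments above.)

import tensorflow as tf
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = TFGPT2LMHeadModel.from_pretrained('gpt2-medium')  # LM head is needed for training/generation

# Tokenize the whole corpus once, then cut it into fixed-size blocks.
# (The tokenizer may still warn about the total length; that's harmless here
# because only 1024-token blocks are ever fed to the model.)
block_size = 1024                                  # GPT-2's maximum sequence length
ids = tokenizer.encode(text)                       # token ids for the whole corpus
n_blocks = len(ids) // block_size
ids = ids[:n_blocks * block_size]                  # drop the leftover tail
blocks = tf.reshape(tf.constant(ids), (n_blocks, block_size))

dataset = tf.data.Dataset.from_tensor_slices(blocks).shuffle(n_blocks).batch(2)

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

for batch in dataset:                              # one illustrative pass over the data
    with tf.GradientTape() as tape:
        logits = model(batch, training=True)[0]    # (batch, seq_len, vocab_size)
        # Causal LM objective: predict token t+1 from the tokens up to t.
        loss = loss_fn(batch[:, 1:], logits[:, :-1, :])
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

# Generation afterwards works as in the question:
input_ids = tokenizer.encode('severus snape', return_tensors='tf')
greedy_output = model.generate(input_ids, max_length=50)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))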