
Trying to train GPT-2 on a very large text, in order to generate text from a specific domain.
Working with TensorFlow 2.

For example, let's say I have all of Harry Potter books :)
And I want to train the GPT-2 on them, so I could later generate text from the Harry Potter domain.

from tensorflow.keras.utils import get_file
from transformers import GPT2Tokenizer, TFGPT2Model

text = '...'
# Length of text: 474429 characters
# 84 unique characters

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = TFGPT2Model.from_pretrained('gpt2-medium')

encoded_input = tokenizer(text, return_tensors='tf') # ERROR
output = model(encoded_input)

input_ids = tokenizer.encode('severus snape', return_tensors='tf')
greedy_output = model.generate(input_ids, max_length=50)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

ERROR: Token indices sequence length is longer than the specified maximum sequence length for this model (149887 > 1024). Running this sequence through the model will result in indexing errors

So how would I make it work?
How do I feed the model a large new text to train on?

EDIT:
When trying to concatenate the batches, the tokenizer works, but the model doesn't:

from textwrap import wrap
import tensorflow as tf

text_batches = wrap(text, 1000)

encoded_input = None

for tb in text_batches:
    current = tokenizer(tb, return_tensors='tf')

    if encoded_input is None:
        encoded_input = current
    else:
        encoded_input['input_ids']      = tf.concat([encoded_input['input_ids'], current['input_ids']], axis=-1)
        encoded_input['attention_mask'] = tf.concat([encoded_input['attention_mask'], current['attention_mask']], axis=-1)

output = model(encoded_input) # ERROR

ERROR: InvalidArgumentError: indices[0,1024] = 1024 is not in [0, 1024) [Op:ResourceGather]

What am I missing?

Sahar Millis

1 Answer


Your problem is not related to training on a different domain. Rather, you're simply providing an input (apparently 149,887 tokens) that's longer than the maximum sequence length the model can support (1024). You have three options:

  1. Manually truncate your input strings to the max length of tokens.

  2. Set the max_length parameter in the call to your tokenizer, e.g. tokenizer(text, max_length=1024, ...), together with truncation=True so the tokenizer actually cuts the sequence there (see the sketch after this list). Be sure to read all the available options for the Tokenizer class.

  3. Revisit why you need a text string of 149K tokens. Is this the whole body of the text? Should you instead use sentences?
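
For illustration, here is a minimal sketch of what options 1 and 2 could look like (the 1024 comes from GPT-2's maximum sequence length, and truncation=True is what makes the tokenizer actually cut the sequence; variable names follow the question's code):

# Option 2: let the tokenizer truncate for you.
encoded_input = tokenizer(text,
                          max_length=1024,   # GPT-2's maximum sequence length
                          truncation=True,   # drop everything past max_length
                          return_tensors='tf')
output = model(encoded_input)                # stays within the 1024-token limit

# Option 1: truncate the token ids by hand instead.
input_ids = tokenizer.encode(text)[:1024]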

stackoverflowuser2010
  • Thanks for your help. 1. Tried it - did not work; added it to the question. 2. It can't be set to more than 1024 when loading a pretrained model. 3. How would you do it with sentences? I wish I knew how :) – Sahar Millis Sep 16 '20 at 03:13
  • If the error says `1024 is not in [0, 1024)`, then you have to be able to read simple math notation. `1024)` means up to but not including 1024, so try 1023. – stackoverflowuser2010 Sep 16 '20 at 04:30
  • If you want to parse big texts into sentences, then you have three options: (a) use a regular expression (I wouldn't choose this approach); (b) use NLTK sentence parser; or (c) use Spacy sentence parser. All three options are discussed in one StackOverflow post: https://stackoverflow.com/questions/4576077/how-can-i-split-a-text-into-sentences – stackoverflowuser2010 Sep 16 '20 at 04:34
  • Appreciate the input, but every pretrained model has its own tokens, so a different tokenizer is not the way to go, especially when using embeddings. Maybe I need to figure out how to train the model in 1024-token batches? – Sahar Millis Sep 16 '20 at 20:39
  • I didn't say to use a different tokenizer. The problem is that you need to limit the number of tokens per example to 1024 (or whatever the limit is for a given model); one way to do that is sketched below. – stackoverflowuser2010 Sep 16 '20 at 22:05
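
Following up on the last two comments, here is a minimal sketch (not from the original post) of one way to feed the corpus to the model in 1024-token blocks for fine-tuning. It assumes text already holds the raw corpus; the block size is GPT-2's limit, while the batch size, learning rate, and single pass over the data are illustrative. TFGPT2LMHeadModel is used instead of TFGPT2Model because a language-modelling head is needed for training and generation. (If sentence-level examples were preferred, nltk.tokenize.sent_tokenize could produce them instead of fixed blocks, as suggested in the comments above.)

import tensorflow as tf
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = TFGPT2LMHeadModel.from_pretrained('gpt2-medium')  # LM head is needed for training/generation

# Tokenize the whole corpus once, then cut it into fixed-size blocks.
# (The tokenizer may still warn about the total length; that's harmless here
# because only 1024-token blocks are ever fed to the model.)
block_size = 1024                                  # GPT-2's maximum sequence length
ids = tokenizer.encode(text)                       # token ids for the whole corpus
n_blocks = len(ids) // block_size
ids = ids[:n_blocks * block_size]                  # drop the leftover tail
blocks = tf.reshape(tf.constant(ids), (n_blocks, block_size))

dataset = tf.data.Dataset.from_tensor_slices(blocks).shuffle(n_blocks).batch(2)

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

for batch in dataset:                              # one illustrative pass over the data
    with tf.GradientTape() as tape:
        logits = model(batch, training=True)[0]    # (batch, seq_len, vocab_size)
        # Causal LM objective: predict token t+1 from the tokens up to t.
        loss = loss_fn(batch[:, 1:], logits[:, :-1, :])
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

# Generation afterwards works as in the question:
input_ids = tokenizer.encode('severus snape', return_tensors='tf')
greedy_output = model.generate(input_ids, max_length=50)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))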