I'm trying to train GPT-2 on a very large text, in order to generate text from a specific domain.
I'm working with TensorFlow 2.
For example, let's say I have all of the Harry Potter books :)
And I want to train GPT-2 on them, so I can later generate text from the Harry Potter domain.
from tensorflow.keras.utils import get_file
from transformers import GPT2Tokenizer, TFGPT2Model
text = '...'
# Length of text: 474429 characters
# 84 unique characters
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = TFGPT2Model.from_pretrained('gpt2-medium')
encoded_input = tokenizer(text, return_tensors='tf') # ERROR
output = model(encoded_input)
input_ids = tokenizer.encode('severus snape', return_tensors='tf')
greedy_output = model.generate(input_ids, max_length=50)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
ERROR: Token indices sequence length is longer than the specified maximum sequence length for this model (149887 > 1024). Running this sequence through the model will result in indexing errors
So how would I make this work?
How can I feed the model a large new text to train on?
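From the error it looks like GPT-2 simply can't attend to more than 1024 tokens at once (the tokenizer message is only a warning; the hard limit comes from the model's position embeddings). My guess is that the corpus has to be split into blocks of at most 1024 tokens and the model fine-tuned with a language-modeling head (TFGPT2LMHeadModel rather than TFGPT2Model). A rough, untested sketch of what I have in mind (block size, optimizer and hyperparameters are just placeholders):
import tensorflow as tf
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = TFGPT2LMHeadModel.from_pretrained('gpt2-medium')  # LM head is needed for training

# Tokenize the whole corpus once; as a plain Python list there is no hard length limit,
# the 1024 limit only applies to what the model sees in a single forward pass.
ids = tokenizer(text)['input_ids']

# Cut the token stream into fixed-size blocks that fit GPT-2's 1024-token context.
block_size = 1024
blocks = [ids[i:i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]
examples = tf.constant(blocks)            # shape: (num_blocks, 1024)

# Causal-LM objective: predict the next token, so labels are the inputs shifted by one.
inputs, labels = examples[:, :-1], examples[:, 1:]
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels)).shuffle(1000).batch(2)

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

for epoch in range(1):
    for batch_inputs, batch_labels in dataset:
        with tf.GradientTape() as tape:
            logits = model(batch_inputs, training=True)[0]  # (batch, seq_len, vocab_size)
            loss = loss_fn(batch_labels, logits)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
Is something along these lines the right direction, or is there a more standard way to do this?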
EDIT:
When trying to concatenate the encoded batches, the tokenizer works, but the model doesn't:
import tensorflow as tf
from textwrap import wrap

text_batches = wrap(text, 1000)
encoded_input = None

for tb in text_batches:
    current = tokenizer(tb, return_tensors='tf')
    if encoded_input is None:
        encoded_input = current
    else:
        encoded_input['input_ids'] = tf.concat([encoded_input['input_ids'], current['input_ids']], axis=-1)
        encoded_input['attention_mask'] = tf.concat([encoded_input['attention_mask'], current['attention_mask']], axis=-1)

output = model(encoded_input)  # ERROR
ERROR: InvalidArgumentError: indices[0,1024] = 1024 is not in [0, 1024) [Op:ResourceGather]
What am I missing?
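From what I can tell, the position embeddings only cover indices 0 to 1023, so concatenating along the sequence axis just recreates the too-long-sequence problem: position 1024 has no embedding to gather, hence the ResourceGather error. Would it instead make sense to keep the chunks as separate rows of a batch and let the tokenizer pad them? A sketch of what I mean (untested; GPT-2 defines no pad token by default, so one has to be assigned first):
# Untested idea: batch the chunks along axis 0 instead of concatenating along the sequence axis.
tokenizer.pad_token = tokenizer.eos_token      # GPT-2 has no pad token, reuse EOS
encoded_input = tokenizer(text_batches,        # list of strings -> one row per chunk
                          return_tensors='tf',
                          padding=True,
                          truncation=True,
                          max_length=1024)
output = model(encoded_input)                  # every row is now <= 1024 tokens
# (in practice this would still need to be fed in smaller batches to fit in memory)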