
I have searched a lot for this but still haven't found a clear answer, so I hope you can help me out:

I am trying to translate German texts to English. I used this code:


from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-de-en")

batch = tokenizer(
    list(data_bert[:100]),
    padding=True,
    truncation=True,
    max_length=250,
    return_tensors="pt")["input_ids"]

results = model(batch)  

This returned a size error. I fixed the problem (thanks to the community: https://github.com/huggingface/transformers/issues/5480) by switching the last line of code to:

results = model(input_ids=batch, decoder_input_ids=batch)

Now my output looks like a really long array. What is this output, precisely? Is it some sort of word embeddings? And if so, how do I go about converting these embeddings into English text? Thanks a lot!

soulwreckedyouth

2 Answers


I think one possible answer to your dilemma is provided in this question: https://stackoverflow.com/questions/61523829/how-can-i-use-bert-fo-machine-translation

Practically, the output of BERT gives you a vector representation for each of your tokens. This makes the output straightforward to use for many downstream tasks, but trickier in the case of machine translation.

A good starting point for using a seq2seq model from the transformers library for machine translation is the following notebook: https://github.com/huggingface/notebooks/blob/master/examples/translation.ipynb.

The notebook above shows how to translate from English to Romanian.

Timbus Calin

Adding to Timbus Calin's answer:

What is this output precisely? Are these some sort of word embeddings?

results is of type <class 'transformers.modeling_outputs.Seq2SeqLMOutput'> and you can do

results.__dict__.keys()

to check that results contains the following:

dict_keys(['loss', 'logits', 'past_key_values', 'decoder_hidden_states', 'decoder_attentions', 'cross_attentions', 'encoder_last_hidden_state', 'encoder_hidden_states', 'encoder_attentions'])

You can read more about this class in the Hugging Face documentation.
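As a small sketch of this (assuming transformers and a PyTorch backend are installed; the German sentence is just an illustrative example, not from the question's data), you can check that the logits hold one score per vocabulary entry at every position, rather than word embeddings:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-de-en")

batch = tokenizer(["Das ist ein Test."], return_tensors="pt")["input_ids"]
results = model(input_ids=batch, decoder_input_ids=batch)

# results.logits has shape (batch_size, sequence_length, vocab_size):
# a score for every vocabulary entry at each position, not an embedding.
print(results.logits.shape)
```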

How shall I go on with converting these embeddings to the texts in the english language?

To get the translated English text, you can use model.generate, whose output is easily decodable:

predictions = model.generate(batch)
english_text = tokenizer.batch_decode(predictions, skip_special_tokens=True)
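Putting the pieces together, here is a self-contained sketch (the German input is an illustrative stand-in for the data_bert slice from the question):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-de-en")

# tokenize the German input exactly as in the question
batch = tokenizer(
    ["Ich habe keine Ahnung."],
    padding=True,
    truncation=True,
    max_length=250,
    return_tensors="pt")["input_ids"]

# generate runs the full encoder-decoder loop and returns token ids
predictions = model.generate(batch)

# skip_special_tokens=True strips <pad> and </s> markers from the decoded text
english_text = tokenizer.batch_decode(predictions, skip_special_tokens=True)
print(english_text)
```

This returns a list with one decoded English string per input sentence.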
kkgarg