I'm following this OpenAI tutorial about fine-tuning.

I already generated the dataset with the openai tool. The problem is that the output (the inference result) mixes correctly encoded UTF-8 characters with badly encoded ones.

The generated dataset looks like this:

{"prompt":"Usuario: Quién eres\\nAsistente:","completion":" Soy un Asistente\n"}
{"prompt":"Usuario: Qué puedes hacer\\nAsistente:","completion":" Ayudarte con cualquier gestión o ofrecerte información sobre tu cuenta\n"}

For instance, if I ask "¿Cómo estás?" and there's a trained completion for that sentence, "Estoy bien, ¿y tú?", the inference often returns exactly that (which is good), but sometimes it appends badly encoded words: "Estoy bien, ¿y tú? CuÃ©ntame algo de ti", with "Ã©" instead of "é".

Sometimes, it returns exactly the same sentence that was trained for, with no encoding issues. I don't know if the inference is taking the non-encoded characters from my model or from somewhere else.
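
One way to narrow down where the bad characters come from is to scan the training data itself. The following is only a rough sketch: it assumes the dataset is JSONL text like the sample above, and it treats "Ã" and "Â" as signs of UTF-8 bytes misread as cp1252. That marker list is a heuristic of my own, and "Ã" can also be legitimate (for example in Portuguese), so hits need a manual look. `find_suspect_records` and `MOJIBAKE_MARKERS` are hypothetical names, not part of any OpenAI tool.

```python
import json

# Heuristic assumption: these characters often appear when UTF-8 bytes
# are misread as cp1252. They can also occur in legitimate text.
MOJIBAKE_MARKERS = ("Ã", "Â")

def find_suspect_records(jsonl_text):
    """Return (line_number, field, value) for string values containing a marker."""
    suspects = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for field, value in record.items():
            if isinstance(value, str) and any(m in value for m in MOJIBAKE_MARKERS):
                suspects.append((number, field, value))
    return suspects

# Two illustrative records; the second one carries a badly encoded "é".
sample = (
    '{"prompt": "Usuario: Quién eres\\nAsistente:", "completion": " Soy un Asistente\\n"}\n'
    '{"prompt": "Usuario: Cómo estás\\nAsistente:", "completion": " CuÃ©ntame algo de ti\\n"}\n'
)

for hit in find_suspect_records(sample):
    print(hit)
```

If the scan finds nothing in the training file, the bad characters are probably not coming from the dataset.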

What should I do? Should I encode the dataset in UTF-8? Or should I leave the dataset in UTF-8 and decode the badly encoded characters in the response?

The OpenAI docs for fine-tuning don't include anything about encoding.

newiatester

1 Answer

I faced the same issue dealing with Portuguese strings.

Try calling .encode("cp1252").decode() on the badly encoded string:

"CuÃ©ntame algo de ti".encode("cp1252").decode()

This should result in:

"Cuéntame algo de ti"

cp1252 relates to the windows-1252 Western Europe codec. If that's not working, try another codec from here: https://docs.python.org/3.7/library/codecs.html#standard-encodings
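
To make the mechanism explicit, here is a small round-trip sketch. It assumes the damage comes from UTF-8 bytes being decoded as cp1252 somewhere along the way, which is a guess about the cause rather than something confirmed for the OpenAI output. It also shows that applying the same fix to text that is already correct raises an error.

```python
original = "Cuéntame algo de ti"

# How the damage can happen: UTF-8 bytes wrongly decoded as cp1252.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)  # CuÃ©ntame algo de ti

# The fix reverses it: back to bytes via cp1252, then decode as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)  # True

# Applying the fix to text that is already correct fails, because the
# cp1252 byte for "é" (0xE9) followed by "n" is not valid UTF-8.
try:
    original.encode("cp1252").decode("utf-8")
    clean_survives = True
except UnicodeDecodeError:
    clean_survives = False
print(clean_survives)  # False
```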

Jeremy Caney
  • There's a problem when applying this to a string that contains both correctly encoded and badly encoded characters, which is my case. I think the model is merging different sentences, some well encoded and some not, so this doesn't solve the problem. Maybe I trained the model incorrectly... An example would be: "Estoy bien, ¿y tú? CuÃ©ntame algo de ti". With this sentence, I don't know what to do. – newiatester Feb 23 '22 at 16:48
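
For a mixed string like the one in this comment, one possible approach is to repair only the short runs that turn into a single character when pushed through cp1252 and back into UTF-8, and to leave everything else alone. This is a minimal sketch under that assumption; `repair_mixed` is a hypothetical helper, not a library function. It has known limits: Python's cp1252 codec leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so characters whose UTF-8 form contains one of those bytes (such as "Á" or "Í") may not round-trip, and a text that legitimately contains a sequence like "Ã©" would be altered.

```python
def repair_mixed(text):
    """Repair cp1252-misread UTF-8 sequences, leaving clean text untouched."""
    out = []
    i = 0
    while i < len(text):
        # A UTF-8 character is 2 to 4 bytes, so try the longest run first.
        for width in (4, 3, 2):
            chunk = text[i:i + width]
            if len(chunk) < width:
                continue
            try:
                fixed = chunk.encode("cp1252").decode("utf-8")
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue
            # Accept only runs that collapse into exactly one character.
            if len(fixed) == 1:
                out.append(fixed)
                i += width
                break
        else:
            # No run starting here looked like mojibake: keep the character.
            out.append(text[i])
            i += 1
    return "".join(out)

mixed = "Estoy bien, ¿y tú? CuÃ©ntame algo de ti"
print(repair_mixed(mixed))
```

Correctly encoded characters such as "¿" and "ú" pass through unchanged because their cp1252 bytes do not form a valid UTF-8 sequence with the characters that follow them.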