I have a question about language detection for short strings. I need to detect the language of messages sent in a chat, and I am facing two problems:
- the length of the messages
- the errors and noise they may contain (emoji, etc.)
The noise is manageable: I clean the message first and that works fine. The length of the message is the real problem. For example, if a user says "hi", fastText detects the language as German, whereas Google Translate detects English, and the message is most likely English. So I am trying to train my own fastText model, but how can I adjust the model to get better results on short strings? Do I need to train it on dictionaries for many languages to get better results?
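To give an idea of the cleaning step, here is a minimal sketch of the kind of preprocessing I mean. It is not my exact code: `clean_message` and the two regular expressions are illustrative names, and the rule "keep letters, marks, digits, apostrophes and hyphens, drop everything else" is an assumption that may be too aggressive for some languages.

```python
import re
import unicodedata

# Hypothetical patterns for this sketch: URLs and @mentions / #hashtags.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"[@#]\w+")


def clean_message(text: str) -> str:
    """Drop URLs, mentions, emoji and punctuation, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    kept = []
    for ch in text:
        # Emoji variation selectors are combining marks; drop them explicitly.
        if "\ufe00" <= ch <= "\ufe0f":
            kept.append(" ")
            continue
        category = unicodedata.category(ch)
        # Keep letters (L*), combining marks (M*), digits (N*), and a little
        # word-internal punctuation; emoji fall under "So" and are dropped.
        if category[0] in ("L", "M", "N") or ch in "'-":
            kept.append(ch)
        else:
            kept.append(" ")
    # split() with no argument also removes leading and trailing whitespace.
    return " ".join("".join(kept).split())


print(clean_message("Hi 😀!!  check https://example.com @bob"))
```

The case of the text is left untouched on purpose, since I do not know whether lowercasing helps or hurts the detector.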
I use fastText because it is the most accurate language detector I have found. Here is an example of the problem with fastText:
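# Download the pretrained language identification model first:
# wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
import fasttext
text = "Hi"
pretrained_lang_model = "lid.176.bin"
model = fasttext.load_model(pretrained_lang_model)
# Ask for the two most likely languages and their probabilities
predictions = model.predict(text, k=2)
print(predictions)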
# (('__label__de', '__label__en'), array([0.51606238, 0.31865335]))
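To make that output easier to read, the labels and scores can be paired up. This is only a small sketch: `unpack_predictions` is a hypothetical helper, not part of fastText, and it assumes the `(labels, probabilities)` shape printed above. The example values are copied from that output so the snippet runs without the model file.

```python
def unpack_predictions(predictions):
    """Turn (labels, probabilities) into a list of (language, probability) pairs."""
    labels, probs = predictions
    prefix = "__label__"
    pairs = []
    for label, prob in zip(labels, probs):
        # Strip the "__label__" prefix that appears in the printed output.
        lang = label[len(prefix):] if label.startswith(prefix) else label
        pairs.append((lang, float(prob)))
    return pairs


# Values copied from the output shown above.
example = (("__label__de", "__label__en"), [0.51606238, 0.31865335])
print(unpack_predictions(example))
# [('de', 0.51606238), ('en', 0.31865335)]
```

Seen this way, German and English are separated by only about 0.2, and neither score is high, which is what I mean by unreliable results on short strings.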