
I have a question about language detection for short strings. I need to detect the language of text sent in a chat, and I am faced with two problems:

  • the length of the message
  • the errors that may be in it and the noise (emoji, etc.)

For the noise, I clean the message and that works fine, but the length of the message is a problem. For example, if a user says "hi", fastText detects the text as German, while Google Translate detects it as English, which is by far the most likely language here. So I tried to train my own fastText model, but how can I adjust the model to get better results on short strings? Do I need to train the model with dictionaries of many languages to get better results?
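To illustrate the cleaning step, here is a minimal sketch (the exact patterns are illustrative, not my actual code; it assumes Discord-style noise):

```py
import re

def clean_message(text: str) -> str:
    """Strip URLs, Discord emotes/mentions, and emoji before detection."""
    text = re.sub(r"https?://\S+", " ", text)    # URLs
    text = re.sub(r"<a?:\w+:\d+>", " ", text)    # Discord custom emotes
    text = re.sub(r"<[@#][!&]?\d+>", " ", text)  # user/channel/role mentions
    text = re.sub(r"[\U0001F000-\U0001FAFF\u2600-\u27BF]", " ", text)  # emoji
    return re.sub(r"\s+", " ", text).strip()

print(clean_message("Hi 👋 <:wave:123456> https://example.com"))  # -> "Hi"
```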

I use fastText because it is the most accurate language detector I have found. Here is an example of the problem with fastText:

```py
# Download the pretrained model first:
# wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
import fasttext

text = "Hi"

# Load fastText's pretrained language-identification model (176 languages)
pretrained_lang_model = "lid.176.bin"
model = fasttext.load_model(pretrained_lang_model)

# Ask for the two most probable languages
predictions = model.predict(text, k=2)
print(predictions)
# (('__label__de', '__label__en'), array([0.51606238, 0.31865335]))
```
  • Did you try using existing packages like those introduced [here](https://stackoverflow.com/questions/39142778/how-to-determine-the-language-of-a-piece-of-text)? – meti Dec 19 '22 at 18:20
  • Yes, and fastText is the most accurate library according to this article: https://towardsdatascience.com/benchmarking-language-detection-for-nlp-8250ea8b67c – Jourdelune Dec 19 '22 at 19:48

2 Answers


In my experience, common approaches based on fastText or other classifiers struggle with short texts.

You could try lingua, a language detection library that is available for Python, Java, Go, and Rust.

Among its strengths:

...yields pretty accurate results on both long and short text, even on single words and phrases.

It draws on both rule-based and statistical methods but does not use any dictionaries of words.

It does not need a connection to any external API or service either.

As you can read [here](https://github.com/pemistahl/lingua-py#5-why-is-it-better-than-other-libraries), it seems that in lingua you can also restrict the set of languages to be considered.
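For example, a minimal sketch using lingua-py's builder (the language set here is just an illustration; pick the languages you actually expect):

```py
from lingua import Language, LanguageDetectorBuilder

# Restrict detection to a handful of expected languages; with fewer
# candidates, short strings like "Hi" are much less ambiguous.
detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH
).build()

print(detector.detect_language_of("Hi"))  # likely Language.ENGLISH
```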

  • I have already tested lingua, but its results are not impressive for short text. For example, this script returns Sotho as the first language:
    ```py
    from lingua import Language, LanguageDetectorBuilder

    detector = LanguageDetectorBuilder.from_all_languages().build()
    text = """ Hello """
    confidence_values = detector.compute_language_confidence_values(text.strip())
    for language, value in confidence_values:
        print(f"{language.name}: {value:.2f}")
    ```
    – Jourdelune Dec 19 '22 at 21:02
  • As you can read here https://github.com/pemistahl/lingua-py#5-why-is-it-better-than-other-libraries it seems that in Lingua you can restrict the set of languages to be considered. Does this help? – Stefano Fiorucci - anakin87 Dec 19 '22 at 21:10
  • I have to detect the message language on Discord, and the problem is that the messages can be in any language... and very short! It is true that 40 languages should be enough for very short texts. To deal with this problem, I now think the best solution is to restrict the number of languages when a text is shorter than 50 characters, for example... – Jourdelune Dec 19 '22 at 21:17
  • @Jourdelune Another option that may help to improve performance is to have default options when one word (e.g. "hi") can be used in many languages, and then to set the default based on the native-language probability distribution of users (if you can access that). Finally, you can set the default but have it update as the user engages in more dialogue, using the additional dialogue as a larger input to the model until you reach some confidence threshold (see the sketch after this thread). – Kyle F Hartzenberg Dec 20 '22 at 01:33
  • I can access the user's native language, so I think that could be interesting. Moreover, users write in about 30 languages, so I can also train a detector for those 30 languages; the shorter the text, the more important this detector becomes. – Jourdelune Dec 20 '22 at 08:39
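The idea from the last two comments could be sketched like this (a hypothetical helper, not tested code; `detector` is assumed to be a lingua `LanguageDetector` from the version used above, where confidence values unpack as `(language, value)` pairs, and `native_language` comes from user metadata):

```py
history: dict[int, list[str]] = {}  # user_id -> messages seen so far

def detect_with_prior(user_id, message, native_language, detector,
                      min_chars=50, threshold=0.8):
    """Fall back to the user's native language until enough text accumulates."""
    history.setdefault(user_id, []).append(message)
    context = " ".join(history[user_id])
    if len(context) < min_chars:
        return native_language  # too little text: trust the prior
    # Confidence values are sorted in descending order, so take the top one
    best_language, best_value = detector.compute_language_confidence_values(context)[0]
    return best_language if best_value >= threshold else native_language
```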

I have found a way to get better results: sum the probabilities of all languages across different detectors, such as fastText and lingua, and for short texts add a dictionary-based detection on top. This gives very good results (for my task, I also trained a fastText model on my own data). I made a demo for this, but the moderators did not accept it, so I can't post the link to the repo.
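A rough sketch of this ensemble, without the dictionary step (names are illustrative; it assumes the lid.176.bin model from the question and the lingua version used in the comments above):

```py
from collections import defaultdict

import fasttext
from lingua import LanguageDetectorBuilder

ft_model = fasttext.load_model("lid.176.bin")
lingua_detector = LanguageDetectorBuilder.from_all_languages().build()

def detect(text: str) -> str:
    """Sum per-language scores from fastText and lingua, keyed by ISO code."""
    scores = defaultdict(float)
    labels, probs = ft_model.predict(text, k=10)  # top-10 fastText candidates
    for label, prob in zip(labels, probs):
        scores[label.replace("__label__", "")] += float(prob)
    for language, value in lingua_detector.compute_language_confidence_values(text):
        scores[language.iso_code_639_1.name.lower()] += value
    return max(scores, key=scores.get)

print(detect("Hi"))  # the combined vote should favour "en"
```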
