I am using fastText (v0.9.1) to detect the language of a text (see this).

Norwegian text is being detected as Danish when using this model.

!curl "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin" > lid.bin

import fasttext  # the pip package installs the module under the lowercase name

language_detector = fasttext.load_model('lid.bin')
language_detector.predict('Hei Jeg viser til hyggelig sam', k=3)

Output:

(('__label__da', '__label__no', '__label__hu'),
array([9.16624188e-01, 8.25065151e-02, 2.37607688e-04]))

Any help?

mahala
  • I've not typically seen FastText used for language-detection. (For example, I've not seen common FT implementations include any API calls which, when handed a text, return a language-identifier.) So, showing the actual code you're using to arrive at this determination could help clarify how your process might succeed/fail with certain texts or languages. – gojomo Nov 10 '20 at 17:16
  • https://fasttext.cc/blog/2017/10/02/blog-post.html - it can be used for detecting languages. – mahala Nov 12 '20 at 11:55
  • Thanks! If that's the specific multi-step technique you're using, perhaps the quality of your results on certain examples is heavily influenced by the training data you're using. What training data are you using, and could you use a larger/better training set to get better results? (Perhaps even: whenever you encounter a known error, include properly labeled NO and DA versions of the difficult text and re-train, to specifically improve on the hard cases?) – gojomo Nov 12 '20 at 21:31
  • I tested fastText on 15k Danish samples and 700 Norwegian samples and it identified them correctly with 95% and 80% accuracy, respectively. See https://www.npmjs.com/package/@smodin/fast-text-language-detection for the research data. – Kevin Danikowski Nov 29 '21 at 04:29

1 Answer

It seems that distinguishing between Norwegian and Danish is difficult (see this).

fastText is not particularly suitable for this task.

You can try polyglot, a Python library dedicated to multilingual NLP.

from polyglot.detect import Detector

detector = Detector('Hei Jeg viser til hyggelig sam')
print(detector)

Output:

Prediction is reliable: True
Language 1: name: Norwegian   code: no       confidence:  96.0 read bytes:  1189
Language 2: name: un          code: un       confidence:   0.0 read bytes:     0
Language 3: name: un          code: un       confidence:   0.0 read bytes:     0

A small note: if you install polyglot, be careful with its dependencies (read this and this).
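
If you would rather stay with fastText, one possible workaround is to look at the top-k output instead of trusting only the first label, and to flag the prediction as uncertain when the two best candidates are both in a group of closely related languages. The following is only a minimal sketch under stated assumptions: it does not call fastText at all, it just post-processes the (labels, probabilities) pair copied from the question. The names `rank_languages`, `is_ambiguous`, `CONFUSABLE` and the 0.05 threshold are hypothetical choices made for illustration, not part of fastText.

```python
from typing import List, Sequence, Tuple

LABEL_PREFIX = "__label__"

# Assumption: languages we consider easy to confuse with each other.
CONFUSABLE = {"da", "no"}


def rank_languages(labels: Sequence[str], probs: Sequence[float]) -> List[Tuple[str, float]]:
    """Pair each label with its probability and strip the '__label__' prefix."""
    ranked = []
    for label, prob in zip(labels, probs):
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        ranked.append((code, float(prob)))
    return ranked


def is_ambiguous(ranked: Sequence[Tuple[str, float]], min_runner_up: float = 0.05) -> bool:
    """Return True when the two best candidates are both in CONFUSABLE and
    the runner-up still has a non-negligible probability (hypothetical threshold)."""
    if len(ranked) < 2:
        return False
    (top_code, _), (second_code, second_prob) = ranked[0], ranked[1]
    return (
        top_code in CONFUSABLE
        and second_code in CONFUSABLE
        and second_prob >= min_runner_up
    )


# Output of predict(..., k=3) as shown in the question.
labels = ('__label__da', '__label__no', '__label__hu')
probs = [9.16624188e-01, 8.25065151e-02, 2.37607688e-04]

ranked = rank_languages(labels, probs)
print(ranked[0][0], is_ambiguous(ranked))  # prints: da True
```

For the example from the question this reports the prediction as ambiguous, so you could then fall back to another detector (such as polyglot) or to a rule of your own for that text. The threshold would need tuning on your own data.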