fastText model detecting Norwegian text as Danish

Question

I am using fasttext (v=0.9.1) to detect the language of a text (see this).

Norwegian text is being detected as Danish when using the lid.176.bin model.

!curl "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin" > lid.bin

import fasttext  # the pip package (0.9.x) is imported in lowercase

language_detector = fasttext.load_model('lid.bin')
# Ask for the three most likely languages
language_detector.predict('Hei Jeg viser til hyggelig sam', k=3)

Output:

(('__label__da', '__label__no', '__label__hu'),
array([9.16624188e-01, 8.25065151e-02, 2.37607688e-04]))

Any help?

I've not typically seen FastText used for language-detection. (For example, I've not seen common FT implementations include any API calls which, when handed a text, return a language-identifier.) So, showing the actual code you're using to arrive at this determination could help clarify how your process might succeed/fail with certain texts or languages. — gojomo, Nov 10 '20 at 17:16
https://fasttext.cc/blog/2017/10/02/blog-post.html - it can be used for detecting languages. — mahala, Nov 12 '20 at 11:55
Thanks! If that's the specific multi-step technique you're using, perhaps the quality of your results on certain examples is heavily influenced by the training data you're using. What training data are you using, and could you use a larger/better training set to get better results? (Perhaps even: whenever you encounter a known error, include properly-labeled NO and DA versions of the difficult text, & re-train, to specifically improve on the hard cases?) — gojomo, Nov 12 '20 at 21:31
I tested fastText on 15k Danish samples and 700 Norwegian samples, and it identified them correctly 95% and 80% of the time, respectively. See https://www.npmjs.com/package/@smodin/fast-text-language-detection for the research data. — Kevin Danikowski, Nov 29 '21 at 04:29
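
The retraining idea from the comments above could be sketched roughly as follows. This is only an illustration, not something anyone in the thread actually ran: the helper names (to_fasttext_line, write_training_file, train_language_classifier) are hypothetical, the Danish sentence is made up, and a real training set would need far more than two examples. The only fastText call used is fasttext.train_supervised, which expects one example per line prefixed with __label__<code>.

```python
from pathlib import Path


def to_fasttext_line(label, text):
    """Format one example as '__label__<code> <text>' on a single line."""
    # fastText reads one example per line, so collapse newlines and tabs.
    cleaned = " ".join(text.split())
    return f"__label__{label} {cleaned}"


def write_training_file(examples, path):
    """Write (label, text) pairs to a fastText training file; return the count."""
    lines = [to_fasttext_line(label, text) for label, text in examples]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def train_language_classifier(path):
    """Train a supervised fastText model on the file written above."""
    import fasttext  # imported lazily so the helpers work without the package

    return fasttext.train_supervised(input=str(path))


# Hard cases collected from misclassifications, with the correct label.
# The first text is the one from the question; the second is invented.
examples = [
    ("no", "Hei Jeg viser til hyggelig sam"),
    ("da", "Hej, jeg henviser til vores hyggelige samtale"),
]
```

Calling write_training_file(examples, "train.txt") and then train_language_classifier("train.txt") would produce a small model of your own. Whether that beats the pretrained one on Norwegian versus Danish depends entirely on how much labeled text you can gather.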

score 1 · Answer 1 · answered Nov 10 '20 at 15:33


It seems that distinguishing the Norwegian and Danish languages is difficult (see this).

fastText is not particularly suitable for this task.

You can try polyglot, a Python library dedicated to multilingual NLP.

from polyglot.detect import Detector

detector = Detector('Hei Jeg viser til hyggelig sam')
print(detector)

Output:

Prediction is reliable: True
Language 1: name: Norwegian   code: no       confidence:  96.0 read bytes:  1189
Language 2: name: un          code: un       confidence:   0.0 read bytes:     0
Language 3: name: un          code: un       confidence:   0.0 read bytes:     0

A little note: if you install polyglot, please be careful with dependencies (read this and this).
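
If you want to keep fastText and only double-check the doubtful cases, one possible approach is to strip the __label__ prefix from its output and flag predictions where Danish and Norwegian take the top two places. The sketch below is an assumption-laden heuristic rather than anything fastText provides: the function names, the CONFUSABLE set and the 0.05 threshold are all hypothetical choices. It works on plain tuples or lists, so it can be fed the result of predict(..., k=3) directly.

```python
LABEL_PREFIX = "__label__"
CONFUSABLE = {"da", "no"}  # assumption: the pair this question is about


def parse_predictions(labels, probabilities):
    """Turn fastText's (labels, probabilities) pair into [(code, prob), ...]."""
    parsed = []
    for label, prob in zip(labels, probabilities):
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        parsed.append((code, float(prob)))
    return parsed


def is_ambiguous(predictions, confusable=CONFUSABLE, min_runner_up=0.05):
    """True when the top two codes are both confusable and the runner-up
    has at least min_runner_up probability (an arbitrary cut-off)."""
    if len(predictions) < 2:
        return False
    (top, _), (second, second_prob) = predictions[0], predictions[1]
    return (
        top != second
        and top in confusable
        and second in confusable
        and second_prob >= min_runner_up
    )


# Values copied from the output shown in the question.
predictions = parse_predictions(
    ("__label__da", "__label__no", "__label__hu"),
    [9.16624188e-01, 8.25065151e-02, 2.37607688e-04],
)
needs_second_opinion = is_ambiguous(predictions)
```

For the text in the question the flag comes out True, so that text could then be handed to a second detector such as polyglot. The threshold would need tuning on your own data.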

answered Nov 10 '20 at 15:33

Stefano Fiorucci - anakin87


Thanks @anakin87. Will check this with respect to performance, model size, etc. – mahala Nov 12 '20 at 11:59
I edited your question. When the staff reopens it, please accept my answer. – Stefano Fiorucci - anakin87 Nov 12 '20 at 12:14
@anakin87 - your edit, unfortunately, is invalid. That isn't the OP's original question, and edits should not be that extreme. What you suggested was not ever part of the OP's question, and doesn't belong there, as such. – David Makogon Nov 21 '20 at 01:28
@DavidMakogon maybe you're right, but in the edit I only added the code that mahala used but did not report. – Stefano Fiorucci - anakin87 Nov 21 '20 at 09:34
