
Background

I would like to classify all three of the following phrases as Chinese ('zh') using fastText.

["Ni hao!", '你好!', 'ni hao!']

However, the pre-trained model does not seem to work well for this kind of classification.

Is there a way to accomplish the same task differently?

Output

[('zh', 0.9305274486541748)]
[('eo', 0.9765485525131226)]
[('hr', 0.6364055275917053)]

Code

sample.py

from fasttext import load_model
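# lid.176.bin: fastText's pre-trained language identification model (176 languages)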
model = load_model("lid.176.bin")

speech_texts = ["Ni hao!", '你好!', 'ni hao!']

def categolize_func(texts, model, k):
    # Predict the top-k language labels for each text and strip the
    # "__label__" prefix that fastText prepends to its class names
    results = []
    for text in texts:
        labels, probs = model.predict(text, k)
        results.append(list(zip([l.replace("__label__", "") for l in labels], probs)))
    return results

for result in categolize_func(speech_texts, model, 1):
    print(result)
  • Maybe this answer (https://stackoverflow.com/questions/64769198/fasttext-models-detecting-norwegian-text-as-danish/64771913#64771913) can provide some new ideas for you. In any case, your sentences seem difficult to classify correctly, with both fasttext and polyglot. – Stefano Fiorucci - anakin87 Jan 13 '21 at 08:21

1 Answer


I do not think this is a fair assessment of the FastText model. It was trained on much longer sentences than the ones you are using in your quick test, so there is a sort of train-test data mismatch. I would also guess that most of the Chinese data the model saw at training time was not in Latin script, so it may well have trouble with romanized Chinese.
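As a quick illustration of the length effect (the longer sentence below is my own example, not from the question, and the exact scores will vary), you can compare predictions for a longer Chinese-script sentence and the short romanized greeting:

from fasttext import load_model

model = load_model("lid.176.bin")

# Longer sentence in Chinese script: typically classified as "zh" with high confidence
print(model.predict("你好，很高兴认识你，希望我们以后可以常联系。", k=1))

# Very short, romanized greeting: much harder for the model
print(model.predict("ni hao!", k=1))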

There exist other models for language identification.

However, I would suspect that all of them will have problems with such short text snippets. If this is really what your data look like, then the best thing would be to train your own FastText model on training data that matches your use case. For instance, if you are only interested in detecting Chinese, you can classify into just two classes: Chinese and non-Chinese.
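A minimal sketch of that approach, assuming you have a labelled file in fastText's supervised format (the file name, label names, and hyperparameters below are placeholders, not values from the answer):

import fasttext

# zh_vs_other.train is a hypothetical training file with one example per line, e.g.
#   __label__zh 你好!
#   __label__zh Ni hao!
#   __label__other Hello, how are you?
model = fasttext.train_supervised(
    input="zh_vs_other.train",
    epoch=25,          # more epochs help on small training sets
    lr=0.5,
    minn=1, maxn=4,    # character n-grams pick up cues like "ni hao"
    wordNgrams=2,
)

model.save_model("zh_detector.bin")
print(model.predict("ni hao!", k=1))

With only two classes and training data that actually contains short, romanized phrases, such a classifier has a much better chance on inputs like these than a general 176-language model.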

Jindřich