
Background

I would like to classify all three of the following phrases as Chinese ('zh') using fastText.

["Ni hao!", '你好!', 'ni hao!']

However, the pre-trained model does not seem to work well for this kind of classification.

Is there a way to accomplish the same task differently?

Output

[('zh', 0.9305274486541748)]
[('eo', 0.9765485525131226)]
[('hr', 0.6364055275917053)]

Code

sample.py

from fasttext import load_model
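# lid.176.bin: fastText's pre-trained language identification model (176 languages)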
model = load_model("lid.176.bin")

speech_texts = ["Ni hao!", '你好!', 'ni hao!']

def categolize_func(texts, model, k):
    # Predict the top-k language labels for each text and strip the
    # "__label__" prefix that fastText prepends to its class names
    results = []
    for text in texts:
        labels, probs = model.predict(text, k)
        results.append(list(zip([l.replace("__label__", "") for l in labels], probs)))
    return results

for result in categolize_func(speech_texts, model, 1):
    print(result)
  • Maybe this answer (https://stackoverflow.com/questions/64769198/fasttext-models-detecting-norwegian-text-as-danish/64771913#64771913) can provide some new ideas for you. In any case, your sentences seem difficult to classify correctly, with both fasttext and polyglot. – Stefano Fiorucci - anakin87 Jan 13 '21 at 08:21

1 Answer


I do not think this is a fair assessment of the FastText model. It was trained on much longer sentences than the ones you are using in your quick test, so there is a sort of train-test data mismatch. I would also guess that most of the Chinese data the model saw at training time was not in Latin script, so it may well have trouble with romanized Chinese.
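As a quick illustration of the length effect (the longer sentence below is my own example, not from the question, and the exact scores will vary), you can compare predictions for a longer Chinese-script sentence and the short romanized greeting:

from fasttext import load_model

model = load_model("lid.176.bin")

# Longer sentence in Chinese script: typically classified as "zh" with high confidence
print(model.predict("你好，很高兴认识你，希望我们以后可以常联系。", k=1))

# Very short, romanized greeting: much harder for the model
print(model.predict("ni hao!", k=1))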

There exist other models for language identification.

However, I would suspect that all of them will have problems with such short text snippets. If this is really what your data look like, then the best thing would be to train your own FastText model on training data that matches your use case. For instance, if you are only interested in detecting Chinese, you can classify into just two classes: Chinese and non-Chinese.
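A minimal sketch of that approach, assuming you have a labelled file in fastText's supervised format (the file name, label names, and hyperparameters below are placeholders, not values from the answer):

import fasttext

# zh_vs_other.train is a hypothetical training file with one example per line, e.g.
#   __label__zh 你好!
#   __label__zh Ni hao!
#   __label__other Hello, how are you?
model = fasttext.train_supervised(
    input="zh_vs_other.train",
    epoch=25,          # more epochs help on small training sets
    lr=0.5,
    minn=1, maxn=4,    # character n-grams pick up cues like "ni hao"
    wordNgrams=2,
)

model.save_model("zh_detector.bin")
print(model.predict("ni hao!", k=1))

With only two classes and training data that actually contains short, romanized phrases, such a classifier has a much better chance on inputs like these than a general 176-language model.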

Jindřich