
I have a text corpus with item descriptions in English, Russian, and Polish.

This corpus has 68K observations; some are written in English, some in Russian, and some in Polish.

Could you tell me how to implement word stemming properly and cost-efficiently in this case? I cannot use an English stemmer on Russian words, and vice versa.

Unfortunately, I could not find a good language identifier. For example, langdetect works too slowly and is often incorrect. When I try to identify the language of the English word 'today':

detect("today")
# "so"
# i.e. Somali

So far my implementation looks bad: I just apply one stemmer after another:

import nltk
# Polish stemmer
from pymorfologik import Morfologik

clean_items = []

# create stemmers
snowball_en = nltk.SnowballStemmer("english")
snowball_ru = nltk.SnowballStemmer("russian")
stemmer_pl = Morfologik()

# loop over each item description (items is a pandas Series)
for i in range(len(items)):
    cleaned = items.iloc[i]

    # stem with all three stemmers, one after another
    clean_items.append(snowball_ru.stem(stemmer_pl(snowball_en.stem(cleaned))))
  • How about detecting the language of a sentence/token of the text first and then using the appropriate stemmer? – grshankar Aug 27 '18 at 12:24
  • You can make a rough word-language classifier by exploiting character existence and/or frequencies, and phonotactics. You might even add a fourth class for words that can't be classified and, likely due to length, don't even need to be (e.g. English article "a", Czech conjunction "a"). – Amadan Aug 27 '18 at 12:25
  • Or use a ready-made one: https://stackoverflow.com/questions/3182268/nltk-and-language-detection – Amadan Aug 27 '18 at 12:26
  • @grshankar, @Amadan, unfortunately, I could not find a good language identifier. E.g. `langdetect` works too slowly and often incorrectly: when I try to identify the word 'today', `detect("today")` prints "so", i.e. Somali. – lemon Aug 27 '18 at 12:43
  • @lemon I have used `langdetect` too. I agree it is slow, but it gave good results for sentences (I did not use it on tokenized words). Have you given `langid` a try? I cannot guarantee the quality as I haven't used it myself, but it might be worth a shot. – grshankar Aug 27 '18 at 12:49
  • @grshankar, did you use it for whole sentences? In my case, one sentence can contain both Russian and English words. – lemon Aug 27 '18 at 12:55
  • I haven't used `langid`. Thank you for the advice, I will try it! – lemon Aug 27 '18 at 12:56
  • That's part of the reason I suggested you make your own classifier. But even `langdetect` can be okay if you tweak it (because honestly the API is kind of a mess :P). Why check whether it's Somali if all you have is English, Russian and Polish? See [here](https://gist.github.com/amadanmath/77d9bae747268f97eab2e22e7cd0a364). – Amadan Aug 28 '18 at 03:33
  • @Amadan, thank you for your response! You could post your solution as an answer to this question so that I can accept it. – lemon Aug 28 '18 at 10:18
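
The character-based classifier suggested in the comments can be sketched in a few lines. This is a rough heuristic, not a library function: `guess_lang` and the regex names are illustrative, and it assumes only English, Russian, and Polish ever occur in the corpus.

```python
import re

# Character ranges that pin down the language among these three:
CYRILLIC = re.compile(r'[\u0400-\u04FF]')               # only Russian items use Cyrillic
POLISH_DIACRITICS = re.compile(r'[ąćęłńóśźżĄĆĘŁŃÓŚŹŻ]')  # Latin letters unique to Polish here

def guess_lang(word):
    """Very rough single-token classifier: returns 'ru', 'pl', 'en', or None."""
    if CYRILLIC.search(word):
        return 'ru'
    if POLISH_DIACRITICS.search(word):
        return 'pl'
    if len(word) < 2:
        return None  # too short to classify, and probably not worth stemming anyway
    # Caveat: Polish words written without diacritics fall through to 'en'.
    return 'en'
```

This misclassifies diacritic-free Polish tokens as English, so it is best treated as a fast pre-filter before a heavier detector.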

1 Answer


Even though the API is not that great, you can make langdetect restrict itself to only the languages you are actually working with. For example:

from langdetect.detector_factory import DetectorFactory, PROFILES_DIRECTORY
import os

def get_factory_for(langs):
    df = DetectorFactory()
    profiles = []
    for lang in langs:
        with open(os.path.join(PROFILES_DIRECTORY, lang), 'r', encoding='utf-8') as f:
            profiles.append(f.read())
    df.load_json_profile(profiles)

    def _detect_langs(text):
        d = df.create()
        d.append(text)
        return d.get_probabilities()

    def _detect(text):
        d = df.create()
        d.append(text)
        return d.detect()

    df.detect_langs = _detect_langs
    df.detect = _detect
    return df

While unrestricted langdetect seems to think "today" is Somali, if you only have English, Russian and Polish you can now do this:

df = get_factory_for(['en', 'ru', 'pl'])
df.detect('today')         # 'en'
df.detect_langs('today')   # [en:0.9999988994459187]

It will still miss a lot ("snow" is apparently Polish), but it will drastically cut down your error rate.
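
To tie this back to the stemming question, the detector can drive a per-token dispatch to the right stemmer. This is a sketch: `stem_word` and the `stemmers` dict are illustrative names, not part of langdetect or NLTK, and the wiring at the bottom assumes nltk and pymorfologik are installed.

```python
def stem_word(word, detector, stemmers, default='en'):
    """Stem a single token with the stemmer matching its detected language.

    `detector` is anything with a .detect(text) -> language-code method
    (e.g. the restricted factory above); `stemmers` maps language codes to
    stemming callables. Falls back to `default` when detection fails or the
    detected language has no stemmer.
    """
    try:
        lang = detector.detect(word)
    except Exception:  # langdetect raises on inputs with no usable features
        lang = default
    return stemmers.get(lang, stemmers[default])(word)

# Wiring it up with the stemmers from the question would look roughly like:
#
#   stemmers = {
#       'en': nltk.SnowballStemmer('english').stem,
#       'ru': nltk.SnowballStemmer('russian').stem,
#       'pl': ...,  # adapt pymorfologik's list-based API to a per-word callable
#   }
#   clean_items = [' '.join(stem_word(w, df, stemmers) for w in item.split())
#                  for item in items]
```

Running detection per token is slower than per sentence, but it handles the mixed-language sentences mentioned in the comments.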
