I have a text corpus of 68K item descriptions. Each observation is written in one of three languages: English, Russian, or Polish.
Could you tell me how to implement word stemming properly and cost-efficiently in this case? I cannot use an English stemmer on Russian words and vice versa.
Unfortunately, I could not find a good language identifier. For example, langdetect
works too slowly and is often wrong. When I try to identify the language of the English word 'today', it returns Somali:
detect("today")
'so'
# i.e. Somali
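(As an aside, langdetect's output on short strings is unstable unless `DetectorFactory.seed` is fixed, which may explain some of the misdetections.) Since the corpus only mixes three known languages, one cheap workaround is a hand-rolled script check instead of a general-purpose identifier. The helper below is hypothetical, not part of langdetect: any Cyrillic character implies Russian, Polish-specific diacritics imply Polish, and anything else falls through to English.

```python
# Hypothetical helper: distinguish the three corpus languages by script alone.
POLISH_DIACRITICS = set("ąćęłńśźż")

def guess_language(text):
    """Return 'russian', 'polish' or 'english' from the characters used."""
    # Any Cyrillic character -> Russian (the only Cyrillic language here)
    if any("\u0400" <= ch <= "\u04ff" for ch in text):
        return "russian"
    # Polish-specific diacritics -> Polish
    if any(ch in POLISH_DIACRITICS for ch in text.lower()):
        return "polish"
    # Plain Latin text defaults to English; Polish words written without
    # diacritics will be misrouted -- a known limitation of this sketch
    return "english"
```

For instance, `guess_language("today")` gives `"english"` and `guess_language("сегодня")` gives `"russian"`. This is O(length of string) per item, so it should be far cheaper than a statistical detector on 68K rows.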
So far my implementation looks bad: I just run one stemmer on top of another:
import nltk
# Polish stemmer
from pymorfologik import Morfologik

clean_items = []

# create one stemmer per language
snowball_en = nltk.SnowballStemmer("english")
snowball_ru = nltk.SnowballStemmer("russian")
stemmer_pl = Morfologik()

# loop over every item and apply all three stemmers blindly,
# one on top of the other, regardless of the item's actual language
for i in range(len(items)):
    cleaned = items.iloc[i]
    # word stemming: English, then Polish, then Russian
    clean_items.append(snowball_ru.stem(stemmer_pl(snowball_en.stem(cleaned))))
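With any language guesser in place, the stemmers could be routed per item instead of chained. Below is a sketch, not a definitive implementation: `guess_language` is a hypothetical character-based check (Cyrillic means Russian, Polish diacritics mean Polish, otherwise English), and since NLTK has no Polish Snowball stemmer, the Polish slot is a stub where pymorfologik (or another Polish lemmatiser) would go.

```python
import nltk

def guess_language(text):
    """Crude script-based guess among the three corpus languages (hypothetical helper)."""
    if any("\u0400" <= ch <= "\u04ff" for ch in text):        # Cyrillic -> Russian
        return "russian"
    if any(ch in "ąćęłńśźż" for ch in text.lower()):          # diacritics -> Polish
        return "polish"
    return "english"

# One stemmer per language, looked up by the guessed language.
stemmers = {
    "english": nltk.SnowballStemmer("english").stem,
    "russian": nltk.SnowballStemmer("russian").stem,
    "polish": lambda word: word,  # placeholder: plug a real Polish stemmer here
}

def stem_item(text):
    """Stem every word of one item with the stemmer for its guessed language."""
    stem = stemmers[guess_language(text)]
    return " ".join(stem(word) for word in text.split())
```

For an English item, `stem_item("running dogs")` returns `"run dog"`; a Russian item is routed to the Russian stemmer and never touches the English one. The loop over the DataFrame then becomes `clean_items = [stem_item(items.iloc[i]) for i in range(len(items))]`.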