While doing count vectorization on Hindi text, the feature names come out truncated, as if they were being stemmed automatically.

from sklearn.feature_extraction.text import CountVectorizer
test = []
test.append("हमें फिल्म बहुत अच्छी लगी ।")
test.append("फिल्म में कुछ बेहतरीन गाने हैं ।")
cv = CountVectorizer().fit(test)
print(cv.get_feature_names())

output: ['अच', 'बह', 'लग', 'हतर', 'हम']

1 Answer

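The default analyzer of CountVectorizer() handles Devanagari badly. Its default token pattern, (?u)\b\w\w+\b, is matched with Python's re module, whose \w does not match combining marks such as the Hindi vowel signs and the virama, so each word is cut at those characters. You can define a custom analyzer to control how the words are separated. To separate the words properly, you can use a regex: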

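import regex  # third-party module: pip install regex
from sklearn.feature_extraction.text import CountVectorizer

def custom_analyzer(text):
    # Extract runs of at least 2 word characters; with the regex module,
    # \w is expected to cover the Devanagari combining marks as well.
    words = regex.findall(r'\w{2,}', text)
    for w in words:
        yield w

test = []
test.append("हमें फिल्म बहुत अच्छी लगी ।")
test.append("फिल्म में कुछ बेहतरीन गाने हैं ।")
count_vect = CountVectorizer(analyzer=custom_analyzer)
xv = count_vect.fit_transform(test)
# get_feature_names() was removed in scikit-learn 1.2; on older versions
# without get_feature_names_out(), call get_feature_names() instead.
print(count_vect.get_feature_names_out())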

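I used the regex module because its \w handles Unicode text such as Devanagari combining marks better than the \w of the re module; it is a matter of character classes rather than encodings (thanks to this answer for explaining).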
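If you would rather not install an extra module, here is a minimal standard-library sketch of what is going on and one possible workaround. The names `DEVANAGARI_WORD`, `categories`, `default_tokens` and `devanagari_tokens` are my own for illustration, and the character range is an assumption that your text is plain Devanagari: it covers the Devanagari block but leaves out the danda and double danda punctuation.

```python
import re
import unicodedata

word = "फिल्म"

# Unicode category of each code point: the consonants are letters ("Lo"),
# while the vowel sign and the virama are combining marks ("Mc" / "Mn").
categories = [(ch, unicodedata.category(ch)) for ch in word]

# Same pattern as CountVectorizer's default token_pattern. re's \w skips
# the combining marks, so no run of two or more word characters is left.
default_tokens = re.findall(r"(?u)\b\w\w+\b", word)

# Assumed alternative: match runs of Devanagari code points directly,
# excluding the danda (U+0964) and double danda (U+0965).
DEVANAGARI_WORD = re.compile(r"[\u0900-\u0963\u0966-\u097F]{2,}")

sentence = "हमें फिल्म बहुत अच्छी लगी ।"
devanagari_tokens = DEVANAGARI_WORD.findall(sentence)

print(categories)
print(default_tokens)
print(devanagari_tokens)
```

Since CountVectorizer compiles its token_pattern parameter with re, the same pattern string could probably be passed as token_pattern instead of writing a custom analyzer, but it would drop any non-Devanagari words (English, Latin digits), so treat it as a sketch rather than a drop-in replacement.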

XavierBrt