Personally, I would use scikit-learn to build a language ID system, as I have previously done (see https://github.com/alvations/bayesline). That said, it is totally possible to build a language ID system using the simple classification modules from NLTK and unicode data.
There is no need to do anything special to the NLTK code; it can be used as-is. (This might be useful to you for how to build a classifier in NLTK: nltk NaiveBayesClassifier training for sentiment analysis.)
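For orientation, here is a minimal toy sketch (my own example, not from that question) of the train/classify cycle of NLTK's NaiveBayesClassifier with boolean feature dictionaries, which is the same pattern used for language ID below:
from nltk import NaiveBayesClassifier
# Toy boolean features: does the document mention 'python'?
train = [({'contains(python)': True}, 'tech'),
         ({'contains(python)': False}, 'other')]
toy_classifier = NaiveBayesClassifier.train(train)
print toy_classifier.classify({'contains(python)': True})  # 'tech'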
Now, to show that it's totally possible to use NLTK out of the box for language ID with unicode data, see below.
First, note that for language ID there is a minor difference between extracting features from unicode characters and from byte strings:
from nltk.corpus import indian
# NLTK reads the corpus as byte strings.
hindi = " ".join(indian.words('hindi.pos'))
bangla = " ".join(indian.words('bangla.pos'))
marathi = " ".join(indian.words('marathi.pos'))
telugu = " ".join(indian.words('telugu.pos'))
# Prints out first 10 bytes (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print
# Decodes the UTF-8 byte strings into unicode.
hindi = hindi.decode('utf8')
bangla = bangla.decode('utf8')
marathi = marathi.decode('utf8')
telugu = telugu.decode('utf8')
# Prints out first 10 unicode characters (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print
[out]:
hindi: पूर
bangla: মহি
marathi: '' सन
telugu: 4 . ఆడ
hindi: पूर्ण प्रत
bangla: মহিষের সন্
marathi: '' सनातनवा
telugu: 4 . ఆడిట్
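To make the byte/character distinction concrete, here is a quick aside (my own illustration, not part of the original snippet): each Devanagari codepoint takes 3 bytes in UTF-8, which is why the 10-byte slices above cover only about three characters.
# -*- coding: utf-8 -*-
# Illustrative only: byte length vs. character length in Python 2.
s = 'पूर्ण'           # UTF-8 encoded byte string (5 codepoints x 3 bytes)
u = s.decode('utf8')  # the same text as a unicode string
print len(s)  # 15 (bytes)
print len(u)  # 5 (characters)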
Now that you see the difference between using byte strings and unicode, let's train a classifier.
from itertools import chain
from nltk.util import ngrams
from nltk import NaiveBayesClassifier as nbc
# Attach a language label to each document.
training = [(hindi, 'hi'), (bangla, 'ba'), (marathi, 'ma'), (telugu, 'te')]
# This is how you can extract character ngrams:
print ngrams(telugu[:10], 2)
print
print ngrams(hindi[:10], 3)
print
# Build the vocabulary of all character bigrams seen in training.
vocabulary = set(chain(*[ngrams(txt, 2) for txt, tag in training]))
# Featurize each document as a boolean vector over the bigram vocabulary.
feature_set = [({i: (i in ngrams(sentence, 2)) for i in vocabulary}, tag) for sentence, tag in training]
classifier = nbc.train(feature_set)
test1 = u'पूर्ण प्रत' # hindi
test2 = u'মহিষের সন্' # bangla
test3 = u'सनातनवा' # marathi
test4 = u'ఆడిట్ ' # telugu
for testdoc in [test1, test2, test3, test4]:
    featurized_test_sent = {i: (i in ngrams(testdoc, 2)) for i in vocabulary}
    print "test sent:", testdoc
    print "tag:", classifier.classify(featurized_test_sent)
    print
[out]:
[(u'4', u' '), (u' ', u'.'), (u'.', u' '), (u' ', u'\u0c06'), (u'\u0c06', u'\u0c21'), (u'\u0c21', u'\u0c3f'), (u'\u0c3f', u'\u0c1f'), (u'\u0c1f', u'\u0c4d'), (u'\u0c4d', u' ')]
[(u'\u092a', u'\u0942', u'\u0930'), (u'\u0942', u'\u0930', u'\u094d'), (u'\u0930', u'\u094d', u'\u0923'), (u'\u094d', u'\u0923', u' '), (u'\u0923', u' ', u'\u092a'), (u' ', u'\u092a', u'\u094d'), (u'\u092a', u'\u094d', u'\u0930'), (u'\u094d', u'\u0930', u'\u0924')]
test sent: पूर्ण प्रत
tag: hi
test sent: মহিষের সন্
tag: ba
test sent: सनातनवा
tag: ma
test sent: ఆడిట్
tag: te
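As a side note (my own addition, not output from the answer above), NLTK's NaiveBayesClassifier also lets you inspect what it learned, e.g. using the classifier and featurized_test_sent variables from the snippet above:
# Show the 5 character bigrams that most strongly separate the languages.
classifier.show_most_informative_features(5)
# Get per-language probabilities instead of a single hard label.
dist = classifier.prob_classify(featurized_test_sent)
for label in dist.samples():
    print label, dist.prob(label)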
Here's the full code:
# -*- coding: utf-8 -*-
from itertools import chain
from nltk.corpus import indian
from nltk.util import ngrams
from nltk import NaiveBayesClassifier as nbc
# NLTK reads the corpus as byte strings.
hindi = " ".join(indian.words('hindi.pos'))
bangla = " ".join(indian.words('bangla.pos'))
marathi = " ".join(indian.words('marathi.pos'))
telugu = " ".join(indian.words('telugu.pos'))
# Prints out first 10 bytes (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print
# Decodes the UTF-8 byte strings into unicode.
hindi = hindi.decode('utf8')
bangla = bangla.decode('utf8')
marathi = marathi.decode('utf8')
telugu = telugu.decode('utf8')
# Prints out first 10 unicode characters (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print
# Attach a language label to each document.
training = [(hindi, 'hi'), (bangla, 'ba'), (marathi, 'ma'), (telugu, 'te')]
# This is how you can extract character ngrams:
print ngrams(telugu[:10], 2)
print
print ngrams(hindi[:10], 3)
print
# Build the vocabulary of all character bigrams seen in training.
vocabulary = set(chain(*[ngrams(txt, 2) for txt, tag in training]))
# Featurize each document as a boolean vector over the bigram vocabulary.
feature_set = [({i: (i in ngrams(sentence, 2)) for i in vocabulary}, tag) for sentence, tag in training]
classifier = nbc.train(feature_set)
test1 = u'पूर्ण प्रत' # hindi
test2 = u'মহিষের সন্' # bangla
test3 = u'सनातनवा' # marathi
test4 = u'ఆడిట్ ' # telugu
for testdoc in [test1, test2, test3, test4]:
    featurized_test_sent = {i: (i in ngrams(testdoc, 2)) for i in vocabulary}
    print "test sent:", testdoc
    print "tag:", classifier.classify(featurized_test_sent)
    print
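One caveat: the training set above contains exactly one (very long) document per language, so the Naive Bayes model is fit on just four instances. A more realistic setup, sketched here as my own illustration (the chunk helper and 100-character size are hypothetical choices, not from the original), is to split each corpus into fixed-size pieces so every language contributes many labelled examples:
# Illustrative only: turn each language's corpus into many short documents.
def chunk(text, size=100):
    return [text[i:i+size] for i in range(0, len(text), size)]

training = [(c, 'hi') for c in chunk(hindi)] + \
           [(c, 'ba') for c in chunk(bangla)] + \
           [(c, 'ma') for c in chunk(marathi)] + \
           [(c, 'te') for c in chunk(telugu)]
# The vocabulary and feature extraction steps above then apply unchanged.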