
I am not trying to build a whole new Naive Bayes classifier. There are plenty already; for example, scikit-learn has a Naive Bayes implementation and NLTK has its own NaiveBayesClassifier.

I have 1000+ sentences for training and 300+ sentences for the test set in my language (an Indic language). All I need to do is pick one of the existing classifiers (a Naive Bayes implementation), train it and test its accuracy.

The problem is that the texts aren't in English; they are in Devanagari Unicode.

I am seeking suggestions on which classifier fits best, since the main issue I have had so far is with Unicode.

Chandan Gupta

3 Answers


The Naive Bayes classifiers in scikit-learn operate on numeric vectors, which we can get (for example) from a vectorizer. For text classification I often use TfidfVectorizer: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

The TfidfVectorizer constructor has the following parameter: encoding : string, 'utf-8' by default. If bytes or files are given to analyze, this encoding is used to decode.

You can use this parameter with your own encoding, and you can also specify your own preprocessor and analyzer functions (which can also be useful).
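
For illustration, here is a minimal sketch (not from this answer) of how TfidfVectorizer and MultinomialNB could be combined for the kind of task described in the question; the sentence lists, labels and n-gram settings below are hypothetical placeholders.

# -*- coding: utf-8 -*-
# Minimal sketch: the training/test lists and labels are placeholders, not real data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train_sents = [u'पूर्ण प्रत', u'सनातनवा']   # your 1000+ training sentences
train_labels = ['class_a', 'class_b']       # their labels
test_sents = [u'पूर्ण प्रत']                # your 300+ test sentences
test_labels = ['class_a']

# encoding='utf-8' (the default) handles byte input; analyzer='char_wb'
# builds character n-grams, so no language-specific tokenizer is needed.
vectorizer = TfidfVectorizer(encoding='utf-8', analyzer='char_wb', ngram_range=(1, 3))
X_train = vectorizer.fit_transform(train_sents)
X_test = vectorizer.transform(test_sents)

classifier = MultinomialNB()
classifier.fit(X_train, train_labels)
print accuracy_score(test_labels, classifier.predict(X_test))
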

Simplex

Preferably, I would use scikit-learn to build a language ID system, as I have done previously; see https://github.com/alvations/bayesline.

That being said, it is totally possible to build a language ID system using the simple classification modules from NLTK and Unicode data.

There is no need to do anything special to the NLTK code; it can be used as it is. (This might be useful to you for how to build a classifier in NLTK: nltk NaiveBayesClassifier training for sentiment analysis.)

Now, to show that it's totally possible to use NLTK out of the box for language ID with Unicode data, see below.

Firstly, for language ID there is a minor difference between using Unicode characters and byte strings in feature extraction:

from nltk.corpus import indian

# NLTK reads the corpus as byte strings.
hindi = " ".join(indian.words('hindi.pos'))
bangla = " ".join(indian.words('bangla.pos'))
marathi = " ".join(indian.words('marathi.pos'))
telugu = " ".join(indian.words('telugu.pos'))

# Prints out first 10 bytes (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print

# Decodes the UTF-8 byte strings to Unicode.
hindi = hindi.decode('utf8')
bangla = bangla.decode('utf8')
marathi = marathi.decode('utf8')
telugu = telugu.decode('utf8')

# Prints out the first 10 Unicode chars (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print

[out]:

hindi: पूर
bangla: মহি
marathi: '' सन
telugu: 4 . ఆడ

hindi: पूर्ण प्रत
bangla: মহিষের সন্
marathi: '' सनातनवा
telugu: 4 . ఆడిట్ 

Now that you see the difference between using byte strings and Unicode, let's train a classifier.

from itertools import chain
from nltk.util import ngrams
from nltk import NaiveBayesClassifier as nbc

# Allocate some sort of labels for the data.
training = [(hindi, 'hi'), (bangla, 'ba'), (marathi, 'ma'), (telugu, 'te')]
# This is how you can extract ngrams
print ngrams(telugu[:10], 2)
print
print ngrams(hindi[:10], 3)
print

vocabulary = set(chain(*[ngrams(txt, 2) for txt,tag in training]))

feature_set = [({i:(i in ngrams(sentence, 2)) for i in vocabulary},tag) for sentence, tag in training]

classifier = nbc.train(feature_set)

test1 = u'पूर्ण प्रत' # hindi
test2 = u'মহিষের সন্' # bangla
test3 = u'सनातनवा' # marathi
test4 = u'ఆడిట్ ' # telugu

for testdoc in [test1, test2, test3, test4]:
    featurized_test_sent =  {i:(i in ngrams(testdoc,2)) for i in vocabulary}
    print "test sent:", testdoc
    print "tag:", classifer.classify(featurized_test_sent)
    print

[out]:

[(u'4', u' '), (u' ', u'.'), (u'.', u' '), (u' ', u'\u0c06'), (u'\u0c06', u'\u0c21'), (u'\u0c21', u'\u0c3f'), (u'\u0c3f', u'\u0c1f'), (u'\u0c1f', u'\u0c4d'), (u'\u0c4d', u' ')]

[(u'\u092a', u'\u0942', u'\u0930'), (u'\u0942', u'\u0930', u'\u094d'), (u'\u0930', u'\u094d', u'\u0923'), (u'\u094d', u'\u0923', u' '), (u'\u0923', u' ', u'\u092a'), (u' ', u'\u092a', u'\u094d'), (u'\u092a', u'\u094d', u'\u0930'), (u'\u094d', u'\u0930', u'\u0924')]

test sent: पूर्ण प्रत
tag: hi

test sent: মহিষের সন্
tag: ba

test sent: सनातनवा
tag: ma

test sent: ఆడిట్ 
tag: te

Here's the full code:

# -*- coding: utf-8 -*-

from itertools import chain
from nltk.corpus import indian
from nltk.util import ngrams
from nltk import NaiveBayesClassifier as nbc


# NLTK reads the corpus as byte strings.
hindi = " ".join(indian.words('hindi.pos'))
bangla = " ".join(indian.words('bangla.pos'))
marathi = " ".join(indian.words('marathi.pos'))
telugu = " ".join(indian.words('telugu.pos'))

# Prints out first 10 bytes (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print

# Decodes the UTF-8 byte strings to Unicode.
hindi = hindi.decode('utf8')
bangla = bangla.decode('utf8')
marathi = marathi.decode('utf8')
telugu = telugu.decode('utf8')

# Prints out the first 10 Unicode chars (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print

# Allocate some sort of labels for the data.
training = [(hindi, 'hi'), (bangla, 'ba'), (marathi, 'ma'), (telugu, 'te')]
# This is how you can extract ngrams
print ngrams(telugu[:10], 2)
print
print ngrams(hindi[:10], 3)
print

vocabulary = set(chain(*[ngrams(txt, 2) for txt,tag in training]))

feature_set = [({i:(i in ngrams(sentence, 2)) for i in vocabulary},tag) for sentence, tag in training]

classifier = nbc.train(feature_set)

test1 = u'पूर्ण प्रत' # hindi
test2 = u'মহিষের সন্' # bangla
test3 = u'सनातनवा' # marathi
test4 = u'ఆడిట్ ' # telugu

for testdoc in [test1, test2, test3, test4]:
    featurized_test_sent =  {i:(i in ngrams(testdoc,2)) for i in vocabulary}
    print "test sent:", testdoc
    print "tag:", classifer.classify(featurized_test_sent)
    print
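
Since you mention 1000+ training and 300+ test sentences, here is a rough sketch (not part of the code above) of how the same featurization could be scored with nltk.classify.accuracy; `train_sents` and `test_sents` are hypothetical placeholders for your own labelled data, and the snippet reuses chain, ngrams and nbc imported in the full code above.

# Hypothetical labelled data: (sentence, label) pairs.
import nltk

train_sents = [(u'पूर्ण प्रत', 'hi'), (u'মহিষের সন্', 'ba')]  # your 1000+ labelled sentences
test_sents = [(u'सनातनवा', 'ma')]                             # your 300+ labelled sentences

vocabulary = set(chain(*[ngrams(sent, 2) for sent, tag in train_sents]))

def featurize(sent):
    # Binary features: which vocabulary bigrams occur in the sentence.
    return {i: (i in ngrams(sent, 2)) for i in vocabulary}

train_set = [(featurize(sent), tag) for sent, tag in train_sents]
test_set = [(featurize(sent), tag) for sent, tag in test_sents]

classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
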
alvas

The question is very poorly formulated, but it may be about language identification rather than sentence classification.

If this is the case, then there is a long way to go before you apply anything like Naive Bayes or other classifiers. Have a look at the character-gram approach used by Damir Cavar's LID, implemented in Python.
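
To make the character-gram idea concrete, here is a rough illustration (this is not Damir Cavar's LID code, just a sketch over hypothetical data) that scores languages by overlapping character trigram counts:

# -*- coding: utf-8 -*-
# Rough illustration of character-gram language ID (not Cavar's LID).
from collections import Counter

def char_ngrams(text, n=3):
    # text should be unicode so that n-grams are over characters, not bytes
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Hypothetical profiles: in practice, build these from large monolingual text.
profiles = {'hi': Counter(char_ngrams(u'पूर्ण प्रत')),
            'ba': Counter(char_ngrams(u'মহিষের সন্'))}

def identify(sentence):
    # Pick the language whose profile shares the most trigram mass with the input.
    grams = char_ngrams(sentence)
    return max(profiles, key=lambda lang: sum(profiles[lang][g] for g in grams))

print identify(u'पूर्ण प्रत')   # 'hi'
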

emiguevara
  • Yep, it's not about sentence classification; it's about language identification using an available machine learning module. – Chandan Gupta Jul 29 '14 at 13:29
  • OK, then my suggestion is to read this question: http://stackoverflow.com/questions/3182268/nltk-and-language-detection – emiguevara Jul 29 '14 at 14:24
  • Basically, you will do the identification somewhere else, not in NLTK or Scikit-learn. You can plug the various statistical models from these libraries into the decision function of any identification solution, after Unicode and the character-grams have been dealt with. – emiguevara Jul 29 '14 at 14:27
  • Will have a look at it and get back here. – Chandan Gupta Jul 30 '14 at 02:40