Personally, I would use scikit-learn to build a language ID system, as I have previously done (see https://github.com/alvations/bayesline). That said, it is totally possible to build a language ID system using the simple classification modules from NLTK and unicode data.
There is no need to do anything special to the NLTK code; it can be used as-is. (This might be useful to you for how to build a classifier in NLTK: nltk NaiveBayesClassifier training for sentiment analysis.)
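For orientation, here is a minimal toy sketch (my own example, not from that question) of the train/classify cycle of NLTK's NaiveBayesClassifier with boolean feature dictionaries, which is the same pattern used for language ID below:
from nltk import NaiveBayesClassifier
# Toy boolean features: does the document mention 'python'?
train = [({'contains(python)': True}, 'tech'),
         ({'contains(python)': False}, 'other')]
toy_classifier = NaiveBayesClassifier.train(train)
print toy_classifier.classify({'contains(python)': True})  # 'tech'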
Now, to show that it's totally possible to use NLTK out of the box for language ID with unicode data, see below.
First, note that for language ID there is a minor difference between extracting features from unicode characters and from byte strings:
from nltk.corpus import indian
# NLTK reads the corpus as byte strings.
hindi = " ".join(indian.words('hindi.pos'))
bangla = " ".join(indian.words('bangla.pos'))
marathi = " ".join(indian.words('marathi.pos'))
telugu = " ".join(indian.words('telugu.pos'))
# Prints out first 10 bytes (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print
# Decodes the UTF-8 byte strings into unicode.
hindi = hindi.decode('utf8')
bangla = bangla.decode('utf8')
marathi = marathi.decode('utf8')
telugu = telugu.decode('utf8')
# Prints out first 10 unicode characters (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print
[out]:
hindi: पूर
bangla: মহি
marathi: '' सन
telugu: 4 . ఆడ
hindi: पूर्ण प्रत
bangla: মহিষের সন্
marathi: '' सनातनवा
telugu: 4 . ఆడిట్
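To make the byte/character distinction concrete, here is a quick aside (my own illustration, not part of the original snippet): each Devanagari codepoint takes 3 bytes in UTF-8, which is why the 10-byte slices above cover only about three characters.
# -*- coding: utf-8 -*-
# Illustrative only: byte length vs. character length in Python 2.
s = 'पूर्ण'           # UTF-8 encoded byte string (5 codepoints x 3 bytes)
u = s.decode('utf8')  # the same text as a unicode string
print len(s)  # 15 (bytes)
print len(u)  # 5 (characters)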
Now that you see the difference between using byte strings and unicode, let's train a classifier.
from itertools import chain
from nltk.util import ngrams
from nltk import NaiveBayesClassifier as nbc
# Attach a language label to each document.
training = [(hindi, 'hi'), (bangla, 'ba'), (marathi, 'ma'), (telugu, 'te')]
# This is how you can extract character ngrams:
print ngrams(telugu[:10], 2)
print
print ngrams(hindi[:10], 3)
print
# Build the vocabulary of all character bigrams seen in training.
vocabulary = set(chain(*[ngrams(txt, 2) for txt, tag in training]))
# Featurize each document as a boolean vector over the bigram vocabulary.
feature_set = [({i: (i in ngrams(sentence, 2)) for i in vocabulary}, tag) for sentence, tag in training]
classifier = nbc.train(feature_set)
test1 = u'पूर्ण प्रत' # hindi
test2 = u'মহিষের সন্' # bangla
test3 = u'सनातनवा' # marathi
test4 = u'ఆడిట్ ' # telugu
for testdoc in [test1, test2, test3, test4]:
    featurized_test_sent = {i: (i in ngrams(testdoc, 2)) for i in vocabulary}
    print "test sent:", testdoc
    print "tag:", classifier.classify(featurized_test_sent)
    print
[out]:
[(u'4', u' '), (u' ', u'.'), (u'.', u' '), (u' ', u'\u0c06'), (u'\u0c06', u'\u0c21'), (u'\u0c21', u'\u0c3f'), (u'\u0c3f', u'\u0c1f'), (u'\u0c1f', u'\u0c4d'), (u'\u0c4d', u' ')]
[(u'\u092a', u'\u0942', u'\u0930'), (u'\u0942', u'\u0930', u'\u094d'), (u'\u0930', u'\u094d', u'\u0923'), (u'\u094d', u'\u0923', u' '), (u'\u0923', u' ', u'\u092a'), (u' ', u'\u092a', u'\u094d'), (u'\u092a', u'\u094d', u'\u0930'), (u'\u094d', u'\u0930', u'\u0924')]
test sent: पूर्ण प्रत
tag: hi
test sent: মহিষের সন্
tag: ba
test sent: सनातनवा
tag: ma
test sent: ఆడిట్
tag: te
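As a side note (my own addition, not output from the answer above), NLTK's NaiveBayesClassifier also lets you inspect what it learned, e.g. using the classifier and featurized_test_sent variables from the snippet above:
# Show the 5 character bigrams that most strongly separate the languages.
classifier.show_most_informative_features(5)
# Get per-language probabilities instead of a single hard label.
dist = classifier.prob_classify(featurized_test_sent)
for label in dist.samples():
    print label, dist.prob(label)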
Here's the full code:
# -*- coding: utf-8 -*-
from itertools import chain
from nltk.corpus import indian
from nltk.util import ngrams
from nltk import NaiveBayesClassifier as nbc
# NLTK reads the corpus as byte strings.
hindi = " ".join(indian.words('hindi.pos'))
bangla = " ".join(indian.words('bangla.pos'))
marathi = " ".join(indian.words('marathi.pos'))
telugu = " ".join(indian.words('telugu.pos'))
# Prints out first 10 bytes (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print
# Decodes the UTF-8 byte strings into unicode.
hindi = hindi.decode('utf8')
bangla = bangla.decode('utf8')
marathi = marathi.decode('utf8')
telugu = telugu.decode('utf8')
# Prints out first 10 unicode characters (including spaces).
print 'hindi:', hindi[:10]
print 'bangla:', bangla[:10]
print 'marathi:', marathi[:10]
print 'telugu:', telugu[:10]
print
# Attach a language label to each document.
training = [(hindi, 'hi'), (bangla, 'ba'), (marathi, 'ma'), (telugu, 'te')]
# This is how you can extract character ngrams:
print ngrams(telugu[:10], 2)
print
print ngrams(hindi[:10], 3)
print
# Build the vocabulary of all character bigrams seen in training.
vocabulary = set(chain(*[ngrams(txt, 2) for txt, tag in training]))
# Featurize each document as a boolean vector over the bigram vocabulary.
feature_set = [({i: (i in ngrams(sentence, 2)) for i in vocabulary}, tag) for sentence, tag in training]
classifier = nbc.train(feature_set)
test1 = u'पूर्ण प्रत' # hindi
test2 = u'মহিষের সন্' # bangla
test3 = u'सनातनवा' # marathi
test4 = u'ఆడిట్ ' # telugu
for testdoc in [test1, test2, test3, test4]:
    featurized_test_sent = {i: (i in ngrams(testdoc, 2)) for i in vocabulary}
    print "test sent:", testdoc
    print "tag:", classifier.classify(featurized_test_sent)
    print
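One caveat: the training set above contains exactly one (very long) document per language, so the Naive Bayes model is fit on just four instances. A more realistic setup, sketched here as my own illustration (the chunk helper and 100-character size are hypothetical choices, not from the original), is to split each corpus into fixed-size pieces so every language contributes many labelled examples:
# Illustrative only: turn each language's corpus into many short documents.
def chunk(text, size=100):
    return [text[i:i+size] for i in range(0, len(text), size)]

training = [(c, 'hi') for c in chunk(hindi)] + \
           [(c, 'ba') for c in chunk(bangla)] + \
           [(c, 'ma') for c in chunk(marathi)] + \
           [(c, 'te') for c in chunk(telugu)]
# The vocabulary and feature extraction steps above then apply unchanged.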