I have a collection of text in which each sentence is entirely in English, Hindi, or Marathi, with an id (0, 1, or 2 respectively) attached to each sentence to indicate its language.
Regardless of the language, the text may contain HTML tags, punctuation, etc.
I can clean the English sentences using the code below:
import HTMLParser
import re
from nltk.corpus import stopwords
from collections import Counter
import pickle
from string import punctuation
#creating html_parser object
html_parser = HTMLParser.HTMLParser()
cachedStopWords = set(stopwords.words("english"))
def cleanText(text, lang_id):
    if lang_id == 0:
        str1 = ''.join(text).decode('iso-8859-1')
    else:
        str1 = ''.join(text).encode('utf-8')
    str1 = html_parser.unescape(str1)
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', str1)
    #print "cleantext before puncts removed : " + cleantext
    clean_puncts = re.compile(r'[\s{}]+'.format(re.escape(punctuation)))
    cleantext = re.sub(clean_puncts, ' ', cleantext)
    #print " cleantext after puncts removed : " + cleantext
    cleanest = cleantext.lower()
    if lang_id == 0:
        cleanertext = ' '.join([word for word in cleanest.split() if word not in cachedStopWords])
        words = re.findall(r"[\w']+", cleanertext)
        words_final = [x.encode('UTF8') for x in words]
    else:
        words_final = cleanest.split()
    return words_final
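
For English text (lang_id 0) this works the way I expect; for example, with a made-up sentence:

print cleanText('<p>The history of India is rich.</p>', 0)
# gives something like ['history', 'india', 'rich']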
but for Hindi and Marathi text it gives me the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 104: ordinal not in range(128)
It also removes all the words.
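I think the failure comes from the .encode('utf-8') call: in Python 2, calling .encode() on a byte string first decodes it implicitly with the ascii codec, which fails on non-ASCII bytes. A minimal sketch of what I believe is happening (the variable name is just for illustration):

# -*- coding: utf-8 -*-
# Python 2: this literal is a plain byte string (str), not unicode
hindi_bytes = '<p>भारत का इतिहास काफी समृद्ध एवं विस्तृत है। </p>'

# str.encode() implicitly runs hindi_bytes.decode('ascii') first,
# which raises UnicodeDecodeError on the Devanagari bytes
hindi_bytes.encode('utf-8')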
The Hindi text looks like this:
<p>भारत का इतिहास काफी समृद्ध एवं विस्तृत है। </p>
How can I do the same for Hindi or Marathi text?
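Would the right direction be to decode everything to Unicode up front and stay in Unicode, instead of calling .encode()? Below is a rough sketch of what I had in mind, assuming the raw bytes are UTF-8; clean_indic is just a name I made up, and I am not sure how to also strip Devanagari punctuation such as the danda (।), which is not in string.punctuation:

# -*- coding: utf-8 -*-
import re
import HTMLParser
from string import punctuation

html_parser = HTMLParser.HTMLParser()

def clean_indic(text):
    # assuming the incoming bytes are UTF-8; decode once and stay in unicode
    utext = text.decode('utf-8') if isinstance(text, str) else text
    utext = html_parser.unescape(utext)
    utext = re.sub(u'<.*?>', u' ', utext)  # strip HTML tags
    # remove ASCII punctuation and collapse whitespace
    punct_re = u'[\\s{}]+'.format(re.escape(punctuation))
    utext = re.sub(punct_re, u' ', utext, flags=re.UNICODE)
    return utext.strip().split()

print clean_indic('<p>भारत का इतिहास काफी समृद्ध एवं विस्तृत है। </p>')

Is this heading in the right direction?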