I have a collection of text in which each sentence is entirely in English, Hindi, or Marathi, with an id (0, 1, or 2 respectively) attached to each sentence to indicate its language.
Regardless of the language, the text may contain HTML tags, punctuation, etc.
I can clean the English sentences using the code below:
import HTMLParser
import re
from nltk.corpus import stopwords
from collections import Counter
import pickle
from string import punctuation
#creating html_parser object
html_parser = HTMLParser.HTMLParser()
cachedStopWords = set(stopwords.words("english"))
def cleanText(text, lang_id):
    if lang_id == 0:
        str1 = ''.join(text).decode('iso-8859-1')
    else:
        str1 = ''.join(text).encode('utf-8')
    str1 = html_parser.unescape(str1)
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', str1)
    #print "cleantext before puncts removed : " + cleantext
    clean_puncts = re.compile(r'[\s{}]+'.format(re.escape(punctuation)))
    cleantext = re.sub(clean_puncts, ' ', cleantext)
    #print " cleantext after puncts removed : " + cleantext
    cleanest = cleantext.lower()
    if lang_id == 0:
        cleanertext = ' '.join([word for word in cleanest.split() if word not in cachedStopWords])
        words = re.findall(r"[\w']+", cleanertext)
        words_final = [x.encode('UTF8') for x in words]
    else:
        words_final = cleanest.split()
    return words_final
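
For English text (lang_id 0) this works the way I expect; for example, with a made-up sentence:

print cleanText('<p>The history of India is rich.</p>', 0)
# gives something like ['history', 'india', 'rich']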
but for Hindi and Marathi text it gives me the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 104: ordinal not in range(128)
It also removes all the words.
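I think the failure comes from the .encode('utf-8') call: in Python 2, calling .encode() on a byte string first decodes it implicitly with the ascii codec, which fails on non-ASCII bytes. A minimal sketch of what I believe is happening (the variable name is just for illustration):

# -*- coding: utf-8 -*-
# Python 2: this literal is a plain byte string (str), not unicode
hindi_bytes = '<p>भारत का इतिहास काफी समृद्ध एवं विस्तृत है। </p>'

# str.encode() implicitly runs hindi_bytes.decode('ascii') first,
# which raises UnicodeDecodeError on the Devanagari bytes
hindi_bytes.encode('utf-8')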
The Hindi text looks like this:
<p>भारत का इतिहास काफी समृद्ध एवं विस्तृत है। </p>
How can I do the same for Hindi or Marathi text?
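Would the right direction be to decode everything to Unicode up front and stay in Unicode, instead of calling .encode()? Below is a rough sketch of what I had in mind, assuming the raw bytes are UTF-8; clean_indic is just a name I made up, and I am not sure how to also strip Devanagari punctuation such as the danda (।), which is not in string.punctuation:

# -*- coding: utf-8 -*-
import re
import HTMLParser
from string import punctuation

html_parser = HTMLParser.HTMLParser()

def clean_indic(text):
    # assuming the incoming bytes are UTF-8; decode once and stay in unicode
    utext = text.decode('utf-8') if isinstance(text, str) else text
    utext = html_parser.unescape(utext)
    utext = re.sub(u'<.*?>', u' ', utext)  # strip HTML tags
    # remove ASCII punctuation and collapse whitespace
    punct_re = u'[\\s{}]+'.format(re.escape(punctuation))
    utext = re.sub(punct_re, u' ', utext, flags=re.UNICODE)
    return utext.strip().split()

print clean_indic('<p>भारत का इतिहास काफी समृद्ध एवं विस्तृत है। </p>')

Is this heading in the right direction?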