
I am having trouble understanding some Unicode text that I am scraping from web pages that users populate via web forms. In the end, I want to use NLTK to tokenize and process the scraped text, but unwanted characters are getting in the way, and I cannot figure out how to remove them.

I start by using selenium webdriver to fetch the web pages, extract text content, and print to a file:

from bs4 import BeautifulSoup as bs

driver.get(URL)
HTML = driver.page_source
soup = bs(HTML)
# Strip non-content elements before extracting the text
[s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
text = soup.getText()
outFile.write(text)
outFile.close()

(The s.extract() comprehension followed by soup.getText() was recommended by nmgeek in response to a different Stack Overflow posting.) The resulting files look good when I view them with 'cat <outFile>', but special characters show up as bullets, •, ® and ™, and these turn out to be problematic. (There doubtless are others.)
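One way to see what is actually in text, rather than how the terminal renders it, is to print the repr of a slice around the first suspicious character. This is just a diagnostic sketch; it assumes text already holds the scraped string:

# Print the true code points around the first non-ASCII character,
# so terminal rendering and copy-paste can't hide what's there
for i, ch in enumerate(text):
    if ord(ch) > 127:
        print(i, repr(text[max(i - 10, 0):i + 10]))
        break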

First, when I try to remove those bullets prior to writing to the original outFile, using

clean = re.sub(r'•', r'', text)
outFile.write(clean)

I have no success; the bullets remain.
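One possible reason (an assumption, based on the byte dumps further down): the bullets in text may not be U+2022 at all, but a mis-decoded multi-character sequence, so a pattern built by pasting the rendered glyph never matches. Writing the pattern with explicit escapes removes the paste ambiguity:

import re

# U+2022 is the real bullet; '\xe2\x80\xa2' is the three-character
# string produced when a UTF-8 bullet is mis-decoded as latin-1
# (assumed, based on the encode() output shown later)
clean = re.sub('\u2022|\u00ae|\u2122|\xe2\x80\xa2', '', text)
outFile.write(clean)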

Second, I read in the <outFile>s thus produced for post-processing with NLTK. Using

raw = open(textFile).read()
tokens = nltk.word_tokenize(raw)

the bullets show up in the raw string, and they remain as unwanted tokens after the word_tokenize() step. If I print the tokens, these code points(?) appear as 'ï¿½\x80ï¿½'.
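Note that open(textFile).read() decodes with the locale's default encoding, which may not match whatever encoding the file was written with. Specifying it explicitly removes one variable (utf-8 here is an assumption about how outFile was written):

import nltk

# Decode explicitly instead of relying on the locale default
raw = open(textFile, encoding='utf-8').read()
tokens = nltk.word_tokenize(raw)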

I've tried to remove the bullets using a comprehension of the form

words = [w.lower() for w in nltk.Text(tokens)
         if w.lower() not in ['•', '®', '™']]

but the special characters remain.
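Again, this may fail simply because the pasted glyphs are not the code points actually present in the tokens. Here is a sketch of the same filter using explicit escapes (the '\xe2\x80\xa2' entry assumes the mis-decoding interpretation of the byte dumps below):

# Spell out the unwanted characters as escapes rather than pasting them
bad = {'\u2022', '\u00ae', '\u2122', '\xe2\x80\xa2'}
words = [w.lower() for w in nltk.Text(tokens) if w.lower() not in bad]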

I doubt that conversion to bytes is the way to solve this, but perhaps there is something helpful in this information. (See the solution / work-around below, however.) When I encode these as utf-8 and latin-1, the results are:

In [8]: 'ï¿½\x80ï¿½'.encode('utf-8')
Out[8]: b'\xc3\xaf\xc2\xbf\xc2\xbd\xc2\x80\xc3\xaf\xc2\xbf\xc2\xbd'
In [9]: 'ï¿½\x80ï¿½'.encode('latin-1')
Out[9]: b'\xef\xbf\xbd\x80\xef\xbf\xbd'

And just pasting in the bullets found in the text files, I get these byte representations:

In [10]: 'â\x80¢'.encode('utf-8')
Out[10]: b'\xc3\xa2\xc2\x80\xc2\xa2'
In [11]: 'â\x80¢'.encode('latin-1')
Out[11]: b'\xe2\x80\xa2'
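Those last two dumps are exactly what classic mojibake looks like: UTF-8 bytes that were decoded as latin-1 somewhere along the way. A quick round trip demonstrates it, and shows that the damage is reversible:

# '•' encoded as UTF-8, then wrongly decoded as latin-1, becomes the
# three-character string '\xe2\x80\xa2' ('â', '\x80', '¢')
mangled = '\u2022'.encode('utf-8').decode('latin-1')
assert mangled.encode('utf-8') == b'\xc3\xa2\xc2\x80\xc2\xa2'

# Reversing the two steps recovers the original bullet
assert mangled.encode('latin-1').decode('utf-8') == '\u2022'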

Python's repr() function doesn't seem to provide any clarity:

In [35]: repr(tokens[303:309])
Out[35]: "['ï¿½\\x80ï¿½', 'Wealth', 'of', 'outdoor', 'recreational', 'activities']"

I've been through the Python 3 Unicode HOWTO a few times, but can't figure out how to make sense of this combined information.
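If the mojibake reading above is right, the third-party ftfy module is designed for exactly this kind of repair; a sketch (assuming pip install ftfy):

import ftfy

# fix_text detects and undoes common mis-decodings; it should turn
# the mangled sequence back into a bullet
print(ftfy.fix_text('\xe2\x80\xa2'))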

  • It might provide a bit of extra clarity if you post `repr(text)`, or at least the part of it which contains the problematic characters. This will tell us exactly what bytes or str you are dealing with. – unutbu Aug 05 '14 at 20:16
  • Example added near bottom of original posting. – user3897315 Aug 05 '14 at 21:03
  • That's really odd. I expected the `repr` to return a `str` of printable characters, not one with `�`s in it. – unutbu Aug 05 '14 at 21:49
  • Encoding issues are the usual problem with Python, especially when running on Windows. Two recommendations: 1) first check whether an encoding is mentioned in the webpage head under charset - that will tell you how to decode the contents; and 2) if you are sure about the encoding, always use decode - e.g. raw.decode('utf-8', 'ignore') - NB: not ENcode but DEcode. Finally, if you are sure it's utf-8 and want to keep those Unicode symbols legible, it's best to use the unidecode module to turn them into ASCII where possible. That helps avoid losing much when scraping non-English webpages. – Everst Aug 06 '14 at 00:45
  • @Everst in case this clarifies things, I am running under linux. I don't see encoding information on the original web pages. (But would this matter if users were, e.g., pasting content into a text box from their Word app?) I am getting "'str' object has no attribute 'decode'", which leads me to wonder if you are treating these as Python 2.x strings(?) ... I'll see if I can make something out of unidecode. – user3897315 Aug 06 '14 at 02:06
  • I see, makes sense. Following your comment, I was reading something on Python 3 strings here - http://www.pythoncentral.io/encoding-and-decoding-strings-in-python-3-x/ - and noticed that Unicode characters are not written well into a file - it says one had better do outFile.write(clean.encode()). Another thing that occurred to me is that NLTK itself may be doing some conversions internally in order to turn strings into lists, but I am not sure about the details of their Text class or tokenizer implementations. In 2.x str is ASCII by default, but it's different in 3 - maybe NLTK mixes things somewhere here?.. – Everst Aug 06 '14 at 03:00
  • OK, it looks like I have a solution/work-around, and it's not entirely obvious. Added to bottom of original posting. Thanks @Everst and unutbu for your suggestions, which encouraged me to keep looking for the solution. – user3897315 Aug 06 '14 at 18:57

1 Answer


After tokenizing the raw text as uglyTokens, I removed the offending characters using:

tokens = [t for t in uglyTokens if t.encode('utf-8') not in
          [b'\xc3\xa2\xc2\x80\xc2\xa2',
           b'\xc3\xa2\xc2\x80\xc2\x99']]

I found the appropriate byte forms by listing uglyTokens, identifying examples of the undesirables, and then using

In [21]: uglyTokens[2841].encode('utf-8')
Out[21]: b'\xc3\xa2\xc2\x80\xc2\xa2'

Note that this case matches

'â\x80¢'.encode('utf-8')

in the original posting.
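Since the encode('utf-8') step just round-trips each token through UTF-8, an equivalent filter (a sketch, under the same assumption about what the two byte sequences represent) compares the token strings directly:

# '\xe2\x80\xa2' is the mangled bullet, '\xe2\x80\x99' the mangled
# right single quote; comparing as str avoids the encode step
bad = {'\xe2\x80\xa2', '\xe2\x80\x99'}
tokens = [t for t in uglyTokens if t not in bad]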

user3897315