I am having trouble understanding some Unicode characters in text that I scrape from web pages that users populate via web forms. Ultimately I want to use NLTK to tokenize and process the scraped text, but unwanted characters are getting in the way, and I cannot figure out how to remove them.
I start by using Selenium WebDriver to fetch the web pages, extract the text content, and write it to a file:
driver.get(URL)
HTML = driver.page_source
soup = bs(HTML)
# Drop non-content elements before extracting the visible text.
[s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
text = soup.getText()
outFile.write(text)
outFile.close()
(The s.extract() comprehension followed by soup.getText() was recommended by nmgeek in response to a different Stack Overflow question.) The resulting files look good when I view them with 'cat <outFile>', but they show special characters such as bullets (•), ® and ™, which turn out to be problematic. (There are doubtless others.)
First, when I try to remove those bullets before writing to the original outFile, using
clean = re.sub(r'•', r'', text)
outFile.write(clean)
I have no success; the bullets remain.
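For comparison, here is a minimal sketch of the same substitution written with explicit Unicode escapes instead of pasted characters. It assumes the text has already been decoded correctly, which may not hold for my files; the sample string and the name UNWANTED are made up for illustration.

```python
import re

# Hypothetical sample standing in for the scraped text.
text = "\u2022 Wealth of outdoor recreational activities\u00ae \u2122"

# Character class built from escapes:
# U+2022 bullet, U+00AE registered sign, U+2122 trade mark sign.
UNWANTED = re.compile("[\u2022\u00ae\u2122]")

# Remove the unwanted characters, then trim leftover whitespace.
clean = UNWANTED.sub("", text).strip()
print(clean)
```

If this pattern matches nothing in the real data, that would suggest the characters in the string are not the code points I think they are.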
Second, I read in the <outFile>s thus produced for post-processing with NLTK. Using
raw = open(textFile).read()
tokens = nltk.word_tokenize(raw)
the bullets show up in the raw string, and remain as unwanted tokens following the word_tokenize() step. If I print the tokens, these code points(?) appear as '�\x80�'.
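Neither my write step nor my read step names an encoding, so both rely on the platform default. The sketch below shows the round trip with the encoding spelled out on both sides. The file name is hypothetical, and str.split() is only a stand-in for nltk.word_tokenize() so that the example runs without NLTK.

```python
import os
import tempfile

# Hypothetical sample standing in for the scraped text.
text = "\u2022 Wealth of outdoor recreational activities"

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "page.txt")

# Write and read with the same explicit encoding.
with open(path, "w", encoding="utf-8") as out_file:
    out_file.write(text)

with open(path, encoding="utf-8") as in_file:
    raw = in_file.read()

tokens = raw.split()  # stand-in for nltk.word_tokenize(raw)
print(ascii(tokens[0]))
```

With matching encodings, the first token should come back as the single bullet character rather than a multi-character sequence.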
I've tried to remove the bullets using a comprehension of the form
words = [w.lower() for w in nltk.Text(tokens)
         if w.lower() not in ['•', '®', '™']]
but the special characters remain.
I doubt that conversion to bytes is the way to solve this, but perhaps there is something helpful in this information. (See Solution / Work-around below, however.) When I encode the token as UTF-8 and as Latin-1, the results are:
In [8]: '�\x80�'.encode('utf-8')
Out[8]: b'\xc3\xaf\xc2\xbf\xc2\xbd\xc2\x80\xc3\xaf\xc2\xbf\xc2\xbd'
In [9]: '�\x80�'.encode('latin-1')
Out[9]: b'\xef\xbf\xbd\x80\xef\xbf\xbd'
When I paste in a bullet copied from the text files, I get these byte representations:
In [10]: '•'.encode('utf-8')
Out[10]: b'\xc3\xa2\xc2\x80\xc2\xa2'
In [11]: '•'.encode('latin-1')
Out[11]: b'\xe2\x80\xa2'
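One observation, offered as a sketch rather than a diagnosis: a real bullet (U+2022) encodes to exactly three bytes in UTF-8, and those are the bytes in Out[11]. The code below reproduces both outputs by taking the UTF-8 bytes of a bullet and decoding them as Latin-1. That is consistent with a UTF-8/Latin-1 mix-up somewhere in my pipeline, but it does not show where it happens.

```python
bullet = "\u2022"

# UTF-8 encoding of a genuine bullet: three bytes.
utf8_bytes = bullet.encode("utf-8")
print(utf8_bytes)

# Decoding those bytes as Latin-1 yields three separate characters.
misread = utf8_bytes.decode("latin-1")
print(ascii(misread))

# These match Out[10] and Out[11] above.
print(misread.encode("utf-8"))
print(misread.encode("latin-1"))

# Reversing the mistaken decode recovers the original bullet.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == bullet)
```

The last step only works when every character in the string fits in Latin-1, so it is not a general repair.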
Python's repr() function doesn't seem to provide any clarity:
In [35]: repr(tokens[303:309])
Out[35]: "['�\\x80�', 'Wealth', 'of', 'outdoor', 'recreational', 'activities']"
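Since repr() prints non-ASCII characters as-is, a sketch like the following might show more: ascii() escapes every non-ASCII character, and unicodedata.name() gives each code point's name. The helper describe() and the sample token are hypothetical; the token is the three-character string implied by the Latin-1 bytes in Out[11].

```python
import unicodedata


def describe(s):
    """Return (code point, Unicode name) pairs for each character in s."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in s
    ]


# Hypothetical token: the characters implied by the bytes in Out[11].
token = "\u00e2\u0080\u00a2"

print(ascii(token))
for code, name in describe(token):
    print(code, name)

# A genuine bullet, for comparison.
print(describe("\u2022"))
```

Control characters such as U+0080 have no Unicode name, hence the "<unnamed>" fallback.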
I've been through the Python 3 Unicode Howto a few times, but can't figure out how to make sense of this combined information.