
Every time I scrape this page on oed.com, I get little apostrophes that appear to be Unicode characters. How can I go through my results in code and replace all of those characters with a normal apostrophe? Below is the code I use to print my list of words (if you're not signed in to the site, scraping multiple times will show repeated words).

import csv
import os
import re
import requests
import urllib2

year_start= 1550
year_end = 1560
subject_search = ['Law']

with open("/Applications/Python 3.5/Economic/OED_table.csv", 'a') as outputw, open("/Applications/Python 3.5/Economic/OED.html", 'a') as outputh:  #opens the folder and 'a' adds the words to the csv file.    
for year in range(year_start, year_end +1): 
    path = '/Applications/Python 3.5/Economic'
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    urllib2.install_opener(opener)

    user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    header = {'User-Agent':user_agent}

    resultPath = os.path.join(path, 'OED_table.csv')
    htmlPath = os.path.join(path, 'OED.html')
    request = urllib2.Request('http://www.oed.com/search?browseType=sortAlpha&case-insensitive=true&dateFilter='+ str(year)+ '&nearDistance=1&ordered=false&page=1&pageSize=100&scope=ENTRY&sort=entry&subjectClass='+ str(subject_search)+ '&type=dictionarysearch', None, header)
    page = opener.open(request)

    urlpage = page.read()
    outputh.write(urlpage)

    new_words = re.findall(r'<span class=\"hwSect\"><span class=\"hw\">(.*?)</span>', urlpage)
    print new_words
    csv_writer = csv.writer(outputw)
    if csv_writer.writerow([year] + new_words): 
        csv_writer.writerow([year, word])

After this prints my words, I often get the unicode letters \xcb\x88. For instance, the word un'sentenced prints as 'un\xcb\x88sentenced'.

How do I take all the instances of those unicode letters and replace them with a plain apostrophe (')? I was thinking it'd be something like this:

for word in new_words:
    word = re.sub('[\x00-\x7f]','', word)

But I'm stuck.

Kainesplain
  • Do you want to remove these characters or interpret them correctly as unicode? Have a look at https://docs.python.org/2/howto/unicode.html. If it's possible for you, I'd recommend switching to Python 3 which is much better at handling unicode. – amyrit Nov 27 '16 at 21:56
  • @amyrit, I want to basically remove the characters and replace them with the plain keyboard apostrophe character ('). – Kainesplain Nov 27 '16 at 22:06
  • did you try `word.replace('\xcb\x88', "'")`? That only solves part of your problem though; I would advise handling unicode properly. If this is real-life coding, you can't avoid it. – amyrit Nov 27 '16 at 22:15
  • @amyrit, Ah, okay. I'm reading through the doc now. I'm an econ major not a comp. sci. student, so a lot is over my head with this stuff in general lol. – Kainesplain Nov 27 '16 at 22:21
  • that's ok, unicode is a nightmare for many people. If you can, just use Python 3. – amyrit Nov 27 '16 at 22:23
  • A nitpick about terminology: _every_ character is a Unicode character. It looks like what you want to eliminate is all the non-ASCII characters. But the ASCII range (from U+0000 to U+007F) is still part of Unicode. I'd use either a regex or something like [this](http://stackoverflow.com/questions/555705/character-translation-using-python-like-the-tr-command). – Grant McLean Nov 28 '16 at 21:35

1 Answer


About this: "After this prints my words, I often get the unicode letters \xcb\x88. For instance, the word un'sentenced prints as 'un\xcb\x88sentenced'."

problem 1: \xcb\x88 is NOT unicode letters (plural). It is ONE character, U+02C8 MODIFIER LETTER VERTICAL LINE, encoded in UTF-8. The Unicode standard hints that it modifies the following character.
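
For example, a quick check in a Python 2 interpreter (Python 2, to match the question's urllib2 code) shows that the two bytes decode to a single code point:

>>> 'un\xcb\x88sentenced'.decode('utf-8')
u'un\u02c8sentenced'
>>> import unicodedata
>>> unicodedata.name(u'\u02c8')
'MODIFIER LETTER VERTICAL LINE'
>>> len('\xcb\x88'), len('\xcb\x88'.decode('utf-8'))
(2, 1)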

problem 2: un'sentenced is not a word.

You need to ascertain what this gadget means in the original data. My guess is that it is NOT meant to be any kind of apostrophe. So you probably need to delete it.
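
A minimal sketch of that approach, assuming the page bytes are UTF-8 (as they appear to be) and reusing the urlpage variable from the question's code: decode first, then remove or replace the character by its code point rather than by guessing at byte patterns.

# decode the raw bytes once, then work with unicode
text = urlpage.decode('utf-8')

# drop the stress mark entirely
cleaned = text.replace(u'\u02c8', u'')

# or, if it really should become an apostrophe:
# cleaned = text.replace(u'\u02c8', u"'")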

Highly recommended: don't delete every non-ASCII character that you encounter. Also: read your file, decode the whole file from UTF-8 to unicode, process the unicode, and finally encode your output data ... don't attempt to process UTF-8 bytes.
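
A rough sketch of that workflow applied to the question's extraction step. This is only a sketch: it reuses the names urlpage, year and outputw from the question, and it writes the row by hand because Python 2's csv module does not handle unicode input directly.

# bytes -> unicode once, right after reading the response
html = urlpage.decode('utf-8')

# process unicode: find the headwords, then strip the U+02C8 stress marks
new_words = re.findall(u'<span class="hwSect"><span class="hw">(.*?)</span>', html)
new_words = [w.replace(u'\u02c8', u'') for w in new_words]

# unicode -> bytes only when writing out
row = u','.join([unicode(year)] + new_words)
outputw.write(row.encode('utf-8') + '\n')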

John Machin