So, every time I web scrape this webpage on oed.com, I get little apostrophes that appear to be unicode characters. How can I filter through my code and replace all those characters with a normal apostrophe? Below is the code I used to print my list of words (if you're not signed into the site, scraping multiple times will show repeated words).
import csv
import os
import re
import requests
import urllib2
year_start= 1550
year_end = 1560
subject_search = ['Law']
with open("/Applications/Python 3.5/Economic/OED_table.csv", 'a') as outputw, open("/Applications/Python 3.5/Economic/OED.html", 'a') as outputh: #opens the folder and 'a' adds the words to the csv file.
    for year in range(year_start, year_end + 1):
        path = '/Applications/Python 3.5/Economic'
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
        urllib2.install_opener(opener)
        user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
        header = {'User-Agent': user_agent}
        resultPath = os.path.join(path, 'OED_table.csv')
        htmlPath = os.path.join(path, 'OED.html')
        # request one page of dictionary-search results for this year and subject
        request = urllib2.Request('http://www.oed.com/search?browseType=sortAlpha&case-insensitive=true&dateFilter=' + str(year) + '&nearDistance=1&ordered=false&page=1&pageSize=100&scope=ENTRY&sort=entry&subjectClass=' + str(subject_search) + '&type=dictionarysearch', None, header)
        page = opener.open(request)
        urlpage = page.read()
        outputh.write(urlpage)
        # pull the headwords out of the results page
        new_words = re.findall(r'<span class="hwSect"><span class="hw">(.*?)</span>', urlpage)
        print new_words
        csv_writer = csv.writer(outputw)
        csv_writer.writerow([year] + new_words)
After this prints my words, I often get the byte pair \xcb\x88 in them. For instance, the word un'sentenced prints as 'un\xcb\x88sentenced'.
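If it helps, decoding that byte pair as UTF-8 in the interpreter gives a single character, U+02C8, which I believe is the stress mark the OED shows in its headwords:

>>> '\xcb\x88'.decode('utf-8')
u'\u02c8'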
How do I take all the instances of those characters and replace them with a normal apostrophe (')? I was thinking it'd be something like this:
for word in new_words:
    word = re.sub(r'[^\x00-\x7f]', "'", word)
But I'm stuck.
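The closest I've gotten is just swapping the raw UTF-8 bytes out directly. Here's a rough sketch of that idea, assuming the response really is UTF-8 and the words are plain Python 2 byte strings like the urllib2 code above produces (the word list here is a made-up stand-in for the real scraped one):

new_words = ['un\xcb\x88sentenced']  # example scraped word, hypothetical stand-in
cleaned = [w.replace('\xcb\x88', "'") for w in new_words]  # '\xcb\x88' is the UTF-8 encoding of U+02C8
print cleaned  # ["un'sentenced"]

Is that a reasonable way to handle it, or should I decode urlpage to unicode first and replace u'\u02c8' there instead?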