
I am trying to scrape and merge the contents of multiple tables, each on a separate webpage. I have just read a lot about encoding and Unicode, including the posts linked from those discussions, but I can't figure out whether I've missed something or whether there is a problem with the encoding on the webpage. On the first page, the Brand Name column for the date 10/31/2014 reads "Pear’s Gourmet", and a lot of other strings come out with funny apostrophes, like "Children’s Medical Ventures, LLC" (instead of "Children's..."). I can see the funny apostrophes (’) in IPython, but in the csv file they come out as ’.
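
If it helps narrow things down, the funny characters look like a UTF-8/Windows-1252 mix-up. This minimal snippet (my guess at the mechanism, not taken from the scrape itself) reproduces exactly the characters I see in the csv:

#U+2019 (right single quote) encoded as UTF-8 but decoded as
#Windows-1252 yields the three characters from the csv
s = u'\u2019'
print s.encode('utf-8').decode('cp1252')    #-> ’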

My questions are:

  1. Am I doing something wrong with the encoding that makes the apostrophes come out wrong?
  2. If not, how do I replace the wrong characters with an apostrophe?

I have tried to make the code below reproducible.

#Import libraries
import sys
print(sys.version_info)    #Python 2.7.11
#import IPython
#print(IPython.version_info)    #IPython 4.0.1
import pandas as pd
from bs4 import BeautifulSoup
#from lxml import html
import requests
import os
cwd = os.getcwd()

#Create the output dataframe and one list per table column
df = pd.DataFrame()
A, B, C, D, E, F = [], [], [], [], [], []

#Scrape the number of separate webpages that contain tables for a given year
pstr1 = "http://www.fda.gov/Safety/Recalls/ArchiveRecalls/"    
#for i in range(2006,2017):
for i in range(2014,2015):  
    a = ["/default.htm","/default.htm?Page="]
    pagename = pstr1 + str(i) + a[0]
    print pagename
    r = requests.get(pagename)
    r.raise_for_status()
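    #NB: r.text decodes the response bytes using r.encoding, so the
    #next line forces UTF-8 regardless of what the headers declared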
    #print(r.encoding)
    r.encoding = 'utf-8'
    page = BeautifulSoup(r.text, 'html.parser')
    nPages = page.select('.pagination-clean a')   #pagination links

    #Scrape the data from each table and combine it into a dataframe
    for j in range(len(nPages)):
        pagename = pstr1 + str(i) + a[1] + str(j+1)
        print pagename
        r = requests.get(pagename)
        r.encoding = 'utf-8'
        soup = BeautifulSoup(r.text, 'html.parser')
        T1=soup.find('table')

        for row in T1.findAll("tr"):
            cells = row.findAll('td')

            if len(cells) != 0:  #ignore the heading row
                A.append(cells[0].find(string=True))
                B.append(cells[1].find(string=True))
                C.append(cells[2].find(string=True))
                D.append(cells[3].find(string=True))
                E.append(cells[4].find(string=True))
                F.append(cells[5].find(string=True))

                #Examine the problematic characters
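                #NB: in Python 2, .decode() on a unicode object first
                #encodes it to ASCII, so this raises UnicodeError for
                #any non-ASCII string -- that is what flags the odd ones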
                try:
                    cells[1].find(string=True).decode('utf-8')
                    #print "string is UTF-8, length %d bytes" % len(cells[1].find(string=True))
                except UnicodeError:
                    print "string is not UTF-8"
                    #print(cells[1].find(string=True))

df=pd.DataFrame(A, columns=['Date'])
df['Brand_Name']=B
df['Product_Description']=C
df['Reason_Problem']=D
df['Company']=E
df['Details_Photo']=F
df.to_csv(cwd+'/Table1.csv', encoding='utf-8')
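
While debugging, a quick sanity check along these lines (the attributes are from the requests API; apparent_encoding is chardet's guess from the raw bytes) shows what encoding requests detected, and whether the csv survives a round-trip when read back with the same codec:

#What encoding does requests think the page uses?
r = requests.get(pstr1 + "2014/default.htm")
print r.encoding            #from the Content-Type header
print r.apparent_encoding   #chardet's guess from the body

#Does the csv round-trip? (it should, if written and read as UTF-8)
check = pd.read_csv(cwd + '/Table1.csv', encoding='utf-8')
print check['Brand_Name'].head()
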
  • Why are you calling decode on a unicode string? Also let requests and bs4 handle the encoding, `soup = BeautifulSoup(r.content)` – Padraic Cunningham Aug 23 '16 at 23:02
  • Why am I calling decode in the `try` statement? Yeah, it doesn't really make sense to call decode on a unicode string but it catches the strings that come out weird in `to_csv()`. When I don't explicitly set the encoding for requests, and use `soup = BeautifulSoup(r.content)` the same strings have messed up apostrophes in the .csv file. – greg44289 Aug 24 '16 at 12:19
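
For reference, the variant suggested in the comment (passing the raw bytes so bs4 detects the encoding itself) would look roughly like this:

r = requests.get(pagename)
#bs4 sniffs the encoding from the bytes (meta charset, BOM, chardet)
soup = BeautifulSoup(r.content, 'html.parser')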

0 Answers