I am writing a program to scrape a Wikipedia table with Python. Everything works fine except that some characters don't seem to be encoded properly.

Here is the code:

import csv
import requests
from BeautifulSoup import BeautifulSoup
import sys

reload(sys)
sys.setdefaultencoding( "utf-8" )

url = 'https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_A'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'wikitable sortable'})

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

outfile = open("./scrapedata.csv", "wb")
writer = csv.writer(outfile)
print list_of_rows
writer.writerows(list_of_rows)

For example, Merzbrück comes out with a garbled ü. The issue seems to be mostly with accented characters (é, è, ç, à, etc.). Is there a way I can avoid this? Thanks in advance for your help.

Tauseef Hussain

1 Answer

This is of course an encoding issue. The question is where it is. My suggestion is that you work through each step and look at the raw data to see if you can find out where exactly the encoding issue is.

So, for example, print response.content to see if the symbols are as you expect in the requests object. If so, move on, and check out soup.prettify() to see if the BeautifulSoup object looks ok, then list_of_rows, etc.
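
The stage-by-stage check can be sketched as follows. This is a minimal illustration in Python 3, not the asker's Python 2 setup: the sample bytes are built by hand rather than fetched from Wikipedia, and `describe` is a hypothetical helper. It prints an ASCII-escaped view of each value, so what you see does not depend on how the terminal itself handles encoding.

```python
# Hand-built sample standing in for response.content (an assumption, not fetched data).
# "\u00fc" is the letter u with an umlaut, as in Merzbrück.
raw = "Merzbr\u00fcck".encode("utf-8")


def describe(stage, value):
    """Return a one-line summary: stage name, type, and ASCII-escaped value."""
    return "{}: {} {!a}".format(stage, type(value).__name__, value)


# Decoding with the right codec recovers the original text.
decoded_ok = raw.decode("utf-8")
# Decoding the same bytes with the wrong codec yields two characters per accent.
decoded_wrong = raw.decode("latin-1")

print(describe("raw bytes", raw))
print(describe("decoded as utf-8", decoded_ok))
print(describe("decoded as latin-1", decoded_wrong))
```

If the value at some stage has two characters where one accented letter should be, the bytes were most likely decoded with the wrong codec at or before that stage.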

All that said, I suspect the issue is in writing to CSV: the Python 2 csv module does not support Unicode input directly, and its documentation includes an example of how to write Unicode to CSV. This answer might also help you with the problem.
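
For comparison, here is a rough sketch of writing Unicode rows with the csv module in Python 3, where the file object handles the encoding. It assumes Python 3 rather than the asker's Python 2; `write_rows_utf8` is a hypothetical helper, and the sample row and temporary path are made up for illustration.

```python
import csv
import os
import tempfile


def write_rows_utf8(path, rows):
    """Write rows of text cells to a CSV file encoded as UTF-8."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as outfile:
        csv.writer(outfile).writerows(rows)


# Illustrative row only; "\u00fc" is the letter u with an umlaut.
rows = [["AAH", "Merzbr\u00fcck"]]
path = os.path.join(tempfile.mkdtemp(), "scrapedata.csv")
write_rows_utf8(path, rows)

# Read the file back as bytes to confirm what was actually written.
with open(path, "rb") as f:
    data = f.read()
print(ascii(data))
```

If the bytes on disk are correct but a spreadsheet program still shows garbled characters, the viewer may be assuming a different encoding when it opens the file.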


For what it's worth, I was able to write the proper symbols to CSV using the pandas library (I'm using Python 3, so your experience or syntax may be a little different, since it looks like you are using Python 2):

import pandas as pd

df = pd.DataFrame(list_of_rows)
df.to_csv('scrapedata.csv', encoding='utf-8')
dagrha