I am writing a program to scrape a Wikipedia table with Python. Everything works fine except that some characters don't seem to be encoded properly.
Here is the code:
import csv
import requests
from BeautifulSoup import BeautifulSoup
import sys
reload(sys)
sys.setdefaultencoding( "utf-8" )
url = 'https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_A'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'wikitable sortable'})
list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)
outfile = open("./scrapedata.csv", "wb")
writer = csv.writer(outfile)
print list_of_rows
writer.writerows(list_of_rows)
For example, Merzbrück is being encoded as Merzbrück.
The issue seems to be mostly with accented characters (é, è, ç, à, etc.). Is there a way I can avoid this?
Thanks in advance for your help.
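One possible explanation, offered as an assumption rather than a diagnosis: the CSV may already contain correct UTF-8 bytes, and the program used to view it (a spreadsheet application, for instance) may be decoding them as Latin-1 or a similar legacy encoding. The sketch below is written for Python 3, not the Python 2 used in the question. It shows how that kind of garbling can arise and how writing the file with an explicit `utf-8-sig` encoding (UTF-8 with a byte order mark) round-trips the text. The helper names `write_rows` and `read_rows` and the sample rows are hypothetical, made up for illustration.

```python
import csv
import os
import tempfile


def write_rows(path, rows, encoding="utf-8-sig"):
    """Write rows of text cells to a CSV file using an explicit encoding."""
    # newline="" lets the csv module manage line endings itself.
    with open(path, "w", newline="", encoding=encoding) as outfile:
        csv.writer(outfile).writerows(rows)


def read_rows(path, encoding="utf-8-sig"):
    """Read the CSV back using the same encoding it was written with."""
    with open(path, newline="", encoding=encoding) as infile:
        return list(csv.reader(infile))


# How the garbling can arise: UTF-8 bytes interpreted as Latin-1.
garbled = "Merzbrück".encode("utf-8").decode("latin-1")

# Illustrative rows only (not scraped data).
rows = [["Merzbrück"], ["é", "è", "ç", "à"]]

path = os.path.join(tempfile.mkdtemp(), "scrapedata.csv")
write_rows(path, rows)
round_trip = read_rows(path)
```

If `round_trip` matches `rows`, the data on disk is intact and any remaining garbling likely comes from how the file is opened afterwards. The byte order mark written by `utf-8-sig` is a hint that some spreadsheet programs use to detect UTF-8; whether a given viewer honors it is something to verify.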