I'm extracting data, but some special characters are causing an error.

from unicodedata import normalize


import codecs
import csv
import urllib2
import requests
from BeautifulSoup import BeautifulSoup

url = 'https://www.ratebeer.com/top'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('tbody')

list_of_rows = []


for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

outfile = open("./top50.csv", "wb")
writer = csv.writer(outfile)
writer.writerows(list_of_rows)

I'm trying to produce a CSV to import into Excel with the top 50 beers: rank, name, style, brewery, rating.

snakecharmerb
  • You should open the `outfile` in text mode rather than binary mode, and pass an appropriate `encoding` parameter. – Michael Butscher May 01 '19 at 01:18
  • Possible duplicate of [Read and Write CSV files including unicode with Python 2.7](https://stackoverflow.com/questions/17245415/read-and-write-csv-files-including-unicode-with-python-2-7) – snakecharmerb May 01 '19 at 07:03
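
The comment about binary mode can be illustrated with a small offline sketch. It assumes Python 3 (the question's code targets Python 2, where the failure shows up differently), and the sample row is hypothetical. In-memory buffers stand in for files opened with "wb" and "w":

```python
import csv
import io

# Hypothetical sample row containing a non-ASCII character.
row = ["1", "Närke Kaggen Stormaktsporter"]

# A binary buffer behaves like a file opened with "wb": the csv writer
# hands it str objects, which a binary stream rejects on Python 3.
binary_buffer = io.BytesIO()
try:
    csv.writer(binary_buffer).writerow(row)
    binary_error = None
except TypeError as exc:
    binary_error = exc

# A text buffer behaves like a file opened with "w"; the encoding is
# applied afterwards, as open(..., encoding="utf-8") would do on write.
text_buffer = io.StringIO(newline="")
csv.writer(text_buffer).writerow(row)
encoded = text_buffer.getvalue().encode("utf-8")

print(type(binary_error).__name__)
print(encoded.decode("utf-8"))
```

The text-mode path is the one the csv module expects on Python 3; the encoding only decides how the characters become bytes on disk.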

2 Answers

This works on Python 3.6, with the parser specified as features="lxml" and the output file opened with encoding='utf-8':

import csv
import requests
from bs4 import BeautifulSoup

url = 'https://www.ratebeer.com/top'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html, features="lxml")
table = soup.find('tbody')

list_of_rows = []

for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

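# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open("./top50.csv", "w", encoding='utf-8', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(list_of_rows)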
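
The write step above can be tried without hitting the site. This is a minimal sketch, assuming hard-coded sample rows (hypothetical, standing in for the scraped cells) and a temporary file; `write_rows` and `read_rows` are illustrative helper names, not part of any library:

```python
import csv
import os
import tempfile

# Hypothetical sample rows standing in for the scraped table cells.
sample_rows = [
    ["1", "Närke Kaggen Stormaktsporter", "Imperial Stout"],
    ["2", "Trappistes Rochefort 10", "Quadrupel"],
]

def write_rows(path, rows):
    """Write rows as CSV in text mode with an explicit UTF-8 encoding."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)

def read_rows(path):
    """Read the CSV back using the same encoding."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        return list(csv.reader(f))

csv_path = os.path.join(tempfile.mkdtemp(), "top50_sample.csv")
write_rows(csv_path, sample_rows)
round_tripped = read_rows(csv_path)

# True when the non-ASCII text survives the write/read round trip.
print(round_tripped == sample_rows)
```

If the round trip holds for the sample rows, the same `open(...)` arguments should carry over to the scraped `list_of_rows`.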
Mahmoud Elshahat

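Consider using pandas. When writing the CSV you can pass encoding='utf-8-sig', which handles the special characters; it also writes a UTF-8 byte order mark at the start of the file, which helps Excel detect the encoding.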

import pandas as pd
import requests

# Send a browser-like User-Agent header with the request
r = requests.get('https://www.ratebeer.com/top', headers={'User-Agent': 'Mozilla/5.0'})
table = pd.read_html(r.text)[0]  # first table found on the page
table.drop(['Unnamed: 5'], axis=1, inplace=True)
table.columns = ['Rank', 'Name', 'Count', 'Abv', 'Score']
table.to_csv(r"C:\Users\User\Desktop\Data.csv", sep=',', encoding='utf-8-sig', index=False)
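
To see what 'utf-8-sig' changes compared with plain 'utf-8', here is a small sketch using only the standard library rather than pandas. The rows are hypothetical and `encode_csv` is an illustrative helper name:

```python
import csv
import io

# Hypothetical rows; the second one contains a non-ASCII character.
rows = [["Rank", "Name"], ["1", "Köstritzer Schwarzbier"]]

def encode_csv(rows, encoding):
    """Render rows as CSV text, then encode the text with the given codec."""
    text_buffer = io.StringIO(newline="")
    csv.writer(text_buffer).writerows(rows)
    return text_buffer.getvalue().encode(encoding)

plain = encode_csv(rows, "utf-8")
with_bom = encode_csv(rows, "utf-8-sig")

# The only difference is the three-byte UTF-8 byte order mark in front.
print(with_bom[:3])
print(with_bom[3:] == plain)
```

The byte order mark is what lets Excel recognise the file as UTF-8 when it is opened directly; without it, Excel may fall back to a legacy code page and show garbled characters.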
QHarr