
I'm writing the CSV table produced by my scraping process like this:

import csv

with open("Raw_table.csv", 'w', encoding="utf-8") as outfile:
    csv_writer = csv.writer(outfile, delimiter=';', quotechar='|', quoting=csv.QUOTE_MINIMAL)

Usually, when I want to use the resulting files, I read them with a small CSV parser like this:

def parse_csv(content, delimiter=';'):
    csv_data = []
    for line in content.split('\n'):
        # split each line on the delimiter and strip surrounding whitespace
        csv_data.append([x.strip() for x in line.split(delimiter)])
    return csv_data


with open('Raw_RC.csv', 'r', encoding="utf-8") as infile:
    list_raw = parse_csv(infile.read())

My parser works when I'm scraping US or UK websites. Here I have to deal with French, Spanish and German content, and I get the following error when trying to read from the CSV with parse_csv:

    csv_writer.writerow([k] + v)
'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)

How can I fix this?

Subsidiary questions:

  1. Should I encode the CSV differently, or scrape the site another way (e.g. configure BeautifulSoup differently), when the content is German or French?
  2. Could this encoding problem be related to all the \xa0 I get from scraping? I don't think so, because I'm able to parse the UK and US CSVs even though they are also full of them.

Every byte of your time you take to solve this is appreciated! :)

BoobaGump
    possible duplicate of [UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)](http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20) – tripleee Aug 10 '15 at 12:48

1 Answer


When working with French, German or Spanish characters (i.e. a website written in one of those languages), don't use encoding='utf-8'; use encoding='ISO-8859-1' instead.

So, writing:

with open("Raw_table.csv", 'w', encoding="ISO-8859-1") as outfile:
    csv_writer = csv.writer(outfile, delimiter=';', quotechar='|', quoting=csv.QUOTE_MINIMAL)

And reading:

with open('Raw_RC.csv', 'r', encoding="ISO-8859-1") as infile:
    list_raw = parse_csv(infile.read())
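To check that the write and read sides agree, a round trip along the following lines may help. This is only a sketch: the helper name round_trip, the temporary file and the sample rows are made up for illustration, and csv.reader is swapped in for parse_csv so that the '|' quote character is honoured. It assumes that what matters is passing the same encoding on both sides.

```python
import csv
import os
import tempfile


def round_trip(rows, encoding):
    """Write rows to a temporary CSV file and read them back with the same encoding."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        # newline="" lets the csv module handle line endings itself
        with open(path, "w", encoding=encoding, newline="") as outfile:
            writer = csv.writer(outfile, delimiter=";", quotechar="|",
                                quoting=csv.QUOTE_MINIMAL)
            writer.writerows(rows)
        with open(path, "r", encoding=encoding, newline="") as infile:
            reader = csv.reader(infile, delimiter=";", quotechar="|")
            return [row for row in reader]
    finally:
        os.remove(path)


# Hypothetical sample data with accented characters and an embedded delimiter
sample_rows = [["ville", "prix"], ["Orléans", "12;50"], ["München", "8"]]

latin1_rows = round_trip(sample_rows, "ISO-8859-1")
utf8_rows = round_trip(sample_rows, "utf-8")
```

If the rows read back differ from the rows written, the two open() calls most likely disagree on the encoding, or one of them is falling back to a platform default.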

The \xa0 problem is not related: in my case it only showed up with UTF-8, so it had nothing to do with the French/German typography. To go further on this matter (which wasn't the core of the question), please see the link suggested by tripleee in the comments.
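For context, the bytes mentioned in the error message and in the \xa0 question can be inspected directly. The snippet below is a small illustration rather than part of the fix; the sample string is invented. It shows how an accented letter and a non-breaking space are encoded in UTF-8 and in ISO-8859-1, and what happens when UTF-8 bytes are decoded as ASCII.

```python
# 'café', a non-breaking space (\xa0), then 'ouvert' (made-up sample text)
text = "caf\u00e9\u00a0ouvert"

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("ISO-8859-1")

# In UTF-8, 'é' becomes two bytes starting with 0xc3 and \xa0 becomes 0xc2 0xa0.
# In ISO-8859-1, each of them is a single byte (0xe9 and 0xa0).

# Decoding the UTF-8 bytes with the ascii codec fails on the first non-ASCII byte.
try:
    utf8_bytes.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# In Python 3, str.strip() treats \xa0 as whitespace, so it is removed at the edges.
stripped = "\u00a0prix\u00a0".strip()
```

Here ascii_error.start points at the offending 0xc3 byte, the same kind of byte the traceback in the question complains about.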

BoobaGump