Python script chokes on a downloaded file because of unicode encode error

Question

I run a script four times a day that uses the requests module to download a file, which I then load into a database. Nine times out of ten the script works flawlessly. When it fails, it is because of a character in the downloaded file that my script, as written, cannot handle. For example, here's the error I got today: UnicodeEncodeError: 'ascii' codec can't encode characters in position 379-381: ordinal not in range(128). I downloaded the file another way, and the character at position 380, which I believe is what stops my script, is "∞". Here's the place in my script where it chokes:

##### request file

r = requests.get('https://resources.example.com/requested_file.csv')

##### create the database importable csv file

ld = open('/requested_file.csv', 'w')
print(r.text, file=ld)

I know this probably has to do with encoding the file somehow before printing it to the .csv file, and is probably a simple thing for someone who knows what they are doing but, after many hours of research, I'm about to cry. Thanks for your help in advance!

Related: http://stackoverflow.com/questions/17856610/python-3-unicode-encode-error — Holloway, Mar 19 '15 at 15:59
You'll need to know the encoding ([see here](https://docs.python.org/2/library/codecs.html#standard-encodings)). Does whoever posts the csv files tell you what to use? — tdelaney, Mar 19 '15 at 16:07
@tdelaney this looks like python3 from the print function - so [these docs](https://docs.python.org/3/library/codecs.html#standard-encodings). Not sure if anything changed. — Holloway, Mar 19 '15 at 16:17

Martijn Pieters · Answer 1 · 2015-03-19T16:23:24.320

You need to provide an encoding for your file. When none is given, open() falls back to the locale's preferred encoding, which on your system is evidently ASCII, a very limited codec.

You could use UTF-8 instead, for example:

with open('/requested_file.csv', 'w', encoding='utf8') as ld:
    print(r.text, file=ld)
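
To see what the encoding argument changes, here is a minimal sketch that writes a string containing "∞" to a temporary file twice, once as ASCII and once as UTF-8. The helper name try_write and the sample text are made up for illustration; they are not part of the script in the question.

```python
import os
import tempfile

def try_write(text, encoding):
    """Write text to a temporary file; return the bytes written, or None if encoding fails."""
    fd, path = tempfile.mkstemp(suffix='.csv')
    os.close(fd)
    try:
        # newline='' keeps '\n' as-is on every platform
        with open(path, 'w', encoding=encoding, newline='') as f:
            f.write(text)
        with open(path, 'rb') as f:
            return f.read()
    except UnicodeEncodeError:
        return None
    finally:
        os.remove(path)

sample = 'id,value\n1,\u221e\n'  # hypothetical CSV content containing "∞"

ascii_result = try_write(sample, 'ascii')  # ASCII cannot represent "∞"
utf8_result = try_write(sample, 'utf-8')   # UTF-8 encodes "∞" as three bytes
print(ascii_result)
print(utf8_result)
```

The ASCII attempt hits the same UnicodeEncodeError as in the question, while the UTF-8 attempt stores "∞" as the bytes E2 88 9E.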

However, since you are loading from a URL you are now decoding then encoding again. A better idea is to just copy the data straight to disk as bytes. Make a streaming request and have shutil.copyfileobj() copy the data in chunks. That way you can handle any size of response without loading everything into memory:

import requests
import shutil

r = requests.get('https://resources.example.com/requested_file.csv', stream=True)
with open('/requested_file.csv', 'wb') as ld:
    r.raw.decode_content = True  # decompress gzip or deflate responses
    shutil.copyfileobj(r.raw, ld)
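
The byte-for-byte copy can be tried without a network connection. In this sketch an io.BytesIO object stands in for r.raw (an assumption made so the example runs offline) and the payload is made-up sample data; shutil.copyfileobj() accepts any readable binary file object.

```python
import io
import os
import shutil
import tempfile

# Stand-in for r.raw: UTF-8 encoded bytes as a server might send them (made-up data).
payload = 'id,value\n1,\u221e\n'.encode('utf-8')
fake_raw = io.BytesIO(payload)

fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
try:
    # Binary mode: nothing is decoded or encoded, so no codec can fail.
    with open(path, 'wb') as ld:
        shutil.copyfileobj(fake_raw, ld)
    with open(path, 'rb') as check:
        copied = check.read()
finally:
    os.remove(path)

print(copied == payload)
```

Because the file is opened with 'wb', the bytes on disk are exactly the bytes received, whatever characters they happen to represent.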

edited Mar 19 '15 at 16:23

answered Mar 19 '15 at 16:01

Martijn Pieters


Did this but now get a different but similar error: "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 379: ordinal not in range(128)" – Jeff F Mar 19 '15 at 16:17
@JeffF: sounds like you have a new question; something is trying to decode bytes as ASCII. For your posted code that could mean the server told `requests` that the text data was encoded using `ASCII` but in reality it is not. `C3` is not a valid ASCII codepoint. – Martijn Pieters Mar 19 '15 at 16:23
@JeffF: since you are copying URL data straight to a file, better to open the file in binary mode and just copy the data across straight. No decoding, no encoding. – Martijn Pieters Mar 19 '15 at 16:24

score 0 · Accepted Answer · answered Apr 24 '15 at 00:04

I tried a lot of different things but here's what ended up working for me:

import io
import requests

##### request file

r = requests.get('https://resources.example.com/requested_file.csv')

##### create the db importable csv file

# write the response text to a temporary file as UTF-8 bytes
with open('requested_file_TEMP.csv', 'wb') as ld:
    ld.write(r.text.encode('utf-8'))

##### run the temp file through the following code to get rid of any non-ascii characters
##### in the file; non-ascii characters can/will cause the script to choke

# the output must be a different file from the input: opening the temp file
# with 'w' would truncate it before it is read ('requested_file.csv' is assumed here)
with io.open('requested_file_TEMP.csv', 'r', encoding='utf-8', errors='ignore') as infile, \
     io.open('requested_file.csv', 'w', encoding='ascii', errors='ignore') as outfile:
    for line in infile:
        # note: split() also collapses runs of whitespace into single spaces
        print(*line.split(), file=outfile)
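
If dropping non-ASCII characters really is acceptable for the import, the same effect can be had in memory, without a temporary file. This is a sketch of an alternative, not the code above: strip_non_ascii is a hypothetical helper, and unlike the line.split() approach it leaves whitespace untouched.

```python
def strip_non_ascii(text):
    """Return text with every character outside the ASCII range removed."""
    return text.encode('ascii', errors='ignore').decode('ascii')

sample = 'id,value\n1,\u221e\n2,caf\u00e9\n'  # made-up sample data
cleaned = strip_non_ascii(sample)
print(repr(cleaned))
```

With requests this would be applied as strip_non_ascii(r.text) before writing the result to the output file. The characters are discarded, not transliterated, so "café" becomes "caf".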
