
Trying to create a Twitter bot that reads lines from a file and posts them. Using Python 3 and tweepy, via a virtualenv on my shared server space. This is the part of the code that seems to have trouble:

#!/foo/env/bin/python3

import re
import tweepy, time, sys

argfile = str(sys.argv[1])

filename=open(argfile, 'r')
f=filename.readlines()
filename.close()

This is the error I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 0: ordinal not in range(128)

The traceback specifically points to f=filename.readlines() as the source of the error. Any idea what might be wrong? Thanks.

r_e_cur
    [See this post](http://stackoverflow.com/questions/34837421/python-script-receiving-a-unicodeencodeerror-ascii-codec-cant-encode-charact), it has two really helpful answers you should try. – Kevin Jan 27 '16 at 11:32
    I have used encoding='iso-8859-1'; it solved my problem – hsinghal Feb 02 '17 at 07:18
    @hsinghal: ISO-8859-1 (aka latin-1) will always work, but it's often *wrong*. The problem is that it *can* decode any byte from any encoding, but if the original text isn't really latin-1, it's going to decode to garbage. You *need* to know the real encoding, not just guess; UTF-8 is mostly self-checking, so it's unlikely to decode binary gibberish, but latin-1 will happily decode binary gibberish to text gibberish and never whisper a word of complaint. – ShadowRanger May 11 '19 at 00:22
    @ShadowRanger Thank you for your explanation. It adds to my current knowledge. – hsinghal Jul 28 '19 at 04:05

3 Answers


I think the best answer (in Python 3) is to use the errors= parameter of open():

with open('evil_unicode.txt', 'r', errors='replace') as f:
    lines = f.readlines()

Proof:

>>> s = b'\xe5abc\nline2\nline3'
>>> with open('evil_unicode.txt','wb') as f:
...     f.write(s)
...
16
>>> with open('evil_unicode.txt', 'r') as f:
...     lines = f.readlines()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 0: invalid continuation byte
>>> with open('evil_unicode.txt', 'r', errors='replace') as f:
...     lines = f.readlines()
...
>>> lines
['�abc\n', 'line2\n', 'line3']
>>>

Note that errors= can be set to 'replace' or 'ignore' (Python's other standard handlers, such as 'backslashreplace', work here too). Here's what 'ignore' looks like:

>>> with open('evil_unicode.txt', 'r', errors='ignore') as f:
...     lines = f.readlines()
...
>>> lines
['abc\n', 'line2\n', 'line3']
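If losing the bad bytes silently bothers you, 'backslashreplace' is another standard handler worth knowing about. A small sketch, reusing the same hypothetical file (encoding='utf-8' is passed explicitly so the result doesn't depend on your locale):

```python
# Write the same problem bytes, then read with errors='backslashreplace',
# which turns undecodable bytes into visible \xNN escapes instead of
# dropping them (ignore) or masking them with U+FFFD (replace).
with open('evil_unicode.txt', 'wb') as f:
    f.write(b'\xe5abc\nline2\nline3')

with open('evil_unicode.txt', 'r', encoding='utf-8',
          errors='backslashreplace') as f:
    lines = f.readlines()

print(lines)  # ['\\xe5abc\n', 'line2\n', 'line3']
```

That way the offending byte survives in the output where you can see it, which can help you figure out what the real encoding was.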
caleb

Your default encoding appears to be ASCII, while the input is more than likely UTF-8. When you hit non-ASCII bytes in the input, it throws the exception. It's not so much that readlines itself is responsible for the problem; rather, it's causing the read+decode to occur, and the decode is failing.

It's an easy fix though; the default open in Python 3 allows you to provide the known encoding of an input, replacing the default (ASCII in your case) with any other recognized encoding. Providing it allows you to keep reading as str (rather than the significantly different raw binary data bytes objects), while letting Python do the work of converting from raw disk bytes to true text data:

# Using with statement closes the file for us without needing to remember to close
# explicitly, and closes even when exceptions occur
with open(argfile, encoding='utf-8') as inf:
    f = inf.readlines()

If the file is some other encoding, you'd change encoding='utf-8' to the appropriate argument. Note that while some people will tell you to "Just use 'latin-1'" here if 'utf-8' doesn't work:

  1. That's often wrong (modern text editors tend to produce UTF-8 or UTF-16, with latin-1 being much less common; frankly, you're more likely to see Microsoft's 'latin-1' variant, 'cp1252', that's mostly the same but remaps some characters to support stuff like smart quotes), and
  2. Unlike the UTF encodings, the various byte-per-character ASCII superset encodings (including 'latin-1', 'cp1252', 'cp437', and many others) are not self-checking; if the data isn't in the encoding specified, they'll still happily decode it, it will just produce gibberish for stuff above the ASCII range.

In short, if your data isn't a UTF encoding (or one of the rare non-UTF self-checking encodings), you need to know the encoding used, or you're stuck guessing and checking the result to see if it makes sense (and for stuff like a source that might be latin-1 or cp1252, you'll never be sure unless it eventually contains a cp1252-specific character).
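The self-checking distinction is easy to demonstrate in a few lines; this sketch uses a few made-up bytes that are not valid UTF-8:

```python
# Bytes that are legal latin-1 but not legal UTF-8.
raw = b'\xe5\xf8\xfe'

# UTF-8 is self-checking: decoding non-UTF-8 data raises.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# latin-1 maps every possible byte to *some* character,
# so it "succeeds" no matter what it's fed.
as_latin1 = raw.decode('latin-1')

print(utf8_ok)    # False
print(as_latin1)  # åøþ -- a str, but quite possibly garbage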

ShadowRanger
    I like the simplicity of this solution but I just tried it in python 3.6.8 and it fails. – M.H. Jun 17 '19 at 21:11
    @M.H.: It will work *on UTF-8 data*. If it's not UTF-8, you need to figure out what it *is*. This will work just as well on 3.6.8 as on any other 3.x release (and on Python 2.6+ for that matter, if you do `from io import open` to replace the Py2 `open` with the Py3 version). If you don't know the encoding though, you're stuck guessing. – ShadowRanger Jun 17 '19 at 23:06
  • @r_e_cur: I rejected your edit because, even if your case happened to work with latin-1, latin-1 is a *trap*, and should not be anyone's first (or second, or third) attempt to solve the issue unless they *know*, without a shadow of a doubt, that the source data is *actually* in latin-1. It'll "work" with completely random bytes, and UTF-8 bytes, and UTF-16 bytes; decoding them all as latin-1 will get you a string, but that string will be garbage. UTF-8 is self-checking and therefore any meaningful amount of data will error if it's not *really* UTF-8, making it a much safer choice. – ShadowRanger Aug 11 '23 at 15:16
  • I did add notes on using it, but rather than including it as a code sample that will be copied and pasted without thinking, I made notes on why *not* to use it, and when you can use it. I strongly suspect latin-1 is wrong for you even if you say it works, because on most Western European Windows systems, cp1252 (which is similar to latin-1, but not exactly the same) is the actual default locale encoding (when the data isn't stored as UTF-16, which most Windows programs use nowadays), and on basically every non-Windows system outside of East Asia (and even some in it), UTF-8 is the default. – ShadowRanger Aug 11 '23 at 15:20
  • Oh, hmm. Misread, it wasn't r_e_cur who proposed the edit, it was an "anonymous user". I didn't even realize that was a thing on StackOverflow. *shrugs* I'll leave these comments in place if they ever come back to check. – ShadowRanger Aug 11 '23 at 15:34

Ended up finding a working answer for myself:

filename=open(argfile, 'rb')

This post helped me out a lot.
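For anyone weighing this workaround, a short sketch (with hypothetical file contents) of what binary mode actually changes: readlines() now returns bytes objects, and turning those back into text still requires knowing the encoding:

```python
# Create a hypothetical input file containing one undecodable byte.
with open('evil_unicode.txt', 'wb') as f:
    f.write(b'\xe5abc\nline2\nline3')

# Binary mode sidesteps the decode error entirely...
with open('evil_unicode.txt', 'rb') as f:
    raw_lines = f.readlines()

print(type(raw_lines[0]))  # <class 'bytes'>

# ...but any string operation now needs an explicit decode,
# which brings the original encoding question right back.
text = raw_lines[1].decode('utf-8')
print(repr(text))  # 'line2\n'
```

So this "works" in the sense that the exception goes away, but the undecoded bytes are only deferred, not dealt with.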

r_e_cur
    If you're actually using Python 3, this is going to dramatically change your behavior; opening in binary mode means not only do you not get line ending translation (admittedly only an issue on Windows), but you get back `bytes` objects instead of `str` (and must manually `decode` them if you want to work with `str`). I posted [an answer that avoids this](http://stackoverflow.com/a/35044042/364696) (assuming you know the encoding, which you'd need to know to perform the `decode` anyway). – ShadowRanger Jan 27 '16 at 17:26