This is my first time using Python in a while, and I'm having trouble with a simple scan of a file. When I run the following script with Python 3.0.1:

with open("/usr/share/dict/words", 'r') as f:
   for line in f:
       pass

I get this exception:

Traceback (most recent call last):
  File "/home/matt/install/test.py", line 2, in <module>
    for line in f:
  File "/home/matt/install/root/lib/python3.0/io.py", line 1744, in __next__
    line = self.readline()
  File "/home/matt/install/root/lib/python3.0/io.py", line 1817, in readline
    while self._read_chunk():
  File "/home/matt/install/root/lib/python3.0/io.py", line 1565, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/home/matt/install/root/lib/python3.0/io.py", line 1299, in decode
    output = self.decoder.decode(input, final=final)
  File "/home/matt/install/root/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1689-1692: invalid data

The line in the file it blows up on is "Argentinian", which doesn't seem to be unusual in any way.

Update: I added,

encoding="iso-8559-1"

to the open() call, and it fixed the problem.

Matt R
  • Are you sure that you didn't mean `iso-8859-1`? That seems to be much more common. Plus, \xf3 is "ó" in Asunción in iso-8859 (and it's code-point U+00F3 in Unicode), but in UTF-8, it would be represented as '\xc3\xb3'. – Michael Lorton Aug 02 '11 at 06:41
  • @Malvolio: It's entirely possible I typed the encoding name wrong ;-) – Matt R Aug 02 '11 at 10:20
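The byte values mentioned in the comment above can be checked at the prompt. This is only a small sketch: the sample word is hard-coded here rather than read from the real word list, and it assumes the file really is ISO-8859-1.

```python
raw = b"Asunci\xf3n"  # the word as it would be stored in an ISO-8859-1 file

# In ISO-8859-1 each byte maps to the code point with the same value
text = raw.decode("iso-8859-1")
print(text)                # Asunción
print(hex(ord(text[6])))   # 0xf3, i.e. U+00F3

# The same character needs two bytes in UTF-8
print(text[6].encode("utf-8"))  # b'\xc3\xb3'

# Decoding the ISO-8859-1 bytes as UTF-8 fails, as in the traceback
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8 at byte", exc.start)
```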

2 Answers

Can you check to make sure it is valid UTF-8? A way to do that is given at this SO question:

iconv -f UTF-8 /usr/share/dict/words -o /dev/null

There are other ways to do the same thing.

Matthew Flaschen

How have you determined from "position 1689-1692" what line in the file it has blown up on? Those numbers would be offsets in the chunk that it's trying to decode. You would have had to determine what chunk it was -- how?

Try this at the interactive prompt:

buf = open('the_file', 'rb').read()
len(buf)
ubuf = buf.decode('utf8')
# splat ... but the error message will give you the byte offset into the file;
# substitute that number for `offset` below
buf[offset-50:offset+60] # should show you where/what the problem is
# By the way, from the error message, looks like a bad
# FOUR-byte UTF-8 character ... interesting
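The offset does not have to be copied out of the error message by hand, since `UnicodeDecodeError` carries it in its `start` attribute. Here is a minimal sketch of the same check as a function. The name `show_decode_problem` and the `context` parameter are made up for illustration, and the sample bytes only stand in for the real file.

```python
def show_decode_problem(buf, context=20, encoding="utf-8"):
    """Return (offset, surrounding bytes) for the first undecodable byte, or None."""
    try:
        buf.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index into buf where decoding failed
        lo = max(exc.start - context, 0)
        return exc.start, buf[lo:exc.start + context]
    return None


# Sample data standing in for the file contents (an assumption for this demo)
sample = b"Aswan\nAsunci\xf3n\nAtlas\n"
print(show_decode_problem(sample))
```

For a real file, pass in `open('the_file', 'rb').read()` instead of `sample`.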
John Machin
  • I had it print out each line as it looped over them, and assumed it blew up on the line after the last one printed when the exception was thrown. But I tried what you suggested, and it seems to blow up at a different point. I got: buf[9881-20:9881+20]= b"as\nAsturias's\nAsunci\xf3n\nAsunci\xf3n's\nAswan\n", which indeed has a funny character in "Asunción". – Matt R Jun 19 '09 at 12:10
  • (1) 9881-1689 == 8192 == multiple of chunk-size (2) Not seems, DID blow up at file offset 9881, confirmed by iconv experiment. (3) no guarantee that stdout is flushed when exception raised, that's why you saw Argentina (4) neither funny-haha nor funny-peculiar; try neutral terminology like "non-ASCII" (5) specifying encoding="iso-8559-1" fixed the problem only if you are 100% certain that that is the correct encoding -- as that encoding uses all 256 8-bit code points, any old executable or file of encrypted random bytes will "successfully" decode. – John Machin Jun 19 '09 at 14:18
  • Thanks for your technical comments, it now makes sense. (While I'll try neutral terminology, perhaps you could try to lighten up a little? After all, this is a coding website, not international diplomacy.) – Matt R Jun 19 '09 at 21:11
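Points (1) and (5) in the comment above can be reproduced with a short experiment. This is a sketch under assumptions: the data is synthetic, the 8192-byte chunk size is taken from the comment rather than from the `io` source, and `locate_bad_byte` is a made-up helper that feeds an incremental UTF-8 decoder one chunk at a time.

```python
import codecs


def locate_bad_byte(data, chunk_size=8192):
    """Decode chunk by chunk; return (reported_offset, file_offset) or None."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk_start in range(0, len(data), chunk_size):
        chunk = data[chunk_start:chunk_start + chunk_size]
        # Bytes of an incomplete character carried over from the previous chunk
        pending = len(decoder.getstate()[0])
        is_last = chunk_start + chunk_size >= len(data)
        try:
            decoder.decode(chunk, final=is_last)
        except UnicodeDecodeError as exc:
            # exc.start counts from the bytes handed to the decoder,
            # not from the beginning of the file
            return exc.start, chunk_start - pending + exc.start
    return None


# Synthetic file: one full chunk of ASCII, then a bad byte 1689 bytes later
data = b"x\n" * 4096 + b"y" * 1689 + b"\xf3n\n"
print(locate_bad_byte(data))  # (1689, 9881)

# ISO-8859-1 assigns a character to every byte value, so any input "decodes"
every_byte = bytes(range(256))
print(len(every_byte.decode("iso-8859-1")))  # 256
```

The first result shows why the position in the traceback (1689) differs from the file offset (9881). The second shows why a successful ISO-8859-1 decode says nothing about whether that encoding is the right one.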