
I have a plain ASCII file. When I try to open it with codecs.open(..., "utf-8"), I am unable to read single characters. ASCII is a subset of UTF-8, so why can't codecs.open read such a file correctly in UTF-8 mode?

# test.py

import codecs

f = codecs.open("test.py", "r", "utf-8")

# ASCII is supposed to be a subset of UTF-8:
# http://www.fileformat.info/info/unicode/utf8.htm

assert len(f.read(1)) == 1 # OK
f.readline()
c = f.read(1)
print len(c)
print "'%s'" % c
assert len(c) == 1 # fails

# max% p test.py
# 63
# '
# import codecs
#
# f = codecs.open("test.py", "r", "utf-8")
#
# # ASC'
# Traceback (most recent call last):
#   File "test.py", line 15, in <module>
#     assert len(c) == 1 # fails
# AssertionError
# max%

system:

Linux max 4.4.0-89-generic #112~14.04.1-Ubuntu SMP Tue Aug 1 22:08:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Of course it works with regular open. It also works if I remove the "utf-8" option. Also what does 63 mean? That's like the middle of the 3rd line. I don't get it.

Martijn Pieters
personal_cloud
  • Are you sure you aren't reading 0 characters? – Ignacio Vazquez-Abrams Sep 27 '17 at 00:43
  • @Ignacio I added more debug to show that in fact I'm getting 63 characters. This is exactly what I saw in the real-system example originally too. – personal_cloud Sep 27 '17 at 00:47
  • Also print out the character itself, not just the length, and the byte code for the character. –  Sep 27 '17 at 00:47
  • The length suggests it includes the previous readline result as well; interesting. –  Sep 27 '17 at 00:48
  • @Evert Yes, very interesting indeed! It seems to be reading *starting* at the right spot, but it grabs the wrong number of characters (63 vs. 1). Also, by the way, `f.tell()` seems just as unreliable. Also come to think of it, in my last example it was off by 57 and the `readline()` was a little longer. Seems like it could be grabbing `(72? - len(readline())`... but why? – personal_cloud Sep 27 '17 at 00:51
  • Well, at least I can reproduce this on my Mac, with Python versions 2.7.10, 2.7.13 and 3.6.2. `len(c)` for me is 59 in all cases. –  Sep 27 '17 at 00:55
  • It likely is an issue with `readline()`, since (from the docs): "Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.". –  Sep 27 '17 at 00:56
  • Side-note: *Never* use `codecs.open`. It's buggy in weird ways (as @Evert notes, it has to open files in binary mode, which has all sorts of side-effects). Try [using `io.open` instead](https://docs.python.org/2/library/io.html#io.open); it's the same as plain `open` on Py3, provides the same interface on Py2, and is both faster and more correct than `codecs.open` (which is essentially deprecated). I suspect your problem will disappear. – ShadowRanger Sep 27 '17 at 01:11
  • @Evert. Hmm. I wouldn't think '\n' would need to be converted here, since it is both the actual value in the file and the value that `readline()` is searching for. Also I checked the `readline()` *result* and it looks fine... but perhaps it is having a side-effect. If I change `readline()` to `read(9)` then it works. I agree that the `readline()` would be suspect here, as it probably reads ahead several characters to work efficiently. – personal_cloud Sep 27 '17 at 01:14
  • @ShadowRanger yeah the `io.open()` is working better, even with the `utf-8` option. OK so yeah maybe can we find some good references on the deprecation/bugginess of `codecs`? I would accept that. – personal_cloud Sep 27 '17 at 01:18
  • @personal_clown: `codecs.open` is in a weird state. It's strictly inferior to `io.open` except in some very unusual cases (bytes<->bytes codecs like ROT13 and hex, as opposed to the standard bytes<->Unicode codecs). Because of the weirdo use cases, they've been reluctant to officially deprecate it, but if you check the Python bug tracker, it's often referred to as pseudo-deprecated. [PEP 400 includes a bunch of reasons](https://www.python.org/dev/peps/pep-0400/#streamreader-and-streamwriter-issues) why `StreamReader`/`StreamWriter`/`StreamReaderWriter` (what `codecs.open` creates) are broken. – ShadowRanger Sep 27 '17 at 01:30
  • Interesting that `f.read(1)` reads a whole line. I got the same results on Windows, but if I *only* do `f.read(1)` or *only* `f.readline()` on the whole file they work fine. It seems to be an interaction when you mix them. – Mark Tolonen Sep 27 '17 at 01:42
  • @ShadowRanger and Mark yes this looks an awful lot like [issue 8260](https://bugs.python.org/issue8260), doesn't it? OK I'll accept that as an answer. Thank you. – personal_cloud Sep 27 '17 at 02:13
  • @personal_clown: Heh, I'd found that bug a little bit ago, and it sent me down the rabbit hole that led to my answer. That bug seems to be fixing specific symptoms of the problem I describe, but fundamentally, it's an incompatibility between the definitions of `size` used by `StreamReaderWriter` and `StreamReader`. – ShadowRanger Sep 27 '17 at 02:17
  • `StreamReader`... I see. But it looked and quacked like a `file`!!! And now `io` is neither a `file` nor a `StreamReader`. Interesting. Yeah, the real bug is that they made something that was *almost* a `file`, and yet had the argument order mixed up. Thank you for your help. – personal_cloud Sep 27 '17 at 02:18
  • General advice: Stay *away* from `codecs.open()`. Use `io.open()` to use the far better Python 3 I/O implementation backported to Python 2. – Martijn Pieters Sep 28 '17 at 05:59

1 Answer


Found your problem:

When passed an encoding, codecs.open returns a StreamReaderWriter, which is really just a wrapper around (not a subclass of; it's a "composed of" relationship, not inheritance) StreamReader and StreamWriter. Problem is:

  1. StreamReaderWriter provides a "normal" read method (that is, it takes a size parameter and that's it)
  2. It delegates to the internal StreamReader.read method, where the size argument is only a hint as to the number of bytes to read, but not a limit; the second argument, chars, is a strict limiter, but StreamReaderWriter never passes that argument along (it doesn't accept it)
  3. When size is hinted but not capped with chars, and StreamReader already holds buffered data large enough to satisfy the hint, StreamReader.read blindly returns the contents of its buffer rather than limiting it in any way based on the size hint (after all, only chars imposes a maximum return size)
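The size-hint vs. chars-cap distinction in points 2 and 3 can be seen by calling the wrapped StreamReader directly. A minimal sketch (demo.txt is a hypothetical file name; how far the hinted read over-returns depends on readline's internal chunk size and on your Python version):

```python
import codecs

# Build a small ASCII file to read back (demo.txt is a stand-in name).
with open("demo.txt", "w") as out:
    out.write("first line\n" + "x" * 40 + "\n")

# readline() primes StreamReader's internal character buffer...
f1 = codecs.open("demo.txt", "r", "utf-8")
f1.readline()
hinted = f1.reader.read(1)     # size=1 is only a hint; affected versions return the whole buffer
f1.close()

# ...whereas passing chars caps the result no matter what is buffered.
f2 = codecs.open("demo.txt", "r", "utf-8")
f2.readline()
capped = f2.reader.read(1, 1)  # chars=1 is a hard limit: exactly one character back
f2.close()

print(len(hinted), len(capped))
```

On an affected interpreter, `len(hinted)` comes back as the whole buffered remainder of the line-read chunk, while `len(capped)` is always 1.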

The API of StreamReader.read, and the meaning of size/chars for that API, is the only documented thing here; the fact that codecs.open returns a StreamReaderWriter is not contractual, nor is the fact that StreamReaderWriter wraps StreamReader. I just used IPython's ?? magic to read the source code of the codecs module and verify this behavior. But documented or not, that's what it's doing (feel free to read the source for StreamReaderWriter; it's all Python level, so it's easy to follow).

The best solution is to switch to io.open, which is faster and more correct in every standard case. (codecs.open does support the weirdo codecs that don't convert between bytes [Py2 str] and str [Py2 unicode], but rather handle str-to-str or bytes-to-bytes encodings; that's an incredibly limited use case, though, and most of the time you're converting between bytes and str.) All you need to do is import io instead of codecs and change the codecs.open line to:

f = io.open("test.py", encoding="utf-8")

The rest of your code can remain unchanged (and will likely run faster to boot).
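For example, reworking the pattern from the question on a throwaway file (demo.txt is a stand-in name, rather than the script reading itself, and print is spelled Python 3 style here):

```python
import io

# Build a small ASCII file to read back (demo.txt is a stand-in name).
with io.open("demo.txt", "w", encoding="utf-8") as out:
    out.write(u"import codecs\n# a second line of plain ASCII\n")

with io.open("demo.txt", "r", encoding="utf-8") as f:
    assert len(f.read(1)) == 1  # OK, same as before
    f.readline()
    c = f.read(1)
    assert len(c) == 1          # passes: io.open caps read(1) at one character
print("read(1) after readline() returned exactly 1 character")
```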

As an alternative, you could explicitly bypass StreamReaderWriter to get the StreamReader's read method and pass the limiting argument directly, e.g. change:

c = f.read(1)

to:

# Pass second, character limiting argument after size hint
c = f.reader.read(6, 1)  # 6 is sort of arbitrary; should ensure a full char read in one go

I suspect Python Bug #8260, which covers intermingling readline and read on codecs.open-created file objects, applies here. Officially, it's "fixed", but if you read the comments, the fix wasn't complete (and may not be possible to complete given the documented API); arbitrarily weird combinations of read and readline will still be able to break it.
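Consistent with the comments above, unmixed calls behave themselves; it's the mixture that exposes the buffer. A quick sketch (demo.txt is a stand-in file name):

```python
import codecs

with open("demo.txt", "w") as out:
    out.write("abc\ndef\n")

# A pure read(1) loop never over-returns: with nothing pre-buffered by
# readline, each size=1 hint reads and decodes one byte at a time.
f = codecs.open("demo.txt", "r", "utf-8")
pieces = []
while True:
    ch = f.read(1)
    if not ch:
        break
    assert len(ch) == 1  # the hint is honored when the buffer starts empty
    pieces.append(ch)
f.close()
print("".join(pieces) == "abc\ndef\n")  # → True
```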

Again, just use io.open; as long as you're on Python 2.6 or higher, it's available, and it's just plain better.

ShadowRanger
  • Yeah, I really only started seeing my problems right after I started mixing `readline` and `read`. Must be the 8260 as you say. Thanks for all the helpful research. I'll take a closer look at `io` and other options. – personal_cloud Sep 27 '17 at 02:30
  • @personal_clown: Yeah, `readline` causes the problem because, to avoid excessive system call overhead, it buffers (72 bytes IIRC). If you just performed `read(1)` calls, it would read one byte at a time without buffering (because you "hinted" you only needed a byte and it trusts the hint); you'd never get in a position where `size` vs. `chars` mattered. But `readline` prepopulates the buffer, so now `size` without `chars` means roughly "return greater of current buffer or `size` bytes worth of characters". – ShadowRanger Sep 27 '17 at 02:36