
Calling tell() while reading a GBK-encoded file of mine causes the next call to readline() to raise a UnicodeDecodeError. However, if I don't call tell(), it doesn't raise this error.

C:\tmp>hexdump badtell.txt

000000: 61 20 6B 0D 0A D2 BB B0-E3                       a k......

C:\tmp>type test.py

with open(r'c:\tmp\badtell.txt', "r", encoding='gbk') as f:
    while True:
        pos = f.tell()
        line = f.readline();
        if not line: break
        print(line)

C:\tmp>python test.py

a k

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    line = f.readline();
UnicodeDecodeError: 'gbk' codec can't decode byte 0xd2 in position 0:  incomplete multibyte sequence

When I remove the f.tell() statement, the file decodes successfully. Why? I tried Python 3.4 and 3.5 (x64) on Windows 7 and Windows 10, and the result is the same everywhere.

Does anyone have any idea? Should I report a bug?

I have a big text file, and I need the file position range of each part of it. Is there a workaround?

Dan Getz
mfmain
  • I had a big text file, and I really want file position ranges to show parts of this big text, is there a workaround? thanks – mfmain May 10 '16 at 02:51

2 Answers


I just replicated this on Python 3.4 x64 on Linux. Looking at the docs for TextIOBase, I don't see anything that says tell() causes problems with reading a file, so maybe it is indeed a bug.

b'\xd2'.decode('gbk')

gives an error like the one that you saw, but in your file that byte is followed by the byte BB, and

b'\xd2\xbb'.decode('gbk')

gives a value equal to '\u4e00', not an error.
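
Both results can be checked in one short script. This is only a sketch of how the codec treats these two byte strings on their own; it says nothing about what tell() does internally.

```python
# A lone GBK lead byte cannot be decoded, but the full two-byte sequence can.
lead_only = b'\xd2'
full_char = b'\xd2\xbb'

try:
    lead_only.decode('gbk')
    lead_failed = False
except UnicodeDecodeError as exc:
    lead_failed = True
    print('lone lead byte:', exc.reason)

decoded = full_char.decode('gbk')
# ascii() keeps the output printable on consoles that cannot show the character.
print('full sequence:', ascii(decoded))
```

So the error message in the question suggests that readline() handed the decoder the D2 byte without the BB byte that follows it in the file.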

I found a workaround that works for the data in your original question, but not for other data, as you've since found. Wish I knew why! I called seek() after every tell(), with the value that tell() returned:

pos = f.tell()
f.seek(pos)
line = f.readline()

An alternative to f.seek(f.tell()) is to use the SEEK_CUR mode of seek() to get the position. With an offset of 0, this does the same as the code above: it moves to the current position and returns that position.

import io  # needed for io.SEEK_CUR

pos = f.seek(0, io.SEEK_CUR)
line = f.readline()
Dan Getz
  • Great! f.seek(0, io.SEEK_CUR) works for me on python 3.4 x64/win7 – mfmain May 11 '16 at 01:58
  • Something's gone weird again, this time, seek does the bad thing: – mfmain May 11 '16 at 07:38
  • C:\tmp>hexdump -r badtell.txt 5B 74 61 67 5F 75 73 65 72 20 74 61 67 5F 47 65 74 4D 65 73 73 61 67 65 20 74 61 67 5F 6D 65 73 73 61 67 65 5D 0D 0A D3 C3 BB A7 CF FB CF A2 3A 20 57 4D 5F 55 53 45 52 20 7E 20 2E 0D 0A 0D 0A 2F 2F 20 54 0D 0A 2F 2F 20 CD F9 45 44 49 54 CE C4 B1 BE 0D 0A 0D 0A 0D 0A 0D 0A 0D 0A 0D 0A 0D 0A 0D 0A – mfmain May 11 '16 at 07:46
  • @mfmain weird, and with this other file data I get no error if I use your original code without `seek()`…no idea why – Dan Getz May 11 '16 at 12:00
  • Thanks for your help anyway. Opening the file with "rb" works as a workaround. I guess binary-mode tell() is implemented by Modules/_io/fileio.c:_io_FileIO_tell_impl(), which is simple, whereas textio.c:_io_TextIOWrapper_tell_impl() is far more complicated, and I have no time to dig into it. – mfmain May 12 '16 at 08:18

OK, there is a workaround: open the file in binary mode and decode each line yourself. It works so far:

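with open(r'c:\tmp\badtell.txt', "rb") as f:
    while True:
        pos = f.tell()  # byte offset of the start of this line
        line = f.readline()
        if not line:
            break
        # Binary mode does no newline translation, so strip '\r\n' explicitly.
        line = line.decode("gbk").rstrip('\r\n')
        print(line)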
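
Since what I actually need is the position range of each line, the same binary-mode loop can be wrapped in a generator. This is a minimal sketch, not a tested library routine: `iter_line_ranges` is a hypothetical helper name, and it assumes a line break never falls inside a multibyte character (true for GBK, whose trail bytes are never 0x0A or 0x0D).

```python
def iter_line_ranges(path, encoding='gbk'):
    """Yield (start, end, text) for each line; start and end are byte offsets."""
    with open(path, 'rb') as f:
        while True:
            start = f.tell()      # position before reading the line
            raw = f.readline()    # bytes up to and including b'\n'
            if not raw:
                break
            end = f.tell()        # position just past the line terminator
            yield start, end, raw.decode(encoding).rstrip('\r\n')

# Example use (the path is the one from the question):
# for start, end, text in iter_line_ranges(r'c:\tmp\badtell.txt'):
#     print(start, end, text)
```

The offsets are plain byte positions, so they can be passed back to seek() on a file opened in binary mode.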

I submitted an issue yesterday here: http://bugs.python.org/issue26990

There has been no response yet.

mfmain