I have a raw text file containing only the following line, and no newline:

Q853 \u0410\u043D\u0434\u0440\u0435\u0439 \u0410\u0440\u0441\u0435\u043D\u044C\u0435\u0432\u0438\u0447 \u0422\u0430\u0440\u043A\u043E\u0432\u0441\u043A\u0438\u0439

The characters are escaped as shown above, meaning that \u0410, for example, is really a backslash followed by 5 alphanumeric characters (and not a Unicode character). I am trying to decode the file using the following code:

import codecs

with codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape') as input:
    with open("wikidata-terms3.nt", "w") as output:
        for line in input:
            output.write(line)

Using print is not possible here; see the comments.

Running it gives me the following error:

Traceback (most recent call last):
  File "terms2.py", line 5, in <module>
    for line in input:
  File "C:\Program Files\Python35\lib\codecs.py", line 711, in __next__
    return next(self.reader)
  File "C:\Program Files\Python35\lib\codecs.py", line 642, in __next__
    line = self.readline()
  File "C:\Program Files\Python35\lib\codecs.py", line 555, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Program Files\Python35\lib\codecs.py", line 501, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 67-71: truncated \uXXXX escape

What is going on?

I am running Python 3.5.1 on Windows 8.1, and the code seems to work for most other Unicode characters (this line is the first one to cause the crash).

See edit history for the original question.

pie3636
  • Please verify that you get the error with the data you have here. With `print(line)` in place of the process line comment, I get no error. – James K Sep 02 '16 at 08:31
  • The first line causing an error is `Q501 \u05E9\u05D0\u05E8\u05DC \u05D1\u05D5\u05D3\u05DC\u05E8`, and it fails to decode the first character. However, U+05E9 seems to be valid Unicode. – pie3636 Sep 02 '16 at 08:35
  • Still can't reproduce. It prints a line of what looks like hebrew. – James K Sep 02 '16 at 08:40
  • Creating a file containing only the line `Q501 \u05E9\u05D0\u05E8\u05DC \u05D1\u05D5\u05D3\u05DC\u05E8`, with no newline, causes the error for me. I am also using a mere `print` instead of the comment. – pie3636 Sep 02 '16 at 08:45
  • If you are on Linux or similar (MacOS, cygwin), can you please `hexdump` the input file: `hexdump -C file.txt`? – Leon Sep 02 '16 at 08:56
  • Still no error for me. Perhaps someone else can reproduce. – James K Sep 02 '16 at 09:00
  • Here is the hexdump of the file : http://pastebin.com/cELKNSWk – pie3636 Sep 02 '16 at 09:01
  • @JamesK I tried to reproduce this on Fedora 24. It runs fine for me too. Also, could you check the locale and the character encoding on your system and report back here? – Shubham Vasaikar Sep 02 '16 at 09:23
  • `locale.getlocale()` returns `(None,None)`, and `locale.getdefaultlocale()` returns `('fr_FR', 'cp1252')`. Running Windows 8.1 – pie3636 Sep 02 '16 at 09:27
  • That's a problem with printing. Note how the error mentions encoding to cp850. See [Python, Unicode, and the Windows console](http://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console) – roeland Sep 02 '16 at 09:50
  • I edited the original post with the problem, after fixing this. `print` seems indeed to be unusable here. – pie3636 Sep 02 '16 at 10:00
  • Currently the error is at position 67-71. Can you insert spaces in the beginning of your line, and check if the error position moves correspondingly? – Leon Sep 02 '16 at 10:14
  • It does, but in an extremely weird way. Adding one character doesn't change the position. Adding two changes it to 68-71. Adding three changes it to 69-71, four to 70-71, five to another error: `UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 71: \ at end of string`. Adding a sixth one sets the error to position 67-71, adding a seventh one as well, adding an eighth one takes it to 68-71, and so on. – pie3636 Sep 02 '16 at 10:19
  • Reproduced on my Windows 7 with Python 2.7. Will debug it now. – Leon Sep 02 '16 at 10:30
  • I don't fully understand the problem, but seem to have found a workaround. See my answer. – Leon Sep 02 '16 at 10:43
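
The printing failure discussed in the comments can be illustrated without a console. The sketch below assumes the console code page is cp850, as one of the comments states; the sample character is the first escape of the line quoted in the comments, and `safe_repr` is a hypothetical helper name.

```python
# U+05E9 (Hebrew letter shin), the first escape of the line quoted in the comments.
text = "\u05E9"

# Assumption: the Windows console uses code page 850, which has no Hebrew letters.
try:
    text.encode("cp850")
    encodable = True
except UnicodeEncodeError:
    encodable = False


def safe_repr(s):
    # Hypothetical helper: ascii() escapes every non-ASCII character,
    # so the result can be printed on any console.
    return ascii(s)


print(encodable)
print(safe_repr(text))
```

Writing the decoded text to a file with an explicit encoding, as the question does, sidesteps the console entirely.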

1 Answer

It seems that the data read by the decoder is truncated after character #72 (0-based index 71), which cuts a \uXXXX escape in half. This appears to be related to this bug.

The following code produces the same error as in your example:

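import codecs

codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline()
codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline(72)

Increasing the readline size above the actual size of the input, or setting it to -1, eliminates the error:

codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline(1000)
codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape').readline(-1)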

Evidently, for line in input: obtains the line to be decoded with readline(), effectively truncating the data-to-be-decoded to 72 characters.
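
To illustrate why a cut in the middle of an escape matters, here is a minimal sketch. It does not reproduce the file handling from the question; it only decodes a byte string that ends partway through a \uXXXX sequence. The cut point and the helper name `try_decode` are chosen purely for illustration.

```python
import codecs

# Sample data based on the question: ASCII text containing literal \uXXXX escapes.
raw = rb"Q853 \u0410\u043D\u0434\u0440\u0435\u0439"

# Illustrative cut point: drop the last two bytes so the final escape is incomplete.
chunk = raw[:-2]


def try_decode(data):
    # Return the decoded text, or the codec's error reason if decoding fails.
    try:
        return codecs.decode(data, "unicode_escape")
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


whole = try_decode(raw)
partial = try_decode(chunk)

# ascii() keeps the output printable on consoles that cannot show Cyrillic.
print(ascii(whole))
print(partial)
```

The complete byte string decodes cleanly, while the shortened one fails with a truncated-escape error, which is the same kind of failure as in the traceback above.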

So here are a couple of workarounds:

Workaround 1:

import codecs

# Read the file as plain text, then decode the escapes of each complete line.
with open("wikidata-terms20.nt", 'r') as input:
    with open("wikidata-terms3.nt", "w") as output:
        for line in input:
            output.write(codecs.decode(line, 'unicode_escape'))

Workaround 2:

import codecs

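# readlines() decodes the whole file before splitting it into lines,
# so no escape sequence is cut off in the middle.
with codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape') as input:
    with open("wikidata-terms3.nt", "w") as output:
        for line in input.readlines():
            output.write(line)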
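
As a further sketch along the lines of Workaround 1, the input can be read in binary mode and the output written with an explicit encoding. This is not part of the original workarounds: the function name `decode_escaped_file` and the temporary file names are hypothetical, and the code assumes the input contains only ASCII bytes (the unicode_escape codec treats any other byte as Latin-1).

```python
import os
import tempfile


def decode_escaped_file(src_path, dst_path):
    # Decode literal backslash-u escapes one complete line at a time.
    # Binary mode hands over whole lines as bytes, so no escape is cut in half;
    # newline="" keeps the original line endings untouched on output.
    with open(src_path, "rb") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for raw_line in src:
            dst.write(raw_line.decode("unicode_escape"))


# Self-contained check with temporary files (names are illustrative only).
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "escaped.nt")
dst_path = os.path.join(tmp_dir, "decoded.nt")

with open(src_path, "wb") as f:
    # One long line without a trailing newline, well beyond 72 bytes.
    f.write(rb"Q853 " + rb"\u0410\u043D\u0434" * 40)

decode_escaped_file(src_path, dst_path)

with open(dst_path, encoding="utf-8") as f:
    result = f.read()
```

With the files from the question, the call would be `decode_escaped_file("wikidata-terms20.nt", "wikidata-terms3.nt")`. Writing with `encoding="utf-8"` avoids depending on the platform's default encoding.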
Leon
  • Thank you, that worked perfectly! Side note: I had to specify `encoding='utf-8'` for the `output.write` line not to fail. – pie3636 Sep 02 '16 at 11:53