I have a raw text file containing only the following line, and no newline:
Q853 \u0410\u043D\u0434\u0440\u0435\u0439 \u0410\u0440\u0441\u0435\u043D\u044C\u0435\u0432\u0438\u0447 \u0422\u0430\u0440\u043A\u043E\u0432\u0441\u043A\u0438\u0439
The characters are escaped as shown above, meaning that a sequence such as \u05E9 is really a backslash followed by 5 alphanumeric characters (and not a Unicode character). I am trying to decode the file using the following code:
    import codecs
    with codecs.open("wikidata-terms20.nt", 'r', encoding='unicode_escape') as input:
        with open("wikidata-terms3.nt", "w") as output:
            for line in input:
                output.write(line)
Using print is not possible here; see the comments.
Running it gives me the following error:
    Traceback (most recent call last):
      File "terms2.py", line 5, in <module>
        for line in input:
      File "C:\Program Files\Python35\lib\codecs.py", line 711, in __next__
        return next(self.reader)
      File "C:\Program Files\Python35\lib\codecs.py", line 642, in __next__
        line = self.readline()
      File "C:\Program Files\Python35\lib\codecs.py", line 555, in readline
        data = self.read(readsize, firstline=True)
      File "C:\Program Files\Python35\lib\codecs.py", line 501, in read
        newchars, decodedbytes = self.decode(data, self.errors)
    UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 67-71: truncated \uXXXX escape
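For what it is worth, the same message can be produced in isolation by decoding an escape that is cut off before its fourth hex digit. The sketch below only shows when the codec raises this error; the name `fragment` is made up for illustration, and whether something similar happens inside the file reader is an assumption on my part.

```python
# Hypothetical fragment: a backslash, 'u', and only three of the four hex digits.
fragment = rb"\u041"

message = None
try:
    fragment.decode("unicode_escape")
except UnicodeDecodeError as exc:
    # The codec reports the incomplete escape as "truncated".
    message = str(exc)
    start, end = exc.start, exc.end

print(message)
```

The reported range covers five bytes here, just like positions 67-71 in the traceback above.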
What is going on?
I am running Python 3.5.1 on Windows 8.1, and the code seems to work for most other Unicode characters (this line is the first one to cause the crash).
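To make the escaping concrete, here is a minimal sketch of what the file holds and what I expect to get out of it. It assumes the content is pure ASCII exactly as shown above; `raw` and `decoded` are illustrative names and not part of my script.

```python
# The file content as raw bytes: each "\uXXXX" is six literal ASCII characters.
raw = (
    rb"Q853 \u0410\u043D\u0434\u0440\u0435\u0439 "
    rb"\u0410\u0440\u0441\u0435\u043D\u044C\u0435\u0432\u0438\u0447 "
    rb"\u0422\u0430\u0440\u043A\u043E\u0432\u0441\u043A\u0438\u0439"
)

# Every byte is plain ASCII; no Cyrillic is stored in the file itself.
assert all(byte < 128 for byte in raw)

# Decoding the whole byte string in a single call gives the text I am after.
decoded = raw.decode("unicode_escape")
```

Here the entire line is decoded in one call rather than through a file object, so this is not the same code path as my script.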
See edit history for the original question.