There is at least one related question on SO that proved useful when trying to decode unicode sequences.
I am preprocessing a large number of texts from many different genres. Some are economic, some are technical, and so on. One of the challenges is converting Unicode escape sequences:
'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek.
Such a string needs to be converted to actual characters:
'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojtĕch Čamek.
which can be done like this:
# Raw string, so the backslashes stay literal, as they are when read from a file
s = r"'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek."
s = s.encode('utf-8').decode('unicode-escape')
(At least this works when s is an input line taken from a UTF-8 encoded text file. I can't seem to get this to work on an online service like REPL.it, where the output is encoded/decoded differently.)
In most cases, this works fine. However, when the input string contains directory paths (often the case for technical documents in my data set), UnicodeDecodeErrors occur.
Given the following data in unicode.txt:
'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek, Financial Director and Director of Controlling.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows)).
With bytestring representation of:
b"'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\\u0115ch \\u010camek, Financial Director and Director of Controlling.\r\nVoor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\\udfs\\math.dll (op Windows))."
The following script will fail when decoding the second line in the input file:
with open('unicode.txt', 'r', encoding='utf-8') as fin, open('unicode-out.txt', 'w', encoding='utf-8') as fout:
    lines = ''.join(fin.readlines())
    lines = lines.encode('utf-8').decode('unicode-escape')
    fout.write(lines)
With trace:
Traceback (most recent call last):
  File "C:/Python/files/fast_aligning/unicode-encoding.py", line 3, in <module>
    lines = lines.encode('utf-8').decode('unicode-escape')
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 275-278: truncated \uXXXX escape
Process finished with exit code 1
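The failure can be reproduced in isolation. The sketch below assumes the culprit is the Windows path on the second line, where \udfs looks like the start of a \uXXXX escape but is not followed by four hexadecimal digits. The variable names are only for illustration.

```python
# Minimal reproduction, using only the Windows path from the second input line.
# A raw string keeps the backslashes literal, just as they are in the file.
path = r"d:\udfs\math.dll"

try:
    path.encode('utf-8').decode('unicode-escape')
except UnicodeDecodeError as err:
    # "\u" must be followed by four hex digits; "dfs\" does not qualify.
    error_reason = err.reason
    print(error_reason)
else:
    error_reason = None
```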
How can I ensure that the first sentence is still 'translated' correctly, as shown before, but that the second one remains untouched? Expected output for the two lines given would thus be as follows, where the first line has changed and the second hasn't.
'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojtĕch Čamek, Financial Director and Director of Controlling.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows)).
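To make the desired behaviour concrete, here is a minimal sketch rather than a definitive solution. It assumes that only well-formed \uXXXX sequences (a backslash, a "u", and exactly four hexadecimal digits) should be converted and that everything else must stay untouched. The helper name decode_unicode_escapes and the shortened sample text are hypothetical.

```python
import re

# A backslash, a "u", and exactly four hexadecimal digits.
UNICODE_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def decode_unicode_escapes(text):
    # Convert each well-formed escape to its character; leave the rest as-is.
    return UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Shortened versions of the two input lines; raw strings keep backslashes literal.
sample = (
    r"says Vojt\u0115ch \u010camek." "\n"
    r"d:\udfs\math.dll (op Windows)"
)
result = decode_unicode_escapes(sample)
print(result)
```

This sidesteps the unicode-escape codec entirely, so \udfs is never parsed as an escape. It is not watertight: a path component that happens to start with "u" followed by four hex digits (for example \ubeef) would still be converted, and surrogate pairs are not combined.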