
There is at least one related question on SO that proved useful when trying to decode unicode sequences.

I am preprocessing a lot of texts in a lot of different genres. Some are economic, some are technical, and so on. One of the tricky steps is converting Unicode escape sequences:

'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek.

Such a string needs to be converted to actual characters:

'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojtĕch Čamek.

which can be done like this:

s = "'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek."
s = s.encode('utf-8').decode('unicode-escape')

(At least this works when s is an input line taken from a utf-8 encoded text file. I can't seem to get this to work on an online service like REPL.it, where the output is encoded/decoded differently.)
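For reference, a minimal sketch of this round trip and its main caveat (the example strings here are mine, not from the data set):

```python
# The line read from the file contains literal backslash-u sequences,
# simulated here with a raw string.
s = r"says Vojt\u0115ch \u010camek."
print(s.encode('utf-8').decode('unicode-escape'))  # says Vojtĕch Čamek.

# Caveat: unicode-escape treats the bytes as Latin-1, so genuine non-ASCII
# characters in the line get mangled.
t = "België"
print(t.encode('utf-8').decode('unicode-escape'))  # BelgiÃ«
```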

In most cases, this works fine. However, when the input string contains directory paths (often the case for the technical documents in my data set), a UnicodeDecodeError is raised.

Given the following data unicode.txt:

'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek, Financial Director and Director of Controlling.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows)).

With bytestring representation of:

b"'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\\u0115ch \\u010camek, Financial Director and Director of Controlling.\r\nVoor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\\udfs\\math.dll (op Windows))."

The following script will fail when decoding the second line in the input file:

with open('unicode.txt', 'r', encoding='utf-8') as fin, open('unicode-out.txt', 'w', encoding='utf-8') as fout:
    lines = fin.read()
    lines = lines.encode('utf-8').decode('unicode-escape')

    fout.write(lines)

With trace:

Traceback (most recent call last):
  File "C:/Python/files/fast_aligning/unicode-encoding.py", line 3, in <module>
    lines = lines.encode('utf-8').decode('unicode-escape')
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 275-278: truncated \uXXXX escape

Process finished with exit code 1
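The failure can be reproduced without the file; the path fragment looks like the start of a \uXXXX escape but runs out of hexadecimal digits at the 's':

```python
# Minimal reproduction: \udf in the path is mistaken for the start of a
# \uXXXX escape, which then turns out to be incomplete.
bad = r"d:\udfs\math.dll"
try:
    bad.encode('utf-8').decode('unicode-escape')
except UnicodeDecodeError as e:
    print(e.reason)  # truncated \uXXXX escape
```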

How can I ensure that the first sentence is still 'translated' correctly, as shown before, but that the second one remains untouched? Expected output for the two lines given would thus be as follows, where the first line has changed and the second hasn't.

'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojtĕch Čamek.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows)).
Bram Vanroy
  • So just to be clear, the `unicode.txt` file itself actually contains strings that have `\uXXXX` escape sequences, and that's not just how Python prints it? – AKX Sep 20 '18 at 12:37
  • @AKX That's right. – Bram Vanroy Sep 20 '18 at 12:41
  • I think that the most headache-free solution would be to decode the file line by line with `try: ... except UnicodeDecodeError`. Obviously this will be slower and may hide actual encoding problems. – DeepSpace Sep 20 '18 at 12:41
  • @DeepSpace I've thought about that, but that would also mean that when an 'actual' unicode sequence appears alongside a 'fake' one, the actual one is not decoded. – Bram Vanroy Sep 20 '18 at 12:42
  • @BramVanroy you mean when we have both on the same line, right? – DeepSpace Sep 20 '18 at 12:43
  • @DeepSpace Yes. In such an event, the unicode sequence will not be converted. I was thinking that perhaps there is a tool/library that can detect unicode sequences, and to only convert those rather than converting the whole line. But I haven't found such a solution yet - and this would be very slow indeed. – Bram Vanroy Sep 20 '18 at 12:45
  • But how is the program supposed to know whether it should decode `\u1234` as `\u1234` or `ሴ` then? There is no way to tell without making assumptions about the input data, right? – Vincent Sep 20 '18 at 12:47
  • @BramVanroy Then try to decode word by word ;) That's indeed an interesting problem – DeepSpace Sep 20 '18 at 12:47
  • More interestingly, how should the program know whether the `\udf` in `c:\udfs` means an escape for Unicode character 0xDF or a part of a pathname? – AKX Sep 20 '18 at 12:49
  • @Vincent Well, that is exactly the issue. As I said in a previous comment, it would be cool if there was some sort of smart library that could detect _actual_ unicode sequences. – Bram Vanroy Sep 20 '18 at 12:50
  • @AKX I guess some basic regex could keep off the most basic things? But for now, I would also be satisfied if such a case is also seen as a sequence (and thus parsed). – Bram Vanroy Sep 20 '18 at 12:53
  • Where did you get the file, such that it might contain both `r'\U0001F600'` (10 chars) and `r'c:\Users\...'`? Could you create a *minimal* file (several chars) that demonstrates the issue, then show `print(Path('broken.txt').read_bytes())` in the question along with the corresponding desired output (what you want to get in the end)? – jfs Sep 20 '18 at 18:50
  • @jfs I have no control over the file (it's externally delivered). I think my example gives a great minimal case of the issue. I added the expected result of the two input lines. – Bram Vanroy Sep 20 '18 at 20:36
  • @BramVanroy do you understand the difference between `'\uabcd'` and `r'\uabcd'`? *Both are Unicode strings.* One is a single character, the other is six characters. To avoid ambiguity, for clarity, I've asked you to show the content of the file as bytes (their repr()): `print(Path('broken.txt').read_bytes())` – jfs Sep 21 '18 at 04:19
  • Technically, the input file is mis-encoded. If Unicode escapes are used in a file, literal backslashes should be escaped (as double-backslashes) as well: `'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek.\nVoor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\\udfs\\math.dll (op Windows)).` – Mark Tolonen Sep 21 '18 at 07:41
  • @MarkTolonen Unfortunately, I don't have control over the input. – Bram Vanroy Sep 21 '18 at 07:48
  • Which is why you have to resort to regexes that can still match ambiguous paths. Your answer accounts for `c:\udfff`, but not `c:\other\udfff`, and paths could technically contain a real Unicode escape. – Mark Tolonen Sep 21 '18 at 07:50
  • @MarkTolonen Exactly. So I hope that there are some regex wizards out there that can give a little help. – Bram Vanroy Sep 21 '18 at 07:59
  • I think you'll always have a potential for ambiguity. Is `c:\other\u00f6` a two-directory path, or one directory named `c:\otherö`? You'll have to make an educated guess since the input is incorrectly encoded. – Mark Tolonen Sep 21 '18 at 08:38

3 Answers


The raw_unicode_escape codec with errors='ignore' seems to do the trick. I'm inlining the input as a raw bytes long string here, which should, by my reasoning, be equivalent to reading it from a binary file.

input = br"""
'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek, Financial Director and Director of Controlling.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows)).
"""

print(input.decode('raw_unicode_escape', 'ignore'))

outputs

'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojtĕch Čamek, Financial Director and Director of Controlling.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:s\math.dll (op Windows)).

Note that the \udf in d:\udfs gets mangled, as the codec attempts to start reading an \uXXXX sequence, but gives up at the s.
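The mangling can be checked in isolation (raw_unicode_escape only recognizes \uXXXX and \UXXXXXXXX escapes; with errors='ignore', the malformed escape bytes are simply dropped):

```python
# The bytes \, u, d, f are consumed as a (failed) escape and dropped;
# the 's' that broke the escape survives.
print(rb"d:\udfs\math.dll".decode('raw_unicode_escape', 'ignore'))  # d:s\math.dll
```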

An alternative (likely slower) would be to use a regex to find the valid Unicode escape sequences within the decoded data. This assumes that .decode()ing the full input string as UTF-8 is possible, though. (The .encode().decode() dance is needed because unicode_escape is a bytes-to-str codec, so the matched string must first be encoded back to bytes. One could also use chr(int(m.group(0)[2:], 16)).)

import re

escape_re = re.compile(r'\\u[0-9a-f]{4}')
output = escape_re.sub(lambda m: m.group(0).encode().decode('unicode_escape'), input.decode())
print(output)

outputs

'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojtĕch Čamek, Financial Director and Director of Controlling.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows)).

Since \udf doesn't have 4 hexadecimal characters, the d:\udfs path is spared here.
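As a self-contained sketch of this second approach (using the chr() variant mentioned above; assumes lowercase hex escapes, as in the sample data):

```python
import re

escape_re = re.compile(r'\\u[0-9a-f]{4}')

def decode_escapes(text):
    # Only complete \uXXXX escapes are decoded; incomplete candidates
    # like the \udfs in the path never match the regex.
    return escape_re.sub(lambda m: chr(int(m.group(0)[2:], 16)), text)

print(decode_escapes(r"Vojt\u0115ch \u010camek"))  # Vojtĕch Čamek
print(decode_escapes(r"d:\udfs\math.dll"))         # d:\udfs\math.dll (unchanged)
```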

AKX
  • This works great (and relatively fast) - but it has one large issue, namely that special characters will throw syntax errors. For instance, add `België` (Dutch for `Belgium`) to that `input` and it won't work. Therefore, I think that your second idea is the way to go. I'll take the slow speed with it. – Bram Vanroy Sep 20 '18 at 14:03
  • To be precise, `b"België"` is a syntax error because bytes literals may only contain ASCII characters. But anyway, the first solution does not decode UTF-8, therefore it is not applicable if the string contains such characters. – Janne Karila Sep 21 '18 at 06:10

The input is ambiguous; the right answer does not exist in the general case. We can only use heuristics that produce output that looks right most of the time, e.g. a rule such as: "if a \uxxxx sequence (6 chars) is part of an existing path, don't interpret it as a Unicode escape", and the same for \Uxxxxxxxx (10 chars) sequences. For example, an input similar to the one from the question, b"c:\\U0001f60f\\math.dll", can be interpreted differently depending on whether the file c:\U0001f60f\math.dll actually exists on disk:

#!/usr/bin/env python3
import re
from pathlib import Path


def decode_unicode_escape_if_path_doesnt_exist(m):
    path = m.group(0)
    return path if Path(path).exists() else replace_unicode_escapes(path)


def replace_unicode_escapes(text):
    return re.sub(
        fr"{unicode_escape}+",
        lambda m: m.group(0).encode("latin-1").decode("raw-unicode-escape"),
        text,
    )


input_text = Path('broken.txt').read_text(encoding='ascii')
hex = "[0-9a-fA-F]"
unicode_escape = fr"(?:\\u{hex}{{4}}|\\U{hex}{{8}})"
drive_letter = "[a-zA-Z]"
print(
    re.sub(
        fr"{drive_letter}:\S*{unicode_escape}\S*",
        decode_unicode_escape_if_path_doesnt_exist,
        input_text,
    )
)

Specify the actual encoding of your broken.txt file in the read_text() call if there are non-ASCII characters in the encoded text.

What specific regex to use to extract paths depends on the type of input that you get.

You could complicate the code by trying to substitute one possible Unicode sequence at a time (the number of replacements grows exponentially with the number of candidates in this case e.g., if there are 10 possible Unicode escape sequences in a path then there are 2**10 decoded paths to try).
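That exponential blow-up could be sketched like this (a hypothetical helper, not part of the answer above; it only handles \uXXXX candidates, not \UXXXXXXXX):

```python
import itertools
import re


def candidate_decodings(text):
    """Yield every combination of decoding or keeping each \\uXXXX candidate.

    With n candidates this yields 2**n strings, hence the exponential growth.
    """
    spans = list(re.finditer(r"\\u[0-9a-fA-F]{4}", text))
    for choices in itertools.product((False, True), repeat=len(spans)):
        parts, last = [], 0
        for span, decode_it in zip(spans, choices):
            parts.append(text[last:span.start()])
            esc = span.group(0)
            parts.append(chr(int(esc[2:], 16)) if decode_it else esc)
            last = span.end()
        parts.append(text[last:])
        yield "".join(parts)
```

With two candidates on a line this yields four variants, one of which is the fully decoded line and one the untouched original.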

jfs

I had already written this code when AKX posted his answer. I still think it applies.

The idea is to capture Unicode escape candidates with a regex while trying to exclude paths, i.e. candidates preceded by a drive letter and a colon (e.g. c:\udfff). If decoding fails regardless, we return the original string unchanged.

import re


def unicode_replace(s):
    # Directory paths in a text can look like unicode sequences but fail to decode, e.g. d:\udfs\math.dll
    # In case of such a failure, we pass on these matches - we don't try to decode them but leave them
    # as-is. Note that this may leave some unicode sequences alive in your text.
    def repl(match):
        match = match.group()
        try:
            return match.encode('utf-8').decode('unicode-escape')
        except UnicodeDecodeError:
            return match

    return re.sub(r'(?<!\b[a-zA-Z]:)(\\u[0-9A-Fa-f]{4})', repl, s)


with open('unicode.txt', 'r', encoding='utf-8') as fin, open('unicode-out.txt', 'w', encoding='utf-8') as fout:
    lines = fin.read().strip()
    lines = unicode_replace(lines)
    fout.write(lines)
Bram Vanroy