
I have a file that contains ASCII lines like

"\u0627\u0644\u0625\u062f\u0627"

(including the quote marks). I want to output these lines with the actual UTF-8 characters, like

"الإدا"

(These happen to be Arabic, but a solution would presumably work fine for any Unicode code points, at least in the Basic Multilingual Plane.)

If I type in an ASCII string like that to the Python3 interpreter, say

s = '"\u0627\u0644\u0625\u062f\u0627"'

and then ask Python what the value of that variable is, it displays the string in the way I want:

'"الإدا"'

But if I readline() a file containing strings like that, and write each line back out, I just get the ASCII representation back out. In other words, this code:

from sys import stdin, stdout
for s in stdin.readlines(): stdout.write(s)

just gives me back an output file identical to the input file.

How do I convert the read-in string so it writes out as the UTF-8 (not just ASCII) output, including the non-ASCII UTF-8 characters?

I know I can parse the string and handle each \uXXXX sub-string individually using regex, slices and chr(int()). But surely there is a way to use Python's built-in handling of strings represented in this way, so I don't have to parse the strings myself, not to mention being faster. (And yes, if there are improperly represented \u strings in the input, I can deal with the resulting error msgs.)
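For reference, the manual fallback the question describes (regex plus chr(int())) can be sketched like this; `unescape_manually` is a hypothetical name, and the sketch handles only well-formed four-digit \uXXXX escapes:

```python
import re

# Minimal sketch of the manual approach: find each \uXXXX escape
# and replace it with the corresponding character via chr(int(..., 16)).
def unescape_manually(line):
    return re.sub(r'\\u([0-9a-fA-F]{4})',
                  lambda m: chr(int(m.group(1), 16)),
                  line)

print(unescape_manually(r'"\u0627\u0644\u0625\u062f\u0627"'))  # "الإدا"
```

This is exactly the per-escape parsing the question hopes to avoid; the built-in codec route below does the same job in one call.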

Mike Maxwell
    Try `ast.literal_eval`. – Michael Butscher May 16 '23 at 18:16
  • Does setting the stdout encoding to UTF-8 help? https://stackoverflow.com/a/52372390/765091 – slothrop May 16 '23 at 18:18
  • @slothrop: No, the output encoding already is set to UTF-8, but just to be sure I tried the reconfigure('utf-8') things and I get the same result. I think the problem with that solution is that ASCII *is* UTF-8. – Mike Maxwell May 16 '23 at 18:31
  • Does this answer your question? [Not knowing a whole unicode character python](https://stackoverflow.com/questions/69638558/not-knowing-a-whole-unicode-character-python) – JosefZ May 16 '23 at 18:40
  • Does your file happen to be a JSON file? – Mark Tolonen May 16 '23 at 18:58
  • @MichaelButscher: Yes, this works. I wrapped the output of ast.literal_eval() in str(), since in some cases it was creating a Python dict. Feel free to make your comment an answer, although I'll wait a day or so to accept an answer. (I feel like there's some trivially easy way to do this, without calling a library, but haven't found one.) – Mike Maxwell May 16 '23 at 19:01
  • @MarkTolonen: That is one use case, but I'd like this to work even if the input is not JSON(L). For JSON input, there's probably a simple way to do this using the json library, right? – Mike Maxwell May 16 '23 at 19:26
  • Taking @MarkTolonen's answer below – Mike Maxwell May 16 '23 at 19:35
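Following up on the suggestions in the comments above: if a line is a valid Python literal, `ast.literal_eval` decodes the escapes, and if it is valid JSON, `json.loads` does the same and additionally merges surrogate pairs. A brief sketch:

```python
import ast
import json

line = r'"\u0627\u0644\u0625\u062f\u0627"'

# ast.literal_eval parses the line as a Python string literal,
# so the \uXXXX escapes are decoded (the quotes are consumed).
print(ast.literal_eval(line))   # الإدا

# json.loads parses it as a JSON string; it also combines surrogate
# pairs such as \ud83c\uddfa into real code points (here U+1F1FA).
print(json.loads(r'"\ud83c\uddfa\ud83c\uddf8"'))
```

Note that `ast.literal_eval`, like Python string literals generally, will happily produce lone surrogates, which then fail on output; `json.loads` avoids that for paired surrogates but requires the input to be valid JSON.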

1 Answer


To convert a string of that content, encode as ASCII first to create a byte string, then decode with the 'unicode-escape' codec:

s = r'"\u0627\u0644\u0625\u062f\u0627"'
print(s)
print(s.encode('ascii').decode('unicode-escape'))

Output:

"\u0627\u0644\u0625\u062f\u0627"
"الإدا"

Writing and reading a file that way:

with open('file.txt', 'w', encoding='unicode-escape') as f:
    f.write('"\u0627\u0644\u0625\u062f\u0627"')

with open('file.txt', 'r', encoding='unicode-escape') as f:
    print(f.read())

Content of file:

"\u0627\u0644\u0625\u062f\u0627"

Output:

"الإدا"
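The same codec also fits the line-by-line filter shape from the question. A sketch (`unescape_line` is a hypothetical name; as discussed below, a lone surrogate in the input will raise when the result is written out):

```python
import sys

def unescape_line(line):
    # 'unicode-escape' interprets \uXXXX (and other backslash escapes)
    # in a byte string, so encode to ASCII first to get bytes.
    return line.encode('ascii').decode('unicode-escape')

# Same shape as the question's original read/write loop:
# for line in sys.stdin:
#     sys.stdout.write(unescape_line(line))

print(unescape_line(r'"\u0627\u0644\u0625\u062f\u0627"'))  # "الإدا"
```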

To also support surrogate escapes (e.g. for non-BMP characters such as emoji), the surrogates need to be converted to actual Unicode code points. The surrogatepass error handler allows that, but requires another encode/decode cycle:

s = r'"\ud83c\uddfa\ud83c\uddf8"'
print(s)
print(s.encode('ascii').decode('unicode-escape').encode('utf-16le', errors='surrogatepass').decode('utf-16le'))

Output:

"\ud83c\uddfa\ud83c\uddf8"
"🇺🇸"

Writing and reading a file that way:
with open('file.txt', 'w', encoding='unicode-escape') as f:
    f.write('"\ud83c\uddfa\ud83c\uddf8"')

with open('file.txt', encoding='unicode-escape') as f:
    data = f.read().encode('utf-16le', errors='surrogatepass').decode('utf-16le')
    print(data)
    print(ascii(data)) # To see the Unicode codepoints

Output:

"🇺🇸"
'"\U0001f1fa\U0001f1f8"'
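The two steps above can be combined into a single helper for stream use (a sketch with a hypothetical name; it assumes surrogates in the input occur only as valid high/low pairs, since a lone surrogate makes the final decode raise):

```python
def unescape(line):
    # Step 1: interpret the \uXXXX escapes (may yield lone surrogates).
    s = line.encode('ascii').decode('unicode-escape')
    # Step 2: a UTF-16 round trip with surrogatepass merges surrogate
    # pairs such as \ud83c\uddfa into real code points (here U+1F1FA).
    return s.encode('utf-16le', errors='surrogatepass').decode('utf-16le')

print(ascii(unescape(r'"\ud83c\uddfa\ud83c\uddf8"')))  # '"\U0001f1fa\U0001f1f8"'
```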
Mark Tolonen
  • Thanks, that works and is simple. I guess the trick is the 'unicode-escape' arg! – Mike Maxwell May 16 '23 at 19:36
  • I'll add for those coming here that it turns out my input file has Unicode surrogates, and Python chokes when I try to write those out. For now, I have a try...except to skip over those lines (and print out the exception). It's not clear to me whether the original file is erroneous in not having surrogate *pairs*, or whether encode...decode just doesn't handle them. – Mike Maxwell May 16 '23 at 20:06
  • Example input string that triggers this error: `\ud83c\uddfa\ud83c\uddf8` – Mike Maxwell May 16 '23 at 20:15
  • @MikeMaxwell Which part of the code triggered the error? I put that string in the file writing portion and it passed correctly on writing, but it wouldn't read it back. – Mark Tolonen May 16 '23 at 20:27
  • Here's the code I'm using where the surrogate pairs error gets triggered (having trouble with inputting code; I'm using the back-tick but my newlines get removed): `from sys import stdin, stdout` … `for sLine in stdin.readlines(): stdout.write(sLine.encode('ascii').decode('unicode-escape'))` – Mike Maxwell May 16 '23 at 20:29
  • @MikeMaxwell The surrogates can't be *displayed* as they aren't legal Unicode code points, so writing to stdout is the issue. The code above can read the file, but `print()` fails. I've updated with a solution to decode the surrogate pairs. – Mark Tolonen May 16 '23 at 20:34
  • Thanks--I'm a little confused about what I'm getting out of this. The final encoding, IIUC, should be UTF-16LE. But Linux seems to think it's UTF-8 (at least if I put a few ASCII characters on the beginning and the end, otherwise it thinks it's a "SysEX file"). Looks like I need to read up on UTF-8 vs. UTF-16... – Mike Maxwell May 16 '23 at 21:46
  • @MikeMaxwell The final encoding...isn't...it is just unencoded Unicode code points. The `ascii()` display lets you know it is U+1F1FA and U+1F1F8 ([Regional Indicators](https://www.unicode.org/charts/PDF/U1F100.pdf)). In the intermediate steps, UTF-16 normally doesn't allow surrogates to be encoded, hence the `surrogatepass`, and non-BMP code points are encoded in UTF-16 using surrogates. When decoded normally, UTF-16 converts surrogate pairs back to a code point. – Mark Tolonen May 16 '23 at 21:53
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/253691/discussion-between-mark-tolonen-and-mike-maxwell). – Mark Tolonen May 16 '23 at 21:56
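The surrogate-pair mechanics described in the last comments can be seen directly in a UTF-16 round trip of U+1F1FA:

```python
# U+1F1FA is above the BMP, so UTF-16 encodes it as the surrogate
# pair D83C DDFA; decoding the pair yields the code point back.
cp = '\U0001f1fa'
pair = cp.encode('utf-16le')
print(pair.hex())                 # 3cd8fadd (D83C DDFA, little-endian)
print(ascii(pair.decode('utf-16le')))  # '\U0001f1fa'
```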