Replace two-character unicode

Question

This should be trivial but ...! I am writing to a UTF-8 encoded file and the text includes "Côte d'Ivoire". As I understand it "ô" is U+00F4. The character displays correctly everywhere but ends up in the file as U+C3B4 which should be in the Unicode Block HANGUL_SYLLABLES ("쎴").

Any attempt to replace U+C3B4 with U+00F4 seems to change nothing - all four lines of the file below contain it.

This creates a problem because when the file is eventually written to a database it displays as "CÃ´te d'Ivoire".

Update: If I use with io.open("Test.html", "w") as f_out: below then the file contains the correct U+00F4 which displays as a "?" The final database record still displays as "CÃ´te d'Ivoire" though :-(

MWE:

from __future__ import unicode_literals

import io

line = "The current population of Côte d'Ivoire is 26,051,291"
for c in line:
    if ord(c) > 127:
        # Show each non-ASCII character with its UTF-8 bytes in hex
        print(c, c.encode('utf-8').hex())
        # Three attempts at "fixing" the character
        line1 = line.replace(u"\uC3B4", "ô")
        line2 = line.replace(c, u"\u00F4")
        line3 = line.replace(c, "ô")

#with io.open("Test.html", "w", encoding="utf-8") as f_out:
with io.open("Test.html", "w") as f_out:
    f_out.write(line+"\n")
    f_out.write(line1+"\n")
    f_out.write(line2+"\n")
    f_out.write(line3+"\n")

Hex editor:

00000000h: 54 68 65 20 63 75 72 72 65 6E 74 20 70 6F 70 75 ; The current popu
00000010h: 6C 61 74 69 6F 6E 20 6F 66 20 43 C3 B4 74 65 20 ; lation of CÃ´te 
00000020h: 64 27 49 76 6F 69 72 65 20 69 73 20 32 36 2C 30 ; d'Ivoire is 26,0
00000030h: 35 31 2C 32 39 31 0D 0A 54 68 65 20 63 75 72 72 ; 51,291..The curr
00000040h: 65 6E 74 20 70 6F 70 75 6C 61 74 69 6F 6E 20 6F ; ent population o
00000050h: 66 20 43 C3 B4 74 65 20 64 27 49 76 6F 69 72 65 ; f CÃ´te d'Ivoire
00000060h: 20 69 73 20 32 36 2C 30 35 31 2C 32 39 31 0D 0A ;  is 26,051,291..
00000070h: 54 68 65 20 63 75 72 72 65 6E 74 20 70 6F 70 75 ; The current popu
00000080h: 6C 61 74 69 6F 6E 20 6F 66 20 43 C3 B4 74 65 20 ; lation of CÃ´te 
00000090h: 64 27 49 76 6F 69 72 65 20 69 73 20 32 36 2C 30 ; d'Ivoire is 26,0
000000a0h: 35 31 2C 32 39 31 0D 0A 54 68 65 20 63 75 72 72 ; 51,291..The curr
000000b0h: 65 6E 74 20 70 6F 70 75 6C 61 74 69 6F 6E 20 6F ; ent population o
000000c0h: 66 20 43 C3 B4 74 65 20 64 27 49 76 6F 69 72 65 ; f CÃ´te d'Ivoire
000000d0h: 20 69 73 20 32 36 2C 30 35 31 2C 32 39 31 0D 0A ;  is 26,051,291..

@ben-quigley Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)] on win32 — DLyons, Jan 05 '20 at 17:10
This char has `UNICODE` number `U+00F4` but `UTF-8` code `\xC3\xB4` - `UTF-8` doesn't mean `UNICODE`. — furas, Jan 05 '20 at 17:18
@furas Thanks - that starts to explain where the problem arises. How do I fix it? — DLyons, Jan 05 '20 at 17:19
UTF-8 is the *default* encoding for text files written in Python3. Is there a reason why this doesn't do what you want? `line="The current population of Côte d'Ivoire is 26,051,291"; with open("Test.html", "w") as f_out: f_out.write(line+"\n")` — Ben Quigley, Jan 05 '20 at 17:21
You may have a problem with the program you use to display it: it may not use `UTF-8` (like the console/terminal on Windows, which uses `CP1250`), or it may not convert the data to a string at all because it always works only with bytes, like most hex editors. So there is nothing to fix in this code and file. You should rather read from the database and decode from UTF-8 to Unicode before display; it may even be decoded automatically. — furas, Jan 05 '20 at 17:23
@ben-quigley I can do this pretty much any way that works. I updated along the lines you suggested and that at least got rid of "C3B4" but the end-result ended up the same. — DLyons, Jan 05 '20 at 17:44
You are simply misunderstanding how UTF-8 works. The two bytes \xC3 \xB4 encode U+00F4, not U+C3B4. — tripleee, Jan 05 '20 at 17:46
The hex dump contains what looks like Latin-1 rendering of these code points. Maybe see also [the Stack Overflow `character-encoding` tag info page](/tags/character-encoding/info) which has a brief explanation of character encodings, and links to more resources. — tripleee, Jan 05 '20 at 17:50
@BenQuigley UTF-8 is not the default encoding for text files written in Python 3. It is platform dependent as is the value returned by `locale.getpreferredencoding(False)`. See the [open](https://docs.python.org/3/library/functions.html#open) documentation. Better to be explicit and always specify the encoding for text files. — Mark Tolonen, Jan 13 '20 at 05:10
So this was downvoted why? Perhaps because in a totally unrelated post I suggested that it was easy to do knee-jerk downvotes but much less easy to be constructive? — DLyons, Oct 18 '20 at 10:19
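
To illustrate Mark Tolonen's point above about platform-dependent defaults, here is a minimal sketch that writes the line with an explicit `encoding="utf-8"` and then reads the raw bytes back. The scratch path is hypothetical (a temporary directory, so `Test.html` is left alone), and the `cp1252` line is only an assumption about what a Windows default might produce; it is not confirmed for the asker's machine.

```python
import io
import locale
import os
import tempfile

line = "The current population of C\u00f4te d'Ivoire is 26,051,291"

# The encoding open() falls back to when none is given (platform dependent).
default_encoding = locale.getpreferredencoding(False)
print("default encoding:", default_encoding)

# Hypothetical scratch file, so the example does not overwrite Test.html.
path = os.path.join(tempfile.mkdtemp(), "Test.html")

# Be explicit about the encoding instead of relying on the default.
with io.open(path, "w", encoding="utf-8", newline="\n") as f_out:
    f_out.write(line + "\n")

# Read the file back as raw bytes, the way a hex editor would see it.
with io.open(path, "rb") as f_in:
    raw = f_in.read()

print(b"\xc3\xb4" in raw)                  # True: U+00F4 stored as C3 B4
print(raw.decode("utf-8") == line + "\n")  # True: decoding restores the text

# Assumption: a cp1252 default would store the same character as one byte.
cp1252_bytes = line.encode("cp1252")
print(b"\xf4" in cp1252_bytes)             # True
```

If the default turns out to be something like `cp1252`, that would be consistent with the update in the question, where dropping `encoding="utf-8"` produced a single `F4` byte in the file.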

score 0 · Answer 1 · answered Jan 05 '20 at 17:46


Right - you're mixing apples and oranges, i.e. Unicode codepoints (notated U+XXXX) and bytes (Pythonically notated \xXX).

>>> l = "ô"  # our text to be encoded
>>> "U+%04x" % ord(l)
'U+00f4'  # the code point (ordinal encoded in hex)
>>> l.encode("utf-8")
b'\xc3\xb4'  # the UTF-8 encoded bytes

If you're actually trying to write an UTF-8 file, then you're basically done! You're writing UTF-8, in which ô happens to be a character that's encoded into two bytes.
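
As a quick check of the difference, here is a small sketch comparing the code point U+00F4 with the code point U+C3B4 from the question. The variable names are just for illustration.

```python
# Code points and UTF-8 bytes are different notations for different things.
o_circumflex = "\u00f4"  # code point U+00F4
hangul = "\uc3b4"        # code point U+C3B4, the Hangul syllable from the question

utf8_o = o_circumflex.encode("utf-8")
utf8_hangul = hangul.encode("utf-8")

print(utf8_o.hex())       # c3b4   (two bytes)
print(utf8_hangul.hex())  # ec8eb4 (three bytes)

# Going the other way: the bytes C3 B4 decode to U+00F4, not U+C3B4.
decoded = bytes.fromhex("c3b4").decode("utf-8")
print("U+%04X" % ord(decoded))  # U+00F4
```

So a hex editor showing `C3 B4` is showing the UTF-8 bytes of U+00F4; a file that really contained U+C3B4 would show `EC 8E B4` instead.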

AKX


Thanks. But I read a UTF-8 file, extract information from it, and write the output to a UTF-8 file. Somewhere along the way the readable "ô" input automagically gets transformed to "Ã´" and I don't see where (or how to back-transform it). – DLyons Jan 05 '20 at 18:58
Then we'll need to see your reading, extraction and writing code. More likely than not you're reading an UTF-8 file as if it were ISO-8859-15 or similar, leading to that mojibake. – AKX Jan 05 '20 at 19:28
Taking out `from __future__ import unicode_literals` seems to partially help. I'm removing it right now to see if that's the core problem. – DLyons Jan 05 '20 at 19:32
If `unicode_literals` has an effect, you're using Python 2 and you really should consider upgrading to Python 3, where Unicode is stricter and less painful. – AKX Jan 05 '20 at 19:40
I'm definitely on Python 3.7.3. I only put in unicode_literals as an experiment and it seemed to cause a "?" in the test file. – DLyons Jan 05 '20 at 19:45
Hm - All literals are Unicode in Py3 unless explicitly `b""` prefixed... – AKX Jan 05 '20 at 19:58
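
The mojibake discussed in this thread can be reproduced with a short sketch: encode the text as UTF-8, then decode those bytes with a single-byte encoding. Latin-1 is used here as an assumption because it yields exactly the "Ã´" seen in the question; which step in the real pipeline performs the wrong decode is not known from the thread.

```python
original = "C\u00f4te d'Ivoire"

# Correct UTF-8 bytes, as they sit in the file.
utf8_bytes = original.encode("utf-8")

# Wrong step: a reader treats those bytes as Latin-1 (assumed encoding).
mojibake = utf8_bytes.decode("latin-1")
print(mojibake == "C\u00c3\u00b4te d'Ivoire")  # True: the "CÃ´te" form

# Back-transform: undo the wrong decode, then decode as UTF-8.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

The back-transform only works while the damage is a single wrong decode; the more robust fix is usually to find the reader that assumes the wrong encoding and tell it to use UTF-8.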
