
I use the iconv library to interface from a modern input source that uses UTF-8 to a legacy system that uses CP1252 (Windows-1252, a superset of ISO-8859-1, a.k.a. Latin-1).

The interface recently failed to convert the French string "Éducation", where the "É" was encoded as hex 45 CC 81 (the decomposed form: "E" followed by a combining acute accent). Note that the destination encoding does have an "É" character, encoded as C9.

Why does iconv fail to convert that "É"? I checked that the iconv command-line tool available with Mac OS X 10.7.3 says it cannot convert, and that the Perl iconv module fails too.

This is all the more puzzling because the precomposed form of the "É" character (encoded as C3 89) converts just fine.

Is this a bug with iconv or did I miss something?

Note that I also have the same issue if I try to convert from UTF-16 (where "É" is encoded as 00 C9 composed or 00 45 03 01 decomposed).
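For reference, the byte sequences above can be reproduced in Python 3 with the standard unicodedata module (a sketch for illustration only, not part of the failing pipeline):

```python
import unicodedata

pre = "\u00c9"                           # precomposed É: one code point, U+00C9
dec = unicodedata.normalize("NFD", pre)  # decomposed É: U+0045 + U+0301

print(pre.encode("utf-8").hex())         # c389
print(dec.encode("utf-8").hex())         # 45cc81
print(pre.encode("utf-16-be").hex())     # 00c9
print(dec.encode("utf-16-be").hex())     # 00450301

print(pre.encode("cp1252").hex())        # c9 -- the precomposed form maps fine
# dec.encode("cp1252") raises UnicodeEncodeError: U+0301 has no CP1252 mapping
```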

Neil Mayhew
Jean-Denis Muys

2 Answers


Unfortunately, iconv indeed doesn't deal with decomposed characters in UTF-8, except in the version installed on Mac OS X.

When dealing with Mac file names, you can use iconv with the "utf8-mac" character set option. It also takes into account a few idiosyncrasies of the Mac decomposed form.

However, non-mac versions of iconv or libiconv don't support this, and I could not find the sources used on Mac which provide this support.

I agree with you that iconv should be able to deal with both the NFC and NFD forms of UTF-8, but until someone patches the sources, we have to detect the decomposed form manually and normalize it before passing text to iconv.

Faced with this annoying problem, I used Perl's Unicode::Normalize module as suggested by Jukka.

#!/usr/bin/perl
# Recompose decomposed (NFD) UTF-8 input into precomposed (NFC) form,
# so that iconv can then convert it to CP1252.
# Typical usage (script and file names illustrative):
#   perl nfc.pl < input-utf8.txt | iconv -f UTF-8 -t CP1252 > output.txt

use strict;
use warnings;
use Encode qw/decode_utf8 encode_utf8/;
use Unicode::Normalize;

while (<>) {
    # decode the raw bytes to characters, recompose, re-encode as UTF-8
    print encode_utf8( NFC(decode_utf8($_)) );
}
Sebastian
mivk

Use a normalizer (in this case, to Normalization Form C) before calling iconv.

A program that deals with character encodings (different representations of characters or, more exactly, code points, as sequences of bytes) and converts between them should be expected to treat precomposed and decomposed forms as distinct. The decomposed "É" is two code points and as such distinct from the precomposed "É", which is one code point.
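The normalization step can be sketched in Python 3 with the standard unicodedata module (string literals here are illustrative):

```python
import unicodedata

decomposed = "E\u0301ducation"   # "É" as two code points: U+0045 U+0301
nfc = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(nfc))   # 10 9 -- NFC merges the pair into U+00C9
print(nfc.encode("cp1252").hex())  # c96475636174696f6e -- now representable
```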

Jukka K. Korpela
  • Thanks. That doesn't answer the question why iconv does map the precomposed character to the destination encoding, but not the (admittedly distinct) decomposed character. Why not both? Why not the latter instead of the former? For a conversion tool/library, that is a failure, if not a bug. – Jean-Denis Muys Mar 28 '12 at 07:23
  • @Jean-Denis Muys, because the precomposed form is one Unicode character, which is representable in the target encoding according to mapping tables, whereas the decomposed form is two Unicode characters, and the latter is not representable in windows-1252 (CP1252). The correspondence between these forms does not exist at the level of character encodings; it is a higher-level protocol issue (and it is an equivalence of a specific kind, not identity). – Jukka K. Korpela Mar 28 '12 at 10:46
  • You are factually incorrect. There is no reason for not mapping a decomposed character into its CP-1252 equivalent. Whether "É" is using one representation or the other, it can - and should - be mapped to the CP-1252 "É" character. – Jean-Denis Muys Mar 28 '12 at 15:23
  • @Jean-Denis Muys, character encoding conversion operates on the encodings of characters, not the character properties. There is no law against making an encoding converter perform other functions as well, but it would not be modern modular design. – Jukka K. Korpela Mar 28 '12 at 18:02
  • What I suggest would still be "operating on the encodings of characters". This would not be another function. The fact is the **same** character ("É") has two representations in UTF-8, and iconv only converts one. – Jean-Denis Muys Mar 29 '12 at 14:57
  • Also note that this is not because those two representations use a different number of bytes, since iconv is quite happy to convert other characters that may be represented with 1, 2, 3 (and perhaps more) bytes. – Jean-Denis Muys Mar 29 '12 at 15:00
  • Perhaps the disagreement lies in what is the definition of "**same**" characters. In my opinion (and in the opinion of everybody I have been able to ask), "É" and "É" are the same character, however they may be encoded. – Jean-Denis Muys Mar 29 '12 at 15:04
  • Regardless of opinions, character encodings are representations of coded characters, and “É” U+00C9 is one coded character, whereas “É” U+0045 U+0301 is two coded characters. Whether they look the same (they may, or may not, depending on software), whether they behave identically in string data processing (they may, or may not), and whether there is a mapping called canonical equivalence between them are outside the scope of encodings and transcoding. – Jukka K. Korpela Mar 29 '12 at 16:15
  • "U+0045 U+0301 is two coded characters": I question that assertion. Do you have evidence? I don't see any reason why this shouldn't be considered as *one* character. If you want to represent the two "E" and "´" characters in Unicode, you would use U+0045 U+00B4, not U+0045 U+0301. That's precisely the point of Combining Diacritical Marks: to construct *one* character from two (or more). – Jean-Denis Muys Apr 01 '12 at 15:48
  • @Jean-Denis Muys, please consult the Unicode Standard for relevant definitions. – Jukka K. Korpela Apr 01 '12 at 16:07
  • I did. Unicode defines a character as "the smallest interpretable unit of stored text". A Combining Acute Accent is not interpretable without the element of text to which it applies (the preceding base character). Therefore a Combining Acute Accent is not a character. And in my context, the sequence U+0045 U+0301 *is* indeed the *smallest interpretable unit of stored text*. It is therefore but *one* character. – Jean-Denis Muys Apr 01 '12 at 16:23
  • The Unicode Standard is vague and messy around the character concept but not about this: “When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.” That’s the key here. It’s the (en)coded characters that character encodings and transcodings deal with. – Please note that this is a Q & A forum, not a discussion or debate forum. If you don’t want to accept the correct answer, that’s your privilege. I have already commented more than enough to explain why it is the correct answer. Let’s end this now. – Jukka K. Korpela Apr 02 '12 at 05:16
  • OK, I'll drop it. But you haven't provided any convincing evidence that you are right. The quote you give isn't. The reciprocal would be a bit more. On the other hand, the quote I provided seems stronger in favor of my interpretation, which by the way, seems common sense: "É" is the same as "É". Of course, common sense is often anything but. Thanks for your opinion, I simply don't think it's valid. – Jean-Denis Muys Apr 02 '12 at 15:46
  • Other than arguments about characters vs code points, I think if you look at the big picture, it's hard to argue that iconv's behavior is maximally *useful*. Jukka Korpela talks about "modern modular design", but I'm not convinced that there's any practical benefit to how iconv works as present. If it handled decomposed characters, it would be strictly more useful -- and obey the Principle of Least Surprise, to boot! – Matt R Jul 22 '14 at 11:19
  • I know this is an old discussion but I want to add: The composed item (One Unicode **codepoint**: `printf '\uc9\n'` which uses one byte `\xc9` in ISO-8859-1 and two bytes `\xc3\x89` in UTF-8) and the decomposed item (Two **codepoint**s `printf '\u45\u301'` which use three bytes `\x45\xcc\x81` and has no representation in ISO-8859-1) are both presented with the same **Glyph** in a text editor ([*A glyph is an individual character. It might be a letter, an accented letter, a ligature, a punctuation mark, a dingbat, etc.*](https://graphicdesign.stackexchange.com/a/45164)) (Cont..) –  Oct 13 '18 at 16:17
  • (Cont...) In short: "an **image** of a character in layman terms. A "glyph" might be an emoticon (which I find difficult to call a "character"), but is called "a character" in typography and font parlance. And, also, a glyph could be two "characters" like the **ligature** of `fi`. That one "glyph" has two numeric codes (and sometimes more, the single Unicode code-point `U-1F82` or `printf '\U1F82'` has one "letter" and three diacritical that may appear in several orders) is a quirk of our computer representation of human language. (Cont..) –  Oct 13 '18 at 17:08
  • (Cont..) A **character** is not equivalent to a Unicode **codepoint**, which is also not (exactly) the same as a **glyph**. The important problem this causes is that two filenames that appear to be visually the same are not the same sequence of bytes. A computer will differentiate between a composed and a decomposed form. Even if the glyphs "look" the same. It is, therefore, incorrect that iconv should convert forms by default. If you need to convert between the (four NFC NFD NFKC NFKD) composition forms use [Uconv](https://packages.debian.org/stretch/icu-devtools-dbg) like `uconv -x any-nfc`. –  Oct 13 '18 at 17:34
  • @Jean-DenisMuys *"É" and "É" are the same character, however they may be encoded* No, they are not. They are the same **glyph** (image, font point) but not the same Unicode **codepoint**, and may be several distinct "characters" depending on the code-page used. –  Oct 13 '18 at 17:40
  • This pedantic word salad is ignoring many things. For example that I was not in the context of file names (and BTW macOS will not let me create two files where the names differ only with "É" and "É", clearly considering the two "É" the same *character*). Also when the French text I needed to convert has the word "Éducation", ask any French linguist whether it's not the same as "Éducation". I.e. you ignore that text encodings are mostly used to represent human languages (and not only file names). Also you totally ignore the definition of "character" by Unicode itself. – Jean-Denis Muys Oct 14 '18 at 20:03
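To ground the equivalence debate in the comments above: Python's unicodedata module (used here purely as an illustration) shows that the two forms compare unequal as code-point sequences, yet are canonically equivalent under every normalization form:

```python
import unicodedata

pre, dec = "\u00c9", "E\u0301"   # precomposed vs decomposed "É"

print(pre == dec)                # False: different code-point sequences
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    # prints True for all four forms: the strings are canonically equivalent
    print(form, unicodedata.normalize(form, pre)
                == unicodedata.normalize(form, dec))
```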