Why is iconv in Linux not converting spanish char in UTF-8 to ISO-8859-1 correctly

Question

In Linux, I am converting a UTF-8 file to ISO-8859-1 using the following command:

iconv -f UTF-8 -t ISO-8859-1//TRANSLIT input.txt > out.txt

After the conversion, when I open out.txt, ¿Quién Gómez has become ¿Quien Gomez.

Why are é, ó, and other accented characters not converted correctly?

It works correctly for me. Out of curiosity, what happens if you drop the `//TRANSLIT`? — Keith Thompson, Aug 01 '14 at 18:29
Thank you for your comment. If I drop the //TRANSLIT, I get the error iconv:illegal input sequence at position 7, and it stops at ¿Quie in the out.txt. What am I doing wrong? I use fedora13 and it says LANG=en_US.utf8 when I type locale. Thank you. — user1026669, Aug 01 '14 at 18:35
Are you sure the input file is UTF-8 encoded? What does `file input.txt` say? — Keith Thompson, Aug 01 '14 at 18:43
Actually, it is the output of Oracle sqlplus run in batch mode with `export NLS_LANG=AMERICAN_AMERICA.AL32UTF8`. When I run `file -bi` on the sqlplus output file, it says charset=utf-8, so I use UTF-8 in the iconv command. iconv has no AL32UTF8 option. Do you think that is the reason? Thank you. — user1026669, Aug 01 '14 at 18:47
$ od -c input.txt 0000000 302 277 Q u i e 314 201 n G o 314 201 m 0000020 e z \n 0000023 — user1026669, Aug 01 '14 at 18:55
I had never heard of AL32UTF8 before. [Apparently](http://oracleappstechnology.blogspot.com/2007/10/difference-between-utf8-and-al32utf8.html) it differs from UTF-8 in its handling of supplementary characters. Hmm, in `input.txt`, is `é` represented as the UTF-8 2-byte sequence for `U+00E9`, or as a sequence of `e` with a combining character representing the accent? Ok, it looks like the input has an unaccented letter `e` followed by what's probably `U+0301`, COMBINING ACUTE ACCENT. — Keith Thompson, Aug 01 '14 at 18:56
I showed the od output. Does this help or do you need others to see better? For é, it has e followed by 314 201 integer. — user1026669, Aug 01 '14 at 19:01
Are you sure you need ISO-8859-1 output? Why can't you just keep it in UTF-8 form? (This isn't to imply that you don't have a valid reason, I'm just wondering what it is; UTF-8 is preferable for most purposes.) — Keith Thompson, Aug 01 '14 at 19:05
BTW, one web page says: "the only difference between AL32UTF8 and UTF8 character sets is that AL32UTF8 stores characters beyond U+FFFF as four bytes (exactly as Unicode defines UTF-8). Oracle’s “UTF8” stores these characters as a sequence of two UTF-16 surrogate characters encoded using UTF-8 (or six bytes per character). Besides this storage difference, another difference is better support for supplementary characters in AL32UTF8 character set." — user1026669, Aug 01 '14 at 19:06
Keith, thank you for the comment. I need to convert to ISO-8859-1 because I use the a2ps command to convert the Oracle report output to a PDF file in batch mode and put it on the web. a2ps does not understand UTF-8, so I have to convert the report output, which contains Spanish characters, to ISO-8859-1 with iconv. I think I tried the enscript command too, but without success. As you mentioned, the cause of the problem may be the difference between AL32UTF8 and UTF8. I am not familiar with character encodings or with Spanish. Do you have any other suggestions? Thank you very much. — user1026669, Aug 01 '14 at 19:12
Consider posting a more specific question about how to convert UTF-8 with combining characters to equivalent UTF-8 (Or Latin-1) without combining characters. Give the example of converting `'e'` followed by `U+0301` COMBINING ACUTE ACCENT to `'é'` `U+00E9` LATIN SMALL LETTER E WITH ACUTE. Be sure to mention that you know such a conversion isn't possible in all cases. (If you can convert UTF-8 to UTF-8 in this way, you can then convert the resulting UTF-8 to Latin-1.) (Or look into how to get Oracle to create output without unnecessary combining characters.) — Keith Thompson, Aug 02 '14 at 21:38

score 2 · Answer 1 · answered Aug 01 '14 at 19:03

There are (at least) two ways to represent the accented letter é in Unicode: as a single code point U+00E9, LATIN SMALL LETTER E WITH ACUTE, and as a two-character sequence e (U+0065) followed by U+0301, COMBINING ACUTE ACCENT.
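As an illustration (a small Python sketch, not something from the original thread), the two representations can be compared directly. The variable names are only for this example. The decomposed form encodes to the bytes `0xCC 0x81` after the `e`, which matches the `314 201` (octal) shown in the `od -c` dump in the comments on the question.

```python
import unicodedata

# Two ways to write "é" in Unicode (names here are illustrative only).
composed = "\u00e9"      # single code point
decomposed = "e\u0301"   # base letter + combining accent

print(unicodedata.name(composed))                     # LATIN SMALL LETTER E WITH ACUTE
print([unicodedata.name(c) for c in decomposed])      # letter e, then COMBINING ACUTE ACCENT

# UTF-8 bytes of each form; 0xCC 0x81 is 314 201 in octal.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")
print(composed_bytes)     # b'\xc3\xa9'
print(decomposed_bytes)   # b'e\xcc\x81'

# The strings differ code point by code point, even though they render alike.
print(composed == decomposed)   # False
```

Both strings display as é, but only the first one has a direct one-byte equivalent in ISO-8859-1.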

Your input file uses the latter encoding, which iconv apparently is unable to translate to Latin-1 (ISO-8859-1). With the //TRANSLIT suffix, it passes through the unaccented e unmodified and drops the combining character.

You'll probably need to convert the input so it doesn't use combining characters, replacing the sequence U+0065 U+0301 with the single code point U+00E9 (two bytes in UTF-8, one byte in Latin-1). Either that, or arrange for whatever generates your input file to use the precomposed form in the first place.

So that's the problem; I don't currently know exactly how to correct it.
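One possible approach, offered only as a sketch: Unicode normalization form NFC replaces a base letter plus combining mark with the precomposed code point wherever one exists, and Python's standard `unicodedata` module implements it. The function name `utf8_to_latin1` is hypothetical, and the sample bytes are modeled on the `od -c` dump from the question's comments (assuming a single space between the two words).

```python
import unicodedata

def utf8_to_latin1(data: bytes) -> bytes:
    """Decode UTF-8, compose combining sequences (NFC), encode as ISO-8859-1.

    Raises UnicodeEncodeError if a character still has no Latin-1 equivalent
    after composition (for example, a letter outside the Latin-1 repertoire).
    """
    text = data.decode("utf-8")
    composed = unicodedata.normalize("NFC", text)
    return composed.encode("iso-8859-1")

# "¿Quién Gómez" in decomposed form: e and o each followed by 0xCC 0x81.
sample = b"\xc2\xbfQuie\xcc\x81n Go\xcc\x81mez\n"
converted = utf8_to_latin1(sample)
print(converted)   # b'\xbfQui\xe9n G\xf3mez\n'
```

This only helps for characters that have a precomposed form in Latin-1; anything else still raises an error, which seems preferable to silently dropping accents.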

Keith Thompson

Keith, Thank you very much for your answer. Since I am not familiar with char encoding, I have to study your suggestions to see if I can find the proper solution. BTW, is it possible to mechanically change the char sequence encoding to another encoding using sed or other command? Thank you. – user1026669 Aug 01 '14 at 19:23
@user1026669: I have little doubt that there's a way to automatically convert UTF-8 using combining characters to equivalent UTF-8 that doesn't. (Note that there can be multiple combining characters on a single prefix character, such as an `e` with multiple accents; such a conversion couldn't handle those cases). I just don't happen to know how to do it. I'm sure others do. – Keith Thompson Aug 01 '14 at 19:27
Relevant but probably not immediately useful: http://stackoverflow.com/q/6936390/827263 – Keith Thompson Aug 01 '14 at 20:13
Keith, you are right. I found the answer from Oracle Community – user1026669 Aug 06 '14 at 19:32
@user1026669: I'm right about what? Did you find a way to avoid generating the combining characters? What is it? – Keith Thompson Aug 06 '14 at 19:36
See below for the details. I could not put them in the previous comments because of the length. – user1026669 Aug 06 '14 at 19:39

score 1 · Answer 2 · answered Aug 06 '14 at 19:40

Keith, you are right. I found the answer from Sergiusz Wolicki at Oracle Community.
Here I am quoting his answer word for word. I am posting it for anyone else who may have this problem.

"The problem is that your data is stored in the Unicode decomposed form, which is legal but seldom used for Western European languages. 'é' is stored as 'e' (0x65=U+0065) plus combining acute accent (0xcc, 0x81 = U+0301). Most simple conversion tools, including standard Oracle client/server conversion, do not account for this and do not convert a decomposed characters into a pre-composed character from ISO 8859-1. They try to convert each of the two codes independently, yielding 'e' plus some replacement for the accent character, which does not exist in ISO 8859-1. You see the result correctly in SQL Developer because there is no conversion involved and SQL Developer rendering code is capable of combining the two codes into one character, as expected.

As 'é' and 'ó' have pre-composed forms available in both Unicode and ISO 8859-1, the work around is to add COMPOSE function to your query. Thus, set NLS_LANG as I advised previously and add COMPOSE around column expressions to your query."

Thank you very much, Keith
