
I’m trying to remove accented characters (CAFÉ -> CAFE) while keeping all the Chinese characters, using a command. Currently I’m using iconv to remove the accented characters, but it turns out that all the Chinese characters come out as “?????”. I can’t figure out a way to keep the Chinese characters in an ASCII-encoded file at the same time.
How can I do so?

iconv -f utf-8 -t ascii//TRANSLIT//IGNORE -o converted.bin test.bin

  • Maybe I misunderstood your question, but Chinese characters are not contained in [ASCII](https://en.wikipedia.org/wiki/ASCII), so you cannot show them from an ASCII-encoded file. Your file should be [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoded. Using [Base64](https://en.wikipedia.org/wiki/Base64) you can encode UTF-8 text in an ASCII-safe format, but you need to decode it to make it readable again. – ChristianB Dec 30 '20 at 11:49

2 Answers


There is no way to keep Chinese characters in a file whose encoding is ASCII; this encoding only covers the code points between NUL (0x00) and DEL (0x7F), which basically means the control characters plus unaccented English letters, digits, and punctuation. (Look at the ASCII chart for an enumeration.)
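To make this concrete, here is a quick Python sketch (not part of the original answer) showing that a Chinese character's code point lies far above the ASCII limit, so encoding it as ASCII must fail:

```python
# Every ASCII character has a code point <= 0x7F (DEL).
for ch in "C仿":
    status = "inside" if ord(ch) <= 0x7F else "outside"
    print(f"U+{ord(ch):04X}: {status} the ASCII range")

# Forcing a Chinese character into ASCII raises an error;
# iconv's //TRANSLIT//IGNORE papers over the same failure with "?".
try:
    "仿".encode("ascii")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```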

What you appear to be asking is how to remove accents from European alphabetics while keeping any Chinese characters intact in a file whose encoding is UTF-8. I believe there is no straightforward way to do this with iconv, but it should be fairly easy to come up with a one-liner in a language with decent Unicode support, like perhaps Perl.

bash$ python -c 'print("\u4effCaf\u00e9\u9f00")' >unizh.txt
bash$ cat unizh.txt
仿Café鼀
bash$ perl -CSD -MUnicode::Normalize -pe '$_ = NFKD($_); s/\p{M}//g' unizh.txt 
仿Cafe鼀

Maybe add the -i option to modify the file in-place; this simple demo just writes out the result to standard output.

This has the potentially undesired side effect of normalizing each character to its NFKD form.
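For example (a small Python sketch using the standard unicodedata module, which exposes the same normalization forms as Perl's Unicode::Normalize), NFKD not only splits accented letters into base + combining mark, it also expands compatibility characters such as ligatures and superscripts, which a pure accent-stripping pass might not want:

```python
import unicodedata

# NFKD decomposes accented letters: é -> e + U+0301 (combining acute)
print([hex(ord(c)) for c in unicodedata.normalize("NFKD", "é")])  # ['0x65', '0x301']

# ...but it also rewrites compatibility characters:
print(unicodedata.normalize("NFKD", "ﬁ"))  # ligature fi -> "fi"
print(unicodedata.normalize("NFKD", "²"))  # superscript two -> "2"
```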

Code inspired by Remove accents from accented characters; the Chinese characters to test with were gleaned from What's the complete range for Chinese characters in Unicode? (the ones on the boundary of the range are not particularly good test cases, so I just guessed a bit).

tripleee
  • Unfortunately, I'm not familiar enough with Chinese to know if there is a risk that some characters could also be decomposed into a normal form where some of the components are members of the `\p{M}` class. If you have a corpus of real Chinese text, test with that and examine the differences. – tripleee Dec 30 '20 at 13:52

The iconv tool is meant to convert the way characters are encoded (i.e. saved to a file as bytes). By converting to ASCII (a very limited character set that contains the numbers, some punctuation, and the basic alphabet in upper and lower case), you can save only the characters that can reasonably be matched to that set. So an accented letter like É gets converted to E because that's a reasonably similar ASCII character, but a Chinese character like 公 is so far away from the ASCII character set that only question marks are possible.
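The same behaviour can be reproduced in Python (a sketch, not part of iconv itself): every code point outside the ASCII repertoire gets replaced, and unlike iconv's //TRANSLIT, Python's plain "replace" error handler does not even try to find a lookalike for the é:

```python
text = "Café公"
# Each code point above 0x7F becomes "?", accented and Chinese alike.
print(text.encode("ascii", errors="replace"))  # b'Caf??'
```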

The answer by tripleee is probably what you need. But if the conversion to NFKD form is a problem for you, an alternative is using a direct list of characters you want to replace:

sed 'y/áàäÁÀÄéèëÉÈË/aaaAAAeeeEEE/' <test.bin >converted.bin

where you need to list the original characters and their replacements in the same order. Obviously it is more work, so do this only if you need full control over what changes you make.
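The same explicit-mapping idea can be written in Python with str.translate (a sketch using the same, deliberately incomplete, character list as the sed example). Characters absent from the table, including all Chinese ones, pass through untouched:

```python
# Build a one-to-one mapping; both strings must list characters
# in the same order, just like sed's y/// command.
table = str.maketrans("áàäÁÀÄéèëÉÈË", "aaaAAAeeeEEE")
print("CAFÉ 公司".translate(table))  # CAFE 公司
```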

Jim Danner
  • This assumes that you have an exhaustive list of accented characters (which may very well be the case for a constrained or normalized input - say, originally using only the limited Latin-1 repertoire of precomposed accented characters - but a daunting task in the general case). The list in this answer doesn't even cover that (it lacks the accented vowel variants of i, o, u, and y, as well as accented consonants like ñ and ç, and some variants like å and æ). – tripleee Dec 30 '20 at 15:12
  • @tripleee Indeed, so its use is different than your answer (as stated in the accompanying text) – Jim Danner Dec 30 '20 at 15:20
  • I'm pointing out some reason why this code might not work even if it's superficially simple and attractive. Another concern is whether your `sed` supports Unicode, including normalization. There are multiple ways to write a glyph like ã (which is also missing from your enumeration): either as a precomposed single glyph, or as a base *a* followed by a combining *~*. Unfortunately, `sed` implementations vary wildly in these aspects, whereas e.g. Perl has stably supported Unicode for almost two decades. – tripleee Dec 30 '20 at 15:45
  • For what it's worth, `y/ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöùúûüýÿ/AAAAAACEEEEIIIINOOOOOUUUUYaaaaaaceeeeiiiinooooouuuuyy/` should get you all the accented Latin-1 characters. This still doesn't include e.g. æ, which isn't properly accented at all. – tripleee Dec 31 '20 at 05:41
  • Then, there is [tr(1)](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/tr.html). – greybeard Dec 31 '20 at 09:23