Convert file from Little-endian UTF-16 Unicode English text, with CRLF line terminators to Ascii encoding

Question

A big thanks to everyone who helped me with my previous questions. I'm sure somebody has asked a similar question before, but here is mine.

My file is identified as "Little-endian UTF-16 Unicode English text, with CRLF line terminators", but that does not meet our file standards. Normally the files here are identified as "ASCII English text". How do I convert my file to that?

I used `iconv -f UTF-16LE -t UTF-8 myfile.dat -o myfile.dat_test`, but that turns the whole file into "UTF-8 Unicode (with BOM) English text, with CRLF line terminators". I'm not sure what is going wrong.

UTF8 will be exactly equivalent to ASCII if all the characters are within the ASCII range (`<= 127 or 0x7f`). If your UTF-16 contains characters whose UTF8 encoding is more than one byte, you need another plan. In any case, this may be useful: https://zzz.buzz/2016/07/30/bom-in-iconv/ — jas, Sep 26 '17 at 11:28
Maybe a better question is, why are you telling `iconv` to convert to UTF-8 if you want ASCII? — jas, Sep 26 '17 at 11:30
Then how do I convert to ASCII? I found a misleading thread which led me to the above iconv command. — mac_online, Sep 26 '17 at 12:43
If you run `iconv -l` you'll see all of the possible encodings. If you do `iconv -l | grep ASCII` you'll see all the ascii-related ones. Probably you'll see `US-ASCII`, in which case you can change your command to `iconv -f UTF-16LE -t US-ASCII ...`, e.g. — jas, Sep 26 '17 at 13:06
I tried `iconv -f UTF-16LE -t ASCII xyz.dat -o xyz.dat_test`, but I get this error: "iconv: illegal input sequence at position 0" — mac_online, Sep 26 '17 at 13:27
That's probably the BOM of your input file. If so, try removing it. — jas, Sep 26 '17 at 13:57

ruakh · Answer 1 · 2023-05-28T12:33:29.203

The issue here is that the BOM is a feature of 'UTF-16', not of 'UTF-16LE'.

Per http://unicode.org/faq/utf_bom.html#gen7:

The BE form uses big-endian byte serialization (most significant byte first), the LE form uses little-endian byte serialization (least significant byte first) and the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.

Note that the option to include a byte order mark applies only to "the unmarked form", meaning 'UTF-16'.

So when you tell iconv that the source encoding is 'UTF-16LE', and then the input starts with FF FE, iconv doesn't interpret the FF FE as a redundant indication of the byte order; rather, it interprets it as U+FEFF ZERO WIDTH NO-BREAK SPACE, and tries to copy that character to the output.

You can fix that by telling iconv that the source encoding is 'UTF-16'; then, when it sees that the input starts with FF FE, it will interpret it as a byte order mark, remove it, and interpret the rest of the input as little-endian.
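To see the difference concretely, here is a minimal sketch. The file name `sample.dat` and its contents are made up for illustration, and the byte values in the comments assume glibc or GNU libiconv behaviour.

```shell
# Build a tiny sample: BOM (FF FE), then "Hi" and CRLF encoded as UTF-16LE.
printf '\377\376H\000i\000\r\000\n\000' > sample.dat

# Declared as UTF-16LE: the leading FF FE is treated as the character U+FEFF
# and carried over, so the output starts with its UTF-8 form.
# bytes: ef bb bf 48 69 0d 0a
iconv -f UTF-16LE -t UTF-8 sample.dat | od -An -tx1

# Declared as UTF-16: the leading FF FE is consumed as a byte order mark.
# bytes: 48 69 0d 0a
iconv -f UTF-16 -t UTF-8 sample.dat | od -An -tx1
```

The `ef bb bf` prefix in the first dump is what `file` reports as "UTF-8 Unicode (with BOM)".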

So, change this:

iconv -f UTF-16LE -t UTF-8 myfile.dat -o myfile.dat_test

to this:

iconv -f UTF-16 -t US-ASCII myfile.dat -o myfile.dat_test

(Note: I've also changed the 'UTF-8' to 'US-ASCII', so that if there are any non-ASCII characters you'll get an explicit error instead of bad output.)
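As a rough sketch of how the corrected command might be wrapped with an explicit failure check: the sample contents and the second file, `nonascii.dat`, are invented for this example, and the output is redirected instead of using `-o` so it also works with iconv builds that lack that option.

```shell
# Sample input, made up for this sketch: BOM (FF FE), then "Hi" and CRLF in UTF-16LE.
printf '\377\376H\000i\000\r\000\n\000' > myfile.dat

# Redirect stdout rather than relying on -o, which not every iconv build accepts.
if iconv -f UTF-16 -t US-ASCII myfile.dat > myfile.dat_test; then
    od -An -c myfile.dat_test   # plain ASCII bytes: H i \r \n
else
    echo "conversion failed: does myfile.dat contain non-ASCII characters?" >&2
    rm -f myfile.dat_test
fi

# A second made-up file holding U+00E9 (e with acute), which US-ASCII cannot represent.
printf '\377\376\351\000' > nonascii.dat
if ! iconv -f UTF-16 -t US-ASCII nonascii.dat > /dev/null 2>&1; then
    echo "non-ASCII input rejected, as intended"
fi
```

Note that iconv only changes the character encoding; the CRLF line terminators are still present in the converted file.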

On Windows at least, the version of this program from Cygwin and Strawberry Perl does not support the `-o` option, so I will just redirect the output instead. — Pysis, Feb 15 '23 at 18:57
@Pysis: Yup, makes sense! The OP used `-o`, so I just left that in place to avoid making unnecessary changes. — ruakh, Feb 15 '23 at 21:38
