
A big thanks to everyone who helped me with my previous questions. I'm sure somebody has asked something similar before, but here is my question.

My file is reported as "Little-endian UTF-16 Unicode English text, with CRLF line terminators", but that encoding doesn't work with our file standards. Normally the files here are reported as "ASCII English text". How do I convert it to that?

I used `iconv -f UTF-16LE -t UTF-8 myfile.dat -o myfile.dat_test`, but that turns the whole file into "UTF-8 Unicode (with BOM) English text, with CRLF line terminators", and I'm not sure what is going wrong.

jas
mac_online
  • Is everything fine except that you don't want the BOM? – jas Sep 26 '17 at 10:39
  • ideally it has to be ASCII English text – mac_online Sep 26 '17 at 11:21
  • UTF8 will be exactly equivalent to ASCII if all the characters are within the ASCII range (`<= 127 or 0x7f`). If your UTF-16 contains characters whose UTF8 encoding is more than one byte, you need another plan. In any case, this may be useful: https://zzz.buzz/2016/07/30/bom-in-iconv/ – jas Sep 26 '17 at 11:28
  • Maybe a better question is, why are you telling `iconv` to convert to UTF-8 if you want ASCII? – jas Sep 26 '17 at 11:30
  • Then how do I convert to ASCII? I found a misleading thread, which led me to the above iconv command. – mac_online Sep 26 '17 at 12:43
  • If you run `iconv -l` you'll see all of the possible encodings. If you do `iconv -l | grep ASCII` you'll see all the ascii-related ones. Probably you'll see `US-ASCII`, in which case you can change your command to `iconv -f UTF-16LE -t US-ASCII ...`, e.g. – jas Sep 26 '17 at 13:06
  • I tried `iconv -f UTF-16LE -t ASCII xyz.dat -o xyz.dat_test`, but I'm getting this error: "iconv: illegal input sequence at position 0" – mac_online Sep 26 '17 at 13:27
  • That's probably the BOM of your input file. If so, try removing it. – jas Sep 26 '17 at 13:57
  • worked! Thanks Sir – mac_online Sep 26 '17 at 18:04
  • Great! Glad it helped. – jas Sep 26 '17 at 19:13

1 Answer


The issue here is that the BOM is a feature of 'UTF-16', not of 'UTF-16LE'.

Per http://unicode.org/faq/utf_bom.html#gen7:

The BE form uses big-endian byte serialization (most significant byte first), the LE form uses little-endian byte serialization (least significant byte first) and the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.

Note that the option to include a byte order mark applies only to "the unmarked form", meaning 'UTF-16'.

So when you tell iconv that the source encoding is 'UTF-16LE', and then the input starts with FF FE, iconv doesn't interpret the FF FE as a redundant indication of the byte order; rather, it interprets it as U+FEFF ZERO WIDTH NO-BREAK SPACE, and tries to copy that character to the output.

You can fix that by telling iconv that the source encoding is 'UTF-16'; then, when it sees that the input starts with FF FE, it will interpret it as a byte order mark, remove it, and interpret the rest of the input as little-endian.

So, change this:

iconv -f UTF-16LE -t UTF-8 myfile.dat -o myfile.dat_test

to this:

iconv -f UTF-16 -t US-ASCII myfile.dat -o myfile.dat_test

(Note: I've also changed the 'UTF-8' to 'US-ASCII', so that if there are any non-ASCII characters you'll get an explicit error instead of bad output.)
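
To see the difference on a throwaway file, here is a minimal sketch. The temporary directory and the file names `sample.dat`, `le.txt`, and `plain.txt` are made up for illustration and are not the asker's data. It uses output redirection rather than `-o`, since not every iconv build accepts `-o`. The exact error text, and whether the UTF-16LE attempt fails outright, may vary between iconv implementations.

```shell
#!/bin/sh
# Hypothetical scratch location; nothing here touches real data.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.dat"

# Write FF FE (the BOM) followed by "hi" and CRLF as little-endian UTF-16.
# Octal escapes are used because they work in every POSIX printf.
printf '\377\376h\000i\000\r\000\n\000' > "$sample"

# Source declared as UTF-16LE: FF FE is decoded as the character U+FEFF,
# which has no ASCII equivalent, so this attempt is expected to fail.
if iconv -f UTF-16LE -t US-ASCII "$sample" > "$tmpdir/le.txt" 2>/dev/null; then
    le_status="succeeded"
else
    le_status="failed"
fi

# Source declared as UTF-16: FF FE is consumed as a byte order mark and
# the remaining bytes are read as little-endian.
iconv -f UTF-16 -t US-ASCII "$sample" > "$tmpdir/plain.txt"
plain_status=$?

echo "UTF-16LE -> US-ASCII: $le_status"
echo "UTF-16   -> US-ASCII: exit status $plain_status"

# Dump the converted bytes to confirm that no BOM was carried over.
od -c "$tmpdir/plain.txt"
```

Note that iconv only changes the character encoding. The CRLF line terminators are passed through unchanged, so they will still be present in the converted file.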

ruakh
  • On Windows at least, the `iconv` that ships with Cygwin and Strawberry Perl does not support the `-o` option, so I will just redirect the output instead. – Pysis Feb 15 '23 at 18:57
  • @Pysis: Yup, makes sense! The OP used `-o`, so I just left that in place to avoid making unnecessary changes. – ruakh Feb 15 '23 at 21:38