
I have to parse some files which can be in either byte order (big- or little-endian). The Perl interpreter dies if I open a file with one encoding and the file is actually in the other.

open (my $fh, "<:raw:encoding(UTF-16LE):crlf", $ARGV[0]) or die "cannot open file for reading : $! \n";

or

open (my $fh, "<:raw:encoding(UTF-16BE):crlf", $ARGV[0]) or die "cannot open file for reading : $! \n";

Output (for a little-endian file opened with the UTF-16BE encoding):

UTF-16BE:Malformed HI surrogate dc00 at toASCII.pl line 123.
Pradeep
  • Possible duplicate of [Finding if the system is little endian or big endian with perl](http://stackoverflow.com/questions/2610849/finding-if-the-system-is-little-endian-or-big-endian-with-perl) – Eli Sadoff Dec 15 '16 at 20:50
  • @EliSadoff This question asks how to find whether a particular *file* is big or little endian, not whether the system is. – ThisSuitIsBlackNot Dec 15 '16 at 20:56
  • why is "my $fh" in quotes?! :-O – choroba Dec 15 '16 at 20:57
  • If the interpreter dies, just use [eval](http://p3rl.org/eval) or [Try::Tiny](http://p3rl.org/Try::Tiny). – choroba Dec 15 '16 at 20:58
  • Have you tried just UTF-16 instead of UTF-16BE and UTF-16LE? – ThisSuitIsBlackNot Dec 15 '16 at 21:00
  • I don't know how you could possibly know (Imagine a file of random binary data, for example) unless you knew what the contents of the file were _supposed_ to look like. Only then could you check. – Mort Dec 15 '16 at 21:01
  • @ThisSuitIsBlackNot, you are right, the question is not about the system but about the file I am reading. I also tried plain "UTF-16", but in that case it dies for both the BE and the LE files. – Pradeep Dec 15 '16 at 21:28
  • It sounds like the files have no BOM, so you're stuck trying one encoding and switching to the other if it fails. – ThisSuitIsBlackNot Dec 15 '16 at 21:33
  • @choroba edited "my $fh" to be my "$fh". I manually wrote the code here to simplify things. – Pradeep Dec 15 '16 at 21:34
  • @ThisSuitIsBlackNot some files are with BOM some without. And yes, in UTF-16, UTF-16BE, UTF-16BE. – Pradeep Dec 16 '16 at 21:34

2 Answers


Most UTF-16le files are valid UTF-16be files, and vice-versa. For example, there's no way to tell if 0A 00 indicates U+000A (UTF-16le) or U+0A00 (UTF-16be). So, assuming there's no BOM, you have to guess.
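To make the ambiguity concrete, here is a minimal sketch that decodes the same two bytes both ways with the core Encode module. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes 0A 00 from the example above.
my $bytes = "\x0A\x00";

# decode() may consume its input when a CHECK value is given, so work on copies.
my ($copy_le, $copy_be) = ($bytes, $bytes);

# FB_CROAK makes decode() die on malformed input; neither call dies here.
my $le = decode('UTF-16LE', $copy_le, Encode::FB_CROAK());
my $be = decode('UTF-16BE', $copy_be, Encode::FB_CROAK());

printf "UTF-16LE: U+%04X\n", ord $le;   # UTF-16LE: U+000A
printf "UTF-16BE: U+%04X\n", ord $be;   # UTF-16BE: U+0A00
```

Both calls succeed, which is why a strict decode on its own cannot settle the byte order for most files.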

Possible heuristics (in descending order of reliability):

  1. U+FFFE is not a character (guaranteed).
    • If the file starts with FF FE, then it must be UTF-16le.
    • If the file starts with FE FF, then it must be UTF-16be.
  2. The file must be valid UTF-16.
    • If the file isn't valid UTF-16be, then it must be UTF-16le.
    • If the file isn't valid UTF-16le, then it must be UTF-16be.
  3. The file should not contain non-characters.
    • If the file contains non-characters when decoded using UTF-16be, then it must be UTF-16le.
    • If the file contains non-characters when decoded using UTF-16le, then it must be UTF-16be.
  4. U+0A00 isn't currently assigned, but U+000A (LINE FEED) is quite common.
    U+0D00 isn't currently assigned, but U+000D (CARRIAGE RETURN) is quite common.
    • If the file contains 0A 00 or 0D 00, then it's probably UTF-16le.
    • If the file contains 00 0A or 00 0D, then it's probably UTF-16be.
  5. The file should not contain unassigned characters.
    • If the file contains unassigned characters when decoded using UTF-16be, then it's probably UTF-16le.
    • If the file contains unassigned characters when decoded using UTF-16le, then it's probably UTF-16be.
  6. Heuristics based on knowledge of the file format. (Example)
  7. A file is likely to contain more ASCII characters than characters numbered U+xx00.
    • If the file contains many xx 00 and few 00 xx, then it's probably UTF-16le.
    • If the file contains many 00 xx and few xx 00, then it's probably UTF-16be.

Notes:

  • #4 and #5 say "it's probably" instead of "it must be" because what's unassigned today could be assigned tomorrow.
  • #3 includes #1, but #1 is a cheap test.
  • #5 includes #4, but #4 is almost as reliable as #5 without maintaining a long list of unassigned characters that changes over time.
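The cheap byte-level checks from the list (the BOM test, the CR/LF test, and the zero-byte count) could be sketched roughly as follows. The function name `guess_utf16_byte_order` and the tie-breaking order are assumptions made for illustration, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the byte order of raw UTF-16 data from byte
# patterns alone. Returns 'UTF-16LE', 'UTF-16BE', or nothing if undecided.
sub guess_utf16_byte_order {
    my ($bytes) = @_;

    # BOM check: "it must be".
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;

    my %count = (le_newline => 0, be_newline => 0,
                 zero_first => 0, zero_second => 0);

    # Walk the data one 16-bit unit at a time (a trailing odd byte is ignored).
    for my $unit (unpack 'n*', $bytes) {
        my ($first, $second) = ($unit >> 8, $unit & 0xFF);
        next if $first == 0 && $second == 0;    # 00 00 tells us nothing

        if ($first == 0) {                      # 00 xx
            $count{zero_first}++;
            $count{be_newline}++ if $second == 0x0A || $second == 0x0D;
        }
        elsif ($second == 0) {                  # xx 00
            $count{zero_second}++;
            $count{le_newline}++ if $first == 0x0A || $first == 0x0D;
        }
    }

    # CR/LF check: "it's probably".
    return 'UTF-16LE' if $count{le_newline} && !$count{be_newline};
    return 'UTF-16BE' if $count{be_newline} && !$count{le_newline};

    # Zero-byte count: the weakest hint, used only as a last resort.
    return 'UTF-16LE' if $count{zero_second} > $count{zero_first};
    return 'UTF-16BE' if $count{zero_first} > $count{zero_second};
    return;
}

print guess_utf16_byte_order("h\x00i\x00\n\x00"), "\n";   # UTF-16LE
print guess_utf16_byte_order("\x00h\x00i\x00\n"), "\n";   # UTF-16BE
```

This deliberately skips the tests that need a table of non-characters or unassigned code points. Those would have to be layered on top.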

You could slurp in the file using :raw, perform some or all of the above tests on it to determine the encoding, then use decode and s/\r\n/\n/g.
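A rough sketch of that flow, using only the BOM and validity tests, might look like this. The helper names `detect_utf16_by_validity` and `read_utf16_file`, and the `$fallback` parameter, are hypothetical. Since most files are valid in both byte orders, the detection step will often be inconclusive and the fallback (or a further heuristic) decides.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: name the byte order given by a BOM, or the only byte
# order in which $bytes decodes strictly. Returns undef when undecided.
sub detect_utf16_by_validity {
    my ($bytes) = @_;

    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;   # BOM, little-endian
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;   # BOM, big-endian

    my @valid = grep {
        my $copy = $bytes;   # decode() may consume its input when CHECK is set
        defined eval { decode($_, $copy, Encode::FB_CROAK()) };
    } qw(UTF-16LE UTF-16BE);

    return @valid == 1 ? $valid[0] : undef;
}

# Hypothetical helper: slurp a file as raw bytes, decode it, and normalise
# line endings. $fallback is the encoding assumed when detection fails.
sub read_utf16_file {
    my ($path, $fallback) = @_;

    open my $fh, '<:raw', $path or die "Cannot open $path for reading: $!\n";
    my $bytes = do { local $/; <$fh> };   # slurp the whole file
    close $fh;

    my $encoding = detect_utf16_by_validity($bytes) // $fallback
        or die "Cannot determine the byte order of $path\n";

    my $text = decode($encoding, $bytes);
    $text =~ s/\A\x{FEFF}//;   # drop a leading BOM if there was one
    $text =~ s/\r\n/\n/g;      # what the :crlf layer would have done
    return $text;
}
```

Calling `read_utf16_file($ARGV[0], 'UTF-16LE')` would then return the decoded text, treating little-endian as the assumed default for ambiguous files.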

ikegami

You don't show any code, but in general it's impossible to tell what endianness a file is unless you know what values you should be reading from the file. Many file formats, for instance, reserve a few bytes at the beginning to indicate what the format is. If this applies to the data you are dealing with, then you can just read those bytes and change the open mode if you don't get what you're expecting.

Alternatively, since your program dies if the wrong format is chosen, you can use that to test whether the chosen format is correct. Something like this should suit:

my $file = $ARGV[0];

# First attempt: assume little-endian
open my $fh, '<:raw:encoding(UTF-16LE):crlf', $file or die $!;

eval { do_stuff_that_may_crash() };

if ( $@ ) {
    if ( $@ =~ /Malformed (?:HI|LO) surrogate/ ) {
        # Wrong guess: reopen the same handle as big-endian and start over
        open $fh, '<:raw:encoding(UTF-16BE):crlf', $file or die $!;
        do_stuff_that_may_crash();
    }
    else {
        die $@;    # Some other error: rethrow it
    }
}

However, since it sounds like do_stuff_that_may_crash() is pretty much all of your program, you should probably find a better criterion.

Borodin