I've got a Perl program that I wrote on Windows. It starts with:

$unused_header = <STDIN>;
my @header_fields = split('\|\^\|', $unused_header, -1);

This should split each line of a very large input file, where a line looks like:

The|^|Quick|^|Brown|^|Fox|!|

Into:

{The, Quick, Brown, Fox|!|}

Note: This line handles only the header; there's another one like it for the repeated data lines.

It worked great on Windows, but on Linux it fails. However, if I define a string with the same contents within Perl and run the split on that, it works fine.

I think it's a UTF-16 encoding issue, but I'm not sure how to handle it. Does anyone know how I can get Perl to understand the UTF-16 being piped into STDIN?

I found: http://www.haboogo.com/matching_patterns/2009/01/utf-16-processing-issue-in-perl.html but I'm not sure what to do with it.

John Humphreys

3 Answers

If STDIN is UTF-16, use one of the following:

binmode(STDIN, ':encoding(UTF-16le)');   # Byte order used by Windows.
binmode(STDIN, ':encoding(UTF-16be)');   # The other byte order.
binmode(STDIN, ':encoding(UTF-16)');     # Use BOM to determine byte order.
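
To illustrate why the decoding step matters, here is a minimal sketch that is not part of the original answer. It assumes the header line from the question arrives as UTF-16LE bytes, and the variable names are illustrative only. Without decoding, every ASCII character is followed by a NUL byte, so the pattern `|^|` never matches; after decoding, which is what an `:encoding(...)` layer does for a filehandle, the split finds all four fields.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumed sample: the header line from the question.
my $line = "The|^|Quick|^|Brown|^|Fox|!|";

# Simulate what an undecoded read of UTF-16LE input hands to the program.
my $bytes = encode('UTF-16LE', $line);

# Splitting the raw bytes: NUL bytes sit between '|', '^' and '|', so no match.
my @raw = split /\|\^\|/, $bytes, -1;

# Splitting after decoding: the delimiter matches as expected.
my @decoded = split /\|\^\|/, decode('UTF-16LE', $bytes), -1;

printf "raw: %d field(s), decoded: %d field(s)\n", scalar @raw, scalar @decoded;
# prints "raw: 1 field(s), decoded: 4 field(s)"
```

In a real script the `binmode` call on STDIN does this decoding for every line read, so no explicit `decode` is needed.
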
ikegami
  • Good call. I had to use binmode(STDIN, ':encoding(UTF-16)'); and the equivalent call on STDOUT, since my script rewrites the files. If you apply it to only one of them, the printed output is mismatched whenever the data contains Unicode characters. – John Humphreys Sep 25 '12 at 00:16
Tom has written a lengthy answer regarding Perl and Unicode. It contains some boilerplate code to properly and fully support UTF-8, but you can substitute UTF-16 as needed.

hlovdal
I doubt it's a UTF-xx encoding issue, as neither Windows Perl nor Unix Perl will try to read data with those encodings unless you tell it to.

If the Unix script is reading the exact same file as the Windows script but behaves differently, it may be a line-ending issue. The dos2unix command on most Unix-like systems can convert the line endings in a file, or you can strip the line endings yourself in the Perl script:

$unused_header = <STDIN>;
$unused_header =~ s/\r?\n$//;   # chop \r\n (Windows) or \n (Unix)
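
As a small self-contained check of that substitution, the sketch below applies it to one line with Windows endings and one with Unix endings before splitting. The helper name `strip_eol` and the sample lines are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# strip_eol is a hypothetical helper name, not from the original answer.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n$//;   # remove a trailing \r\n (Windows) or \n (Unix)
    return $line;
}

# Assumed sample lines: the same data with each kind of line ending.
for my $line ("Brown|^|Fox|!|\r\n", "Brown|^|Fox|!|\n") {
    my @fields = split /\|\^\|/, strip_eol($line), -1;
    print "last field: [$fields[-1]]\n";   # prints "last field: [Fox|!|]" both times
}
```

Without the substitution, the last field of the Windows line would still carry a trailing carriage return and newline.
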
mob