Windows Perl --> Unix not working after port, possible encoding issue

Question

I've got a Perl program that I wrote on Windows. It starts with:

$unused_header = <STDIN>;
my @header_fields = split('\|\^\|', $unused_header, -1);

Which should split input that consists of a very large file of:

The|^|Quick|^|Brown|^|Fox|!|

Into:

{The, Quick, Brown, Fox|!|}

Note: This line handles the header alone; there's another one like it for the repetitive data lines.

It worked great on Windows, but on Linux it fails. However, if I define a string with the same contents within Perl and run the split on that, it works fine.

I think it's a UTF-16 encoding issue, but I'm not sure how to handle it. Does anyone know how I can get Perl to understand the UTF-16 being piped into STDIN?

I found: http://www.haboogo.com/matching_patterns/2009/01/utf-16-processing-issue-in-perl.html but I'm not sure what to do with it.

score 5 · Accepted Answer · answered Sep 24 '12 at 22:40


If STDIN is UTF-16, use one of the following:

binmode(STDIN, ':encoding(UTF-16le)');   # Byte order used by Windows.
binmode(STDIN, ':encoding(UTF-16be)');   # The other byte order.
binmode(STDIN, ':encoding(UTF-16)');     # Use BOM to determine byte order.
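
To see why the split fails on undecoded input and works once a layer is applied, here is a minimal, self-contained sketch. It assumes the input is UTF-16LE without a BOM and uses an in-memory filehandle as a stand-in for STDIN; the variable names and the sample line are illustrative, not taken from the original script.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Simulate the piped input: the sample header line encoded as UTF-16LE
# (an assumption; the real byte order depends on how the file was produced).
my $bytes = encode('UTF-16LE', "The|^|Quick|^|Brown|^|Fox|!|\n");

# Without a decoding layer, every ASCII character is followed by a NUL byte,
# so the three-character delimiter |^| never appears contiguously.
my @raw_fields = split /\|\^\|/, $bytes, -1;

# With the layer, Perl decodes the bytes into characters before the split.
open(my $in, '<:encoding(UTF-16LE)', \$bytes)
    or die "Cannot open in-memory handle: $!";
my $header = <$in>;
close $in;
chomp $header;
my @header_fields = split /\|\^\|/, $header, -1;

print scalar(@raw_fields), " field(s) without decoding\n";
print scalar(@header_fields), " field(s) after decoding: @header_fields\n";
```

In the real script, the `binmode(STDIN, ...)` call would take the place of the `open` on the in-memory handle, and it has to run before the first read from STDIN.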

ikegami

Good call. I had to use binmode(STDIN, ':encoding(UTF-16)'); on both STDIN and STDOUT, since my script rewrote the files. If you apply it to only one of them, you get mismatches when printing Unicode characters. – John Humphreys Sep 25 '12 at 00:16
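
The point about applying the layer to both handles can be illustrated with a small round trip. This is only a sketch: it uses in-memory filehandles in place of STDIN and STDOUT, assumes UTF-16LE on both sides, and the sample text is made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Made-up input containing a non-ASCII character (e with an acute accent).
my $input_bytes  = encode('UTF-16LE', "Caf\x{E9}|^|Fox|!|\n");
my $output_bytes = '';

# Stand-ins for STDIN and STDOUT, each with the same encoding layer.
open(my $in,  '<:encoding(UTF-16LE)', \$input_bytes)  or die "open in: $!";
open(my $out, '>:encoding(UTF-16LE)', \$output_bytes) or die "open out: $!";

while (my $line = <$in>) {
    print {$out} $line;   # decoded characters in, re-encoded bytes out
}
close $in;
close $out;

print $input_bytes eq $output_bytes
    ? "round trip preserved the bytes\n"
    : "bytes differ\n";
```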

score 3 · Answer 2 · edited May 23 '17 at 10:29


Tom has written a lengthy answer with regard to Perl and Unicode. It contains some boilerplate code to properly and fully support UTF-8, but you can replace that with UTF-16 as needed.

answered Sep 24 '12 at 22:47

hlovdal

score 0 · Answer 3 · answered Sep 24 '12 at 23:05

I doubt it's a UTF-xx encoding issue, as neither Windows Perl nor Unix Perl will try to read data with those encodings unless you tell it to.

If the Unix script is reading the exact same file as the Windows script but behaves differently, maybe it's a line-ending issue. The dos2unix command on most Unix-y systems can change the line endings in a file, or you can strip off the line endings yourself in the Perl script:

$unused_header = <STDIN>;
$unused_header =~ s/\r?\n$//;   # chop \r\n (Windows) or \n (Unix)
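
One way to tell the encoding explanation apart from the line-ending one is to look at the raw bytes before choosing a fix. The following is a rough sketch rather than anything from the answers above: the helper name `guess_encoding` is hypothetical, and its heuristics (a BOM, or NUL bytes in alternating positions) are assumptions that only suit mostly-ASCII data like the sample.

```perl
use strict;
use warnings;

# Hypothetical helper: guess how a chunk of raw input bytes is encoded.
sub guess_encoding {
    my ($bytes) = @_;
    return 'UTF-16LE (BOM)' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE (BOM)' if $bytes =~ /\A\xFE\xFF/;
    # ASCII text stored as UTF-16 has a NUL byte in every other position.
    return 'probably UTF-16LE' if $bytes =~ /\A[^\0]\0/;
    return 'probably UTF-16BE' if $bytes =~ /\A\0[^\0]/;
    return 'single-byte or UTF-8';
}

# Sample inputs built by hand for illustration.
my $utf16le = "\xFF\xFE" . join('', map { "$_\0" } split //, "The|^|Quick\r\n");
my $plain   = "The|^|Quick\r\n";

print guess_encoding($utf16le), "\n";
print guess_encoding($plain), "\n";

# If the data is plain bytes, a stray \r is the more likely culprit.
(my $stripped = $plain) =~ s/\r?\n\z//;
print "stripped: $stripped\n";
```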
