I am writing a script that takes a UTF-16 encoded text file as input and outputs a UTF-16 encoded text file.

use open "encoding(UTF-16)";

open INPUT, "< input.txt"
   or die "cannot open < input.txt: $!\n";
open(OUTPUT, "> output.txt")
   or die "cannot open > output.txt: $!\n";

while(<INPUT>) {
   print OUTPUT "$_\n"
}

Let's just say that my program writes everything from input.txt into output.txt.

This works perfectly in my Cygwin environment, which is using "This is perl 5, version 14, subversion 2 (v5.14.2) built for cygwin-thread-multi-64int".

But in my Windows environment, which is using "This is perl 5, version 12, subversion 3 (v5.12.3) built for MSWin32-x64-multi-thread", every line in output.txt except the first is prepended with crazy symbols.

For example:

<FIRST LINE OF TEXT>
਀    ㈀  ㄀Ⰰ ㈀Ⰰ 嘀愀 ㌀ 䌀栀椀愀 䐀⸀⸀⸀  儀甀愀渀最 䠀ഊ<SECOND LINE OF TEXT>
...

Can anyone give some insight into why it works on Cygwin but not on Windows?

EDIT: Output after printing the encoding layers as suggested.

In Windows environment:

unix
crlf
encoding(UTF-16)
utf8
unix
crlf
encoding(UTF-16)
utf8

In Cygwin environment:

unix
perlio
encoding(UTF-16)
utf8
unix
perlio
encoding(UTF-16)
utf8

The only difference is the crlf layer on Windows versus the perlio layer on Cygwin.

allenylzhou
  • Perhaps those "crazy symbols" are windows not displaying UTF16 in whatever you're using to view them ;) – Brian Roach Oct 28 '12 at 00:32
  • I am using Notepad++ to display output.txt. It works fine if I use cygwin to run the script and generate the file, but it is also full of crazy symbols when I use windows to run the script – allenylzhou Oct 28 '12 at 00:48
  • Try upgrading your Windows Perl to 5.14 or 5.16, that will eliminate the possibility this is a 5.12 bug. Either [Strawberry Perl](http://strawberryperl.com/) or [ActivePerl](http://www.activestate.com/activeperl/downloads). – Schwern Oct 28 '12 at 01:24

2 Answers

[ I was going to wait and give a thorough answer, but it's probably better if I give you a quick answer than nothing. ]

The problem is that the crlf and encoding layers are in the wrong order: on output, the text is encoded to UTF-16 first, and only then is each 0A byte expanded to 0D 0A. Not your fault.

For example, say you do print "a\nb\nc\n"; using UTF-16le (since it's simpler and it's probably what you actually want). You'd end up with

61 00 0D 0A 00 62 00 0D 0A 00 63 00 0D 0A 00

instead of

61 00 0D 00 0A 00 62 00 0D 00 0A 00 63 00 0D 00 0A 00
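
To make the two byte sequences above easier to reproduce, here is a minimal sketch that simulates both orders with Encode instead of real PerlIO layers (so it behaves the same on any platform). The `hex_dump` helper and the variable names are made up for illustration; the substitution on the encoded bytes only approximates what a crlf layer sitting below the encoding layer does.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $text = "a\nb\nc\n";

# Wrong order: encode to UTF-16LE first, then expand every 0x0A byte to
# 0x0D 0x0A. The CR lands inside a two-byte code unit and corrupts it.
(my $wrong = encode('UTF-16LE', $text)) =~ s/\x0A/\x0D\x0A/g;

# Right order: turn LF into CR LF while the data is still characters,
# then encode the result.
(my $right_chars = $text) =~ s/\n/\r\n/g;
my $right = encode('UTF-16LE', $right_chars);

print hex_dump($wrong), "\n";
print hex_dump($right), "\n";
```

The first line printed should be the broken sequence and the second the intended one, matching the two dumps above.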

I don't think you can get the right results with the open pragma or with binmode, but it can be done using open.

open(my $fh, '<:raw:encoding(UTF-16):crlf', $qfn)
   or die "cannot open $qfn: $!\n";

You'll need to append a :utf8 with some older versions, IIRC.
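
For context, here is a rough sketch of how the whole copy script might look with explicit layers on both handles. It assumes a reasonably recent Perl (no extra :utf8 needed), and the scratch directory, file names and sample lines are invented so the example runs on its own; in the real script you would open your own input.txt and output.txt.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical scratch files, only so the sketch is self-contained.
my $dir     = tempdir(CLEANUP => 1);
my $in_qfn  = File::Spec->catfile($dir, 'input.txt');
my $out_qfn = File::Spec->catfile($dir, 'output.txt');

# Create a small UTF-16 sample input with CR LF line endings.
open(my $seed, '>:raw:encoding(UTF-16):crlf', $in_qfn)
    or die "cannot create $in_qfn: $!\n";
print {$seed} "first line\n", "second line\n";
close($seed) or die "cannot close $in_qfn: $!\n";

# :raw clears the default layers (including crlf on Windows); the
# encoding layer is pushed next and crlf goes on top of it, so line
# endings are translated on characters rather than on encoded bytes.
open(my $in, '<:raw:encoding(UTF-16):crlf', $in_qfn)
    or die "cannot open $in_qfn: $!\n";
open(my $out, '>:raw:encoding(UTF-16):crlf', $out_qfn)
    or die "cannot open $out_qfn: $!\n";

while (my $line = <$in>) {
    print {$out} $line;    # $line already ends in "\n"
}

close($in)  or die "cannot close $in_qfn: $!\n";
close($out) or die "cannot close $out_qfn: $!\n";
```

Lexical filehandles and the three-argument open keep the layer list next to the file it applies to, which avoids relying on the open pragma.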

It works on Cygwin because the crlf layer is only added on Windows. On Cygwin you'd get

61 00 0A 00 62 00 0A 00 63 00 0A 00
ikegami
  • I don't fully understand how these different encoding layers work. But this solved my problem: open my $output,">:raw:encoding(UTF-16)", "output.txt"; Appending :crlf did not seem to make a difference (which is surprising, since you said the problem arises from the wrong order). But prepending :raw is necessary (the same problem occurs otherwise). – allenylzhou Oct 28 '12 at 08:49
  • The difference with and without :crlf is the line endings used (CR LF vs LF) – ikegami Oct 28 '12 at 19:06

You have a typo in your encoding. It should be use open ":encoding(UTF-16)" (note the colon). I don't know why it would work on Cygwin but not on Windows, but it could also be a 5.12 vs. 5.14 thing. Perl seems to compensate for the missing colon, but it could be what's causing your problem.

If that doesn't do it, check if the encoding is being applied to your filehandles.

print map { "$_\n" } PerlIO::get_layers(*INPUT);
print map { "$_\n" } PerlIO::get_layers(*OUTPUT);

Use lexical filehandles (i.e. open my $fh, "<", $file). Glob filehandles are global, so something else in your program might be interfering with them.

If all that checks out and the lexical filehandles are getting encoding(UTF-16) applied, let us know and we can try something else.

UPDATE: This may provide your answer: "BOMed UTF files are not suitable for streaming models, and they must be slurped as binary files instead." It looks like you have to read the file in as binary and do the decoding on the resulting string. This may have been a bug fixed in 5.14.
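
In case it helps, here is a minimal sketch of what "slurp as binary, then decode the string" could look like. The sample bytes are made up and stand in for the contents of a BOMed UTF-16 file read through a :raw handle; whether this is actually needed for your case is an open question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for raw file contents: a little-endian BOM followed by
# UTF-16LE text with CR LF line endings (hypothetical sample data).
my $bytes = "\xFF\xFE" . encode('UTF-16LE', "first\r\nsecond\r\n");

# decode('UTF-16', ...) uses the BOM to pick the byte order and
# returns a character string without the BOM.
my $text = decode('UTF-16', $bytes);

# Normalize line endings on characters, after decoding.
$text =~ s/\r\n/\n/g;

my @lines = split /\n/, $text;
print scalar(@lines), " lines\n";
```

Writing would be the mirror image: build the character string, convert "\n" to "\r\n", encode it, and print the bytes to a :raw handle.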

UPDATE 2: Yep, I can confirm this is a bug that was fixed in 5.14.

Schwern
  • As you suggested, I added the colon and changed to lexical filehandles, but it had no effect. Please see the edit to my question for the print outputs. The only difference was that there was a crlf layer in the Windows environment and a perlio layer in the Cygwin environment. – allenylzhou Oct 28 '12 at 01:16
  • @aylz5073 See update. You may have bumped into a UTF-16 encoding bug in 5.12. – Schwern Oct 28 '12 at 01:31
  • I just tried it with ActivePerl 5.16 and it did not eliminate the problem. One other observation I want to make is that if I change the encoding from ":encoding(UTF-16)" to ":encoding(UTF-16LE)", then output.txt becomes some form of binary file full of NUL markers instead of just pre-pending some weird symbols to my lines of text as shown in my original post. I guess I will try the solution in the link you provided and keep you updated. – allenylzhou Oct 28 '12 at 01:44
  • @aylz5073 The only thing I can think at this point is to give Strawberry Perl a shot. – Schwern Oct 28 '12 at 05:48
  • It's got nothing to do with the BOM. Use UTF-16le and you'll get the same prob. See my answer. – ikegami Oct 28 '12 at 07:01
  • Thanks for both your help. I really appreciate it. – allenylzhou Oct 28 '12 at 08:51