Process text as utf-16 via perl one-liner?

Question

Perl has the -C option (perl -C) for processing UTF-8. Is it possible to tell a Perl one-liner that its input is encoded as UTF-16? A BEGIN block could be used to change the encoding explicitly, but is there a simpler way?

How about `use open ....` or even the -M flag from `perlrun`? — tjd, Jan 05 '15 at 16:32
On Windows, you can't really because :crlf and :encoding would end up in the wrong order. — ikegami, Jan 05 '15 at 16:37
@ikegami interesting. Why is the order of :crlf and :encoding wrong only on Windows? — Thomson, Jan 05 '15 at 16:53
:crlf is only added on Windows. If other builds added :crlf, then they would have the problem too. — ikegami, Jan 05 '15 at 17:17
Unicode is hard and `:crlf` `:bytes` ... so to speak. [UTF-16 Perl input output](http://stackoverflow.com/questions/13105361/utf-16-perl-input-output) might be of help. — G. Cito, Jan 08 '15 at 18:34
@Thomson If it is a file that uses little endian encoding something like the example I added to my answer might work: `perl -00 -MEncode="encode,decode" -E 'binmode(STDOUT, ":bytes"); $text = decode("UTF-16LE", <>) ; print encode("UTF-16LE", $text)' UTF16.txt`. But that might not be your problem ... :-\ — G. Cito, Jan 08 '15 at 18:36

score 3 · Accepted Answer · edited May 23 '17 at 12:20

Can Encode do what you want? You then might have to use encode() and decode() in your script so it might be no shorter than:

    perl -nE 'BEGIN {binmode STDIN, ":encoding(utf16)" } ; ...'

There is a PERL_UNICODE environment variable, but it is fairly limited: it simply mimics -C if I recall correctly.

I once tried to find out why there are no -C switches for other "popular" UTF encodings. It seemed to come down to whether they are frequently used, whether they are well understood (endianness sometimes counts, who knew?), and whether they are, or should be, obsolete. In other words, it's not as simple as it seems.

`perl -MEncode -E 'say for Encode->encodings(":all")'` will show ~ 9 different UTF encodings.
In addition to the usual suspects (perlrun, perlunitut, perlunicode, etc.), one of the most interesting Perl resources on Unicode is right here on Stack Overflow and makes for fascinating reading.

cf. @Leon Timmermans's example and perldoc open, which is fairly thorough:

    % perl -Mopen=":std,:encoding(utf-16)" -E 'print <>' UTF16.txt > other.txt
    % file other.txt
    other.txt: Big-endian UTF-16 Unicode text, with CRLF line terminators

Edit: Another recent discussion asking how to "Turn Off" binmode(STDOUT, ":utf8") Locally touches on PerlIO and "layers" and has a neat solution that might lend itself to a one-liner. See UTF-16 perl input output as well.

I will try to find a real example using Encode to preserve encoding that can be one-lined. It would go something like this "round trip", e.g.:

    % file UTF16.txt
    UTF16.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators

... slurp it up and redirect it to a different file:

    % perl -00 -MEncode="encode,decode" -E '
      $text = decode("UTF-16LE", <>);
      print encode("UTF-16LE", $text)' UTF16.txt > other.txt
    % file other.txt
    other.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators

diff and print the size of the file in bytes:

    % diff UTF16.txt other.txt
    % perl -E 'say [stat]->[7] for @ARGV' UTF16.txt other.txt
    2220
    2220

I ran your last command on Windows with Perl 5.14, against a UTF-16 file which can be viewed by many native applications in Windows, like `type` and `notepad`, but perl complains "UTF-16:Unrecognised BOM 7061" — Thomson, Jan 08 '15 at 11:08
I'm not familiar with how perl on Windows interacts with the various PerlIO layers, but [`perlrun`](http://perldoc.perl.org/perlrun.html) describes many of the options (`:crlf` *etc.*) - a solution might lie there. In your case perhaps the text is `BOM`-ed (Byte Order Marked) and needs Little Endian/Big Endian encodings? — G. Cito, Jan 08 '15 at 15:32
If operating system/software vendors and the Unicode Consortium have yet to provide a really easy to use robust standard it is probably because languages and writing systems are not really easy to use, encode, decode, store over long periods of time, translate, ... even on paper. — G. Cito, Jan 08 '15 at 15:41
@Thomson When dealing with Unicode, IO, layers, `:bytes` and `:crlf` it can be difficult to make a "portable" one-liner that works on Windows and the `Unix/Linux/BSD/Solaris/OSX` family. Now I have a question too. — G. Cito, Jan 08 '15 at 17:33
http://www.perlmonks.org/?node_id=986776 shows how to remove BOM to "unmark" a UTF-16LE encoded document. It sounds scary enough to make a backup. — G. Cito, Jan 08 '15 at 20:26
thanks. I am not sure why changing the encoding doesn't work for the empty diamond operator on Windows. I opened the same file explicitly with UTF-16LE, and it works. `perl -e"open(I,'<:encoding(UTF-16LE)', $ARGV[0]);while (<I>) {print}" someutf16file` — Thomson, Jan 09 '15 at 03:58

score 3 · Answer 2 · answered Jan 05 '15 at 17:15

You can do that using `perl -Mopen=":std,IN,:encoding(utf-16)" -e '...'`

Leon Timmermans

This one works well on Windows. I think `IN` is necessary and the key here. Could you explain a little more about this syntax? – Thomson Jan 09 '15 at 04:01
