Perl has the `-C` option for processing UTF-8. Is it possible to tell a Perl one-liner that the input is in UTF-16 encoding? A `BEGIN` block could be used to change the encoding explicitly, but is there a simpler way?
-
How about `use open ....` or even the `-M` flag from `perlrun`? – tjd Jan 05 '15 at 16:32
-
@tjd I am wondering your full solution:) – Thomson Jan 05 '15 at 16:35
-
On Windows, you can't really, because `:crlf` and `:encoding` would end up in the wrong order. – ikegami Jan 05 '15 at 16:37
-
@ikegami Interesting. Why is the order of `:crlf` and `:encoding` wrong only on Windows? – Thomson Jan 05 '15 at 16:53
-
`:crlf` is only added on Windows. If other builds added `:crlf`, then they would have the problem too. – ikegami Jan 05 '15 at 17:17
-
Unicode is hard and `:crlf` `:bytes` ... so to speak. [UTF-16 Perl input output](http://stackoverflow.com/questions/13105361/utf-16-perl-input-output) might be of help. – G. Cito Jan 08 '15 at 18:34
-
@Thomson If it is a file that uses little endian encoding something like the example I added to my answer might work: `perl -00 -MEncode="encode,decode" -E 'binmode(STDOUT, ":bytes"); $text = decode("UTF-16LE", <>) ; print encode("UTF-16LE", $text)' UTF16.txt`. But that might not be your problem ... :-\ – G. Cito Jan 08 '15 at 18:36
2 Answers
Can `Encode` do what you want? You then might have to use `encode()` and `decode()` in your script, so it might be no shorter than:
perl -nE 'BEGIN {binmode STDIN, ":encoding(utf16)" } ; ...'
There is a `PERL_UNICODE` environment variable, but it is fairly limited: it simply mimics `-C`, if I recall correctly.
I once tried to find out why there aren't `-C` switches for other "popular" forms of UTF. It seemed to come down to whether or not they are frequently used, whether they are well understood (endianness sometimes counts, who knew?), and whether they are, or should be, obsolete. In other words, it's not as simple as it seems.
perl -MEncode -E 'say for Encode->encodings(":all")'
will show about 9 different UTF encodings.
In addition to the usual suspects (`perlrun`, `perlunitut`, `perlunicode`, etc.), one of the most interesting Perl resources on Unicode is right here on Stack Overflow and makes for fascinating reading.
cf. @Leon Timmerman's example and `perldoc open`, which is fairly thorough:
% perl -Mopen=":std,:encoding(utf-16)" -E 'print <>' UTF16.txt > other.txt
% file other.txt
other.txt: Big-endian UTF-16 Unicode text, with CRLF line terminators
Edit: Another recent discussion asking how to "Turn Off" binmode(STDOUT, ":utf8") Locally touches on PerlIO and "layers" and has a neat solution that might lend itself to a one-liner. See UTF-16 perl input output as well.
I will try to find a real example that uses `Encode` to preserve the encoding and can be written as a one-liner. It would go something like this "round trip", e.g.:
% file UTF16.txt
UTF16.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
... slurp it up and redirect it to a different file:
% perl -00 -MEncode="encode,decode" -E '
$text = decode("UTF-16LE", <>) ;
print encode("UTF-16LE", $text)' UTF16.txt > other.txt
% file other.txt
other.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
Then `diff` the two files and print the size of each file in bytes:
% diff UTF16.txt other.txt
% perl -E 'say [stat]->[7] for @ARGV' UTF16.txt other.txt
2220
2220
-
I ran your last command on Windows with Perl 5.14 against a UTF-16 file that can be viewed by many native Windows applications, such as `type` and `notepad`, but perl complains "UTF-16:Unrecognised BOM 7061" – Thomson Jan 08 '15 at 11:08
-
I'm not familiar with how perl on Windows interacts with the various PerlIO layers, but [`perlrun`](http://perldoc.perl.org/perlrun.html) describes many of the options (`:crlf` *etc.*) - a solution might lie there. In your case perhaps the text is `BOM`-ed (Byte Order Marked) and needs Little Endian/Big Endian encodings? – G. Cito Jan 08 '15 at 15:32
-
If operating system/software vendors and the Unicode Consortium have yet to provide a really easy to use robust standard it is probably because languages and writing systems are not really easy to use, encode, decode, store over long periods of time, translate, ... even on paper. – G. Cito Jan 08 '15 at 15:41
-
@Thomson When dealing with Unicode, IO, layers, `:bytes` and `:crlf` it can be difficult to make a "portable" one-liner that works on Windows and the `Unix/Linux/BSD/Solaris/OSX` family. Now I have a question too. – G. Cito Jan 08 '15 at 17:33
-
http://www.perlmonks.org/?node_id=986776 shows how to remove BOM to "unmark" a UTF-16LE encoded document. It sounds scary enough to make a backup. – G. Cito Jan 08 '15 at 20:26
-
Thanks. I am not sure why changing the encoding does not work for the empty diamond operator on Windows. I opened the same file explicitly with UTF-16LE, and it works: `perl -e"open(I,'<:encoding(UTF-16LE)', $ARGV[0]);while (<I>) {print}" someutf16file` – Thomson Jan 09 '15 at 03:58
You can do that using perl -Mopen=":std,IN,:encoding(utf-16)" -e '...'

-
This one works well on Windows. I think `IN` is necessary and the key here. Could you explain a little more about this syntax? – Thomson Jan 09 '15 at 04:01