
I'm using ActivePerl on Windows and I can never seem to get a UCS-2 little-endian file to convert properly to UTF-8. The best I could manage is what seems like a proper conversion, except that the first line, which is 4 characters, gets mangled into strange Chinese/Japanese characters; the rest of the file seems OK.

What I really want is a one-liner Perl search/replace regex of the usual form:

perl -pi.bak -e 's/replacethis/withthat/g;' my_ucs2file.txt

That won't work, so I first tried to see whether Perl can do a proper conversion at all, and I'm stuck. I'm using:

perl -i.BAKS -MEncode -p -e "Encode::from_to($_, 'UCS-2', 'UTF-8')" My_UCS2file.txt

I tried using UCS2 or UCS-2LE but still can't get a proper conversion.

I recall reading somewhere that someone had to delete a couple of bytes (the BOM?) at the beginning of a UCS-2 file to get conversion working, but I can't remember the details...

When I tried PowerShell, it complained that it didn't know UCS2 / UCS-2...??

I'd appreciate any ideas. I noticed Notepad++ opens the file and recognizes it fine, and I can edit and resave it there, but there's no command-line ability...

htfree

1 Answer


The one-liner way is to avoid perl entirely and just use iconv -f UCS-2LE -t UTF-8 infile > outfile, but I'm not sure whether that's available on Windows.
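As daxim points out in a comment below, piconv, an iconv work-alike, ships with Perl; assuming the usual -f/-t style of invocation, something like this should do the same conversion where iconv isn't installed:

$ piconv -f UCS-2LE -t UTF-8 infile > outfile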

So, with perl as a one-liner:

$ perl -Mopen="IN,:encoding(UCS-2LE),:std" -C2 -0777 -pe 1 infile > outfile
  • -0777 combined with -p reads an entire file at a time, instead of a line at a time, which is one place where you were going wrong: when your code points are 16 bits but you're treating them as 8-bit ones, finding the line separators is going to be problematic.
  • -C2 says to use UTF-8 for standard output.
  • -Mopen="IN,:encoding(UCS-2LE),:std" says that the default encoding for input streams, including standard input (so it'll work with redirected input, not just files), is UCS-2LE. See the open pragma for details (in a script it'd be use open IN => ':encoding(UCS-2LE)', ':std';). Speaking of encoding, another issue you're having is that UCS-2 is a synonym for UCS-2BE. See Encode::Unicode for details.

So that just reads a file at a time, converting from UCS-2LE to perl's internal encoding, and prints it back out again as UTF-8.
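One caveat: if the file starts with a byte-order mark (the bytes FF FE for little-endian; probably the "couple of bytes" you remember people deleting), the :encoding(UCS-2LE) layer doesn't treat it specially; it just comes through as a U+FEFF character at the very start of the decoded data. If that stray character matters to whatever reads the output, a small substitution in the same one-liner should strip it (untested on Windows, so consider it a sketch):

$ perl -Mopen="IN,:encoding(UCS-2LE),:std" -C2 -0777 -pe 's/\A\x{FEFF}//' infile > outfile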

If you didn't have to worry about Windows line-ending conversion,

$ perl -MEncode -0777 -pe 'Encode::from_to($_, "UCS-2LE", "UTF-8")' infile > outfile

would also work.
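If a one-liner gets too fiddly to quote on the Windows command line, the same Encode-based conversion can be written as a tiny script. This is just a sketch: it reads raw octets from standard input, writes UTF-8 to standard output, and the BOM-stripping line is optional.

use strict;
use warnings;
use Encode qw(decode encode);

binmode STDIN;                             # raw octets in, no CRLF translation
binmode STDOUT;                            # raw octets out
local $/;                                  # slurp the whole stream at once
my $octets = <STDIN>;
my $chars  = decode('UCS-2LE', $octets);   # plain 'UCS-2' would mean big-endian
$chars =~ s/\A\x{FEFF}//;                  # drop a leading BOM character, if any
print encode('UTF-8', $chars);

Run it as perl convert.pl < infile > outfile (the script name is just a placeholder).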


If you want the output file to be in UCS-2LE too, and not just convert between encodings:

$ perl -Mopen="IO,:encoding(UCS-2LE),:std" -pe 's/what/ever/' infile > outfile
Shawn
  • Great, thanks. I did a quick test and the conversion seems to work, but it looks like it adds empty lines when viewed in Notepad++. Is it possible to do the regex on the UCS-2 file/stream without converting to UTF-8? Thanks! (The end file after the regex needs to be in the same UCS-2 format as the original.) – htfree May 07 '19 at 03:58
  • Strange: even on the converted UTF-8 file, a regex like perl -pi.BAK -e 's/foo/bar/g;' PerlOut1.txt doesn't seem to work, and the output file is the same as the original even though foo is definitely the top word at the top of the file. – htfree May 07 '19 at 04:34
  • The regex perl -pi.BAK -e 's/foo/bar/g;' PerlOut1.txt doesn't work, and I just noticed Notepad++ thinks the file is Macintosh (CR) instead of Windows; maybe that's why it shows extra empty lines, but the top line does indeed have "foo". – htfree May 07 '19 at 04:35
  • @htfree Using `IO` instead of `IN` (And dropping the `-C2`) will use the specified encoding as the default for input *and* output. But, really, avoid using UCS-2 or UTF-16 for text files as much as possible. It's way too much of a pain to work with. – Shawn May 07 '19 at 04:43
  • I don't have much choice since I'm manipulating files generated by Windows built-in utils; I'll give it a try, thanks. (Sorry for the double post; I thought one was deleted, then it came back.) – htfree May 07 '19 at 05:32
  • That does not work, unfortunately. perl -Mopen="IO,:encoding(UCS-2LE),:std" -0777 -pe 1 MyUCS2.txt > PerloutUCS2.txt produces a file that Notepad++ thinks is ANSI, Macintosh (CR), and it has NUL between every character. Is it possible to attach files here? – htfree May 07 '19 at 05:44
  • @htfree 'NUL between every character' suggests that your program is treating a UCS-2 file as a UTF-8/ASCII/etc. byte-oriented one. – Shawn May 07 '19 at 05:46
  • If you have Windows you can make such a file by running "icacls C:\windows\system32 /T /save Myfile.txt" from the command line. Notepad++ opens the original MyUCS2.txt file just fine and identifies it as UCS-2 Little Endian with Windows CR LF. – htfree May 07 '19 at 05:48
  • I can edit the UCS-2 file with Notepad++ and resave changes, and that works perfectly; the Windows tool is able to read the resulting file. But Notepad++ has no command-line usage. – htfree May 07 '19 at 05:59
  • I'm trying to read through https://www.perlmonks.org/?node_id=608532 now; before finding that link I had tried something like perl "-M5;binmode($_,':raw:encoding(UCS2-LE):crlf')for(STDIN,STDOUT)" -pe "s/foo/bar/gi" MyUCS2LE_CRLF.txt > ModifiedUCS2LE_CRLF.txt and it didn't work. – htfree May 07 '19 at 06:20
  • I'll have to scrounge up a windows computer and test there; everything I've posted is working fine on Linux for reading and creating files of the desired encoding. Might need an explicit `:crlf` added in? – Shawn May 07 '19 at 15:00
  • If you'd like, I can probably set up some remote-control access for you. I had already tried variations like perl -Mopen="IO,:encoding(UCS-2LE):crlf,:std" -pe 's/foo/bar/' MyUCS2_ACLS.txt > ShawnPerlout.txt, but the regex doesn't do anything and the output file is then messed up by putting NUL after every letter. I think perhaps if you can turn what this post says into something I can use, it might work: https://www.perlmonks.org/?node_id=689841 He explains it's a workaround that fixes a Windows issue with CRLF etc. – htfree May 07 '19 at 19:40
  • I have also confirmed it works perfectly on Linux. I verified that perl -Mopen="IO,:encoding(UCS-2LE),:std" -pe 's/foo/bar/' Windows_UCS2le.txt > PerlRegexedUCS2le.txt works on a Linux box using the exact same UCS-2LE file. As they discussed at https://www.perlmonks.org/?node_id=608532 this seems to be a bug encountered only on Windows. Maybe it's an issue with the Windows command line messing things up somehow; see https://stackoverflow.com/questions/16598785/save-text-file-in-utf-8-encoding-using-cmd-exe – htfree May 07 '19 at 21:33
  • Perhaps we can use the chcp command to set the Windows command line to UCS-2LE, but if UTF-8 is 65001, what is the code for UCS-2LE? I even tried running "cmd /u" first at the DOS prompt and then running the perl command (see this interesting link: https://stackoverflow.com/questions/32182619/chcp-65001-and-a-bat-file ), but the perl regex did nothing and again added NUL after every letter. – htfree May 07 '19 at 21:34
  • Windows is a headache, so chcp 1200 doesn't work! I got the code from https://learn.microsoft.com/en-us/windows/desktop/Intl/code-page-identifiers ; it's for UTF-16 little endian, which is the same as UCS-2LE, but it seems not to be supported. Supposedly PowerShell can handle UCS-2LE correctly, according to https://superuser.com/questions/429069/utf-16-file-output-in-cmd-exe – htfree May 07 '19 at 21:49
  • WOW! Mind-boggling. Maybe I've just been too sleep-deprived, because I likely tried EVERY possible combination except the simplest one that seems to work!?!? Here it is: perl -Mopen=IO,:raw:encoding(UCS-2LE) -pi.ZZZ -e "s/foo/bar/g;" Win_UCS2-LE.txt I don't get it; I thought we tried this before, or almost this. – htfree May 07 '19 at 23:23
  • If you add some of the stuff I wrote into your answer, along with that final command (which can safely substitute UCS-2LE for UTF-16LE), I'll accept your answer as the solution. Thanks for your efforts! I'll have to confirm again that this does work, just in case I'm so dead tired my eyes are playing mean tricks on me! – htfree May 07 '19 at 23:26
  • [piconv](http://p3rl.org/piconv), an iconv work-alike, ships with Perl. – daxim May 08 '19 at 08:18
  • @daxim Oh wow, great, I didn't know I had that. Good to know, thanks! – htfree May 08 '19 at 22:12
  • @daxim But if I use piconv -f UCS2-LE -t UCS2-LE inputfile > piconvout, I see that piconv messes up the Windows CRLF into Macintosh CR, and if the input UCS2-LE file does not have a BOM, the output becomes ANSI with NUL after every letter. Again, this is on Windows with piconv; I've confirmed Linux doesn't seem to have some of the issues I noted before (see previous comments). – htfree May 08 '19 at 22:39