3

I have a file in Unicode format on a Windows machine. Is there any way to convert it to ASCII on the same machine using a Perl script?

It's UTF-16 with a BOM.

ashokbabuy
  • What exactly do you mean by "Unicode format"? UTF-8, UTF-16 or UTF-32? With or without BOM? – David Heffernan Nov 15 '11 at 20:43
  • And how exactly would you convert it to ASCII? If it contains Götterdämmerung or 뀋 what should they be converted into? – tripleee Nov 15 '11 at 20:47
  • http://stackoverflow.com/questions/1490218/utf-16-to-ascii-conversion-in-java – Wayne Nov 15 '11 at 21:05
  • The answers of Karsten Silkenbäumer and David W. can be reduced to the command [piconv](http://p3rl.org/piconv) which ships with Perl, so just use that. – In http://stackoverflow.com/q/1970660#1974459 I show alternative code that is not so destructive to non-ASCII characters. – daxim Nov 16 '11 at 11:30
  • @daxim - Thanks for the pointer to `piconv`. I never knew about that one before. I'll look at your answer in the other Stack Overflow question too. – David W. Nov 16 '11 at 16:57
  • yes, piconv is a tool. I thought @ashokbabuy wanted to have perl code to reuse. – Karsten S. Nov 19 '11 at 08:51

2 Answers

9

If you want to convert Unicode to ASCII, be aware that some characters can't be converted because they simply don't exist in ASCII. If you can live with that, you can try this:

#!/usr/bin/env perl
use strict;
use warnings;
use autodie;

use open IN => ':encoding(UTF-16)';
use open OUT => ':encoding(ascii)';

my $buffer;

open(my $ifh, '<', 'utf16bom.txt');
read($ifh, $buffer, -s $ifh);
close($ifh);

open(my $ofh, '>', 'ascii.txt');
print($ofh $buffer);
close($ofh);

If you do not have autodie, just remove that line. You should then add an explicit error check to each of your open/close statements, like this:

open(...) or die "error: $!\n";

If the file contains characters that can't be converted, you will get warnings on the console and your output file will have text like

\x{00e4}\x{00f6}\x{00fc}\x{00df}

in it. BTW: If the file has no BOM but you know it is big-endian (or little-endian), you can change the input encoding line to

use open IN => ':encoding(UTF-16BE)';

or

use open IN => ':encoding(UTF-16LE)';
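
To illustrate the difference between the three encoding names, here is a minimal sketch that decodes the letter "A" from hand-written byte strings with the core Encode module. The byte strings and variable names are made up for the example; the last line shows what happens when the byte order is guessed wrong.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);

# The letter "A" (U+0041) as UTF-16 bytes in three forms (sample data).
my $with_bom = "\xFF\xFE" . "A\x00";   # little-endian BOM, then "A"
my $le_bytes = "A\x00";                # little-endian, no BOM
my $be_bytes = "\x00A";                # big-endian, no BOM

# "UTF-16" reads the BOM to pick the byte order and strips it.
my $from_bom = decode('UTF-16', $with_bom);

# "UTF-16LE" and "UTF-16BE" assume the byte order and expect no BOM.
my $from_le = decode('UTF-16LE', $le_bytes);
my $from_be = decode('UTF-16BE', $be_bytes);

# Guessing wrong swaps the two bytes of every code unit.
my $wrong = decode('UTF-16BE', $le_bytes);

printf "BOM: %s  LE: %s  BE: %s  wrong: U+%04X\n",
    $from_bom, $from_le, $from_be, ord($wrong);
```

With the wrong byte order, "A" comes out as U+4100, which has no ASCII equivalent, so a mismatch usually shows up as a flood of "does not map to ascii" warnings.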

I hope it works under Windows as well; I can't try it right now.
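
Since the comment below reports that `use open` and autodie may not cooperate, one way to sidestep the question is to name the encoding layer in each `open` call and check errors by hand. The following is only a sketch under that assumption: it reuses the file names from above and first writes a small sample input so that it runs on its own.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $in_name  = 'utf16bom.txt';
my $out_name = 'ascii.txt';

# Create a small UTF-16 sample file (assumed test data).
# The "UTF-16" encoder writes a BOM at the start.
open(my $sample, '>:encoding(UTF-16)', $in_name)
    or die "Cannot write $in_name: $!\n";
print {$sample} "plain text\n";
close($sample) or die "Cannot close $in_name: $!\n";

# Read: the layer is given in the open call itself, not via "use open".
open(my $ifh, '<:encoding(UTF-16)', $in_name)
    or die "Cannot read $in_name: $!\n";
my $buffer = do { local $/; <$ifh> };   # slurp the whole file
close($ifh) or die "Cannot close $in_name: $!\n";

# Write: same idea for the ASCII output.
open(my $ofh, '>:encoding(ascii)', $out_name)
    or die "Cannot write $out_name: $!\n";
print {$ofh} $buffer;
close($ofh) or die "Cannot close $out_name: $!\n";

print "wrote ", length($buffer), " characters to $out_name\n";
```

Because the layers are spelled out per file handle, the behaviour does not depend on which pragmas are loaded or in what order.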

Karsten S.
  • «use open …» and autodie clash with each other! The latter will ignore the former. Both of them are severely flawed in their own way, but «use open» gives you so little that you should probably avoid it. – Leon Timmermans Nov 16 '11 at 02:02
3

Take a look at the encoding option on the Perl open command. You can specify the encoding when opening a file for reading or writing.

It'd be something like this:

#! /usr/bin/env perl
use strict;
use warnings;
use autodie;

open (my $utf16_fh, "<:encoding(UTF-16BE)", "test.utf16.txt");
open (my $ascii_fh, ">:encoding(ASCII)", "test.ascii.txt");  # placeholder output name

while (my $line = <$utf16_fh>) {
    print $ascii_fh $line;
}

close $utf16_fh;
close $ascii_fh;
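
A possible variation on the loop above, offered as a sketch and not as tested Windows code: read with `:encoding(UTF-16)` so the BOM decides the byte order, and call Encode's `encode` yourself so that unmappable characters become "?" without warnings. The helper name `to_ascii_bytes` and the in-memory sample data are assumptions made for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: convert one decoded line to ASCII bytes.
# Without a CHECK argument, encode() replaces each unmappable character
# with the encoding's substitution character, which is "?" for ascii.
sub to_ascii_bytes {
    my ($text) = @_;
    return encode('ascii', $text);
}

# In-memory stand-in for a UTF-16 file with a BOM (assumed sample data).
my $utf16_bytes = encode('UTF-16', "G\x{f6}tterd\x{e4}mmerung\nplain\n");

# The BOM written by the encoder tells the layer which byte order to use.
open(my $in, '<:encoding(UTF-16)', \$utf16_bytes)
    or die "Cannot open in-memory input: $!\n";

my $ascii = '';
while (my $line = <$in>) {
    $ascii .= to_ascii_bytes($line);
}
close($in) or die "Cannot close in-memory input: $!\n";

print $ascii;
```

Replacing the in-memory scalar with a file name gives the same loop for real files; whether "?" is an acceptable stand-in for lost characters depends on what the output is used for.
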
David W.
  • UTF-16 BOM and UTF-16BE are not (necessarily) the same thing. You want `:encoding(UTF-16)` to determine the endianness from the BOM. – cjm Nov 16 '11 at 06:09
  • (In fact, UTF-16le is much more likely) – ikegami Nov 16 '11 at 09:02
  • This is the test program I ran on my computer. I used VIM to convert a file to UTF-16, then tried converting it back to ascii using Perl. The first time I had _UTF-16_, but then I got `UTF-16:Unrecognised BOM 22`. I then decided to try _UTF-16LE_ since that was the most likely, but I got a bunch of `"\x{4100}" does not map to ascii`. Changing it to _UTF-16BE_ worked. All I can say is that __This worked on _my_ computer__. I sort of suspected that there might be issues with my exact code which is why I pointed out the `open` command on Perldoc and why I said "_It'd be something like this_" – David W. Nov 16 '11 at 16:55
  • So your reply to the point that your program doesn't match the OP's spec is that your code works with an entirely different format (UTF-16be with no BOM) on your machine? – ikegami Nov 16 '11 at 23:27