I have a script that I run on various texts to convert XHTML entities (e.g., &uuml;) to ASCII. For example, my script is written in the following manner:

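# Read from one file and write to another; opening the same file with '>' would truncate it.
# 'file.out' is a placeholder name for the output file.
open(my $in,  '<', 'file')     or die "Can't open input file: $!";
open(my $out, '>', 'file.out') or die "Can't open output file: $!";

while (my $line = <$in>) {
     $line =~ s/&uuml;/ü/g;   # replace the entity with the literal character
     print {$out} $line;      # print inside the loop, once per line
}

close($out) or die "Can't close output file: $!";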

This works as expected and substitutes the XHTML entity with the ASCII equivalent. However, since it is run often, I've attempted to convert it into a module. But Perl doesn't return "ü"; it returns the decomposition.
How can I get Perl to return the data back with the ASCII equivalent (as run and printed in my regular .pl file)?

1 Answer

There is no ASCII. Not in practice anyway, and certainly not outside the US. I suggest you specify an encoding that will have all characters you might encounter (ASCII does not contain ü, it is only a 7-bit encoding!). Latin-1 is possible, but still suboptimal, so you should use Unicode, preferably UTF-8.

Even if you don't want to output in Unicode, your Perl script itself should at least be encoded in UTF-8. To signal this to the perl interpreter, put `use utf8;` at the top of your script.

Then open the input file with an encoding layer like this:

open my $fh, "<:encoding(UTF-8)", $filename or die "Can't open $filename: $!";

The same goes for the output file. Just make sure to specify an encoding whenever you want to use one.
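As a rough sketch of both layers working together, the snippet below reads a UTF-8 file and writes it back out as MacRoman. The output encoding is picked only for illustration, and the temporary directory and file names are hypothetical, created here just so the example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical scratch files, removed automatically at exit.
my $dir = tempdir(CLEANUP => 1);
my ($in_name, $out_name) = ("$dir/in.txt", "$dir/out.txt");

# Create a small UTF-8 sample file to read back.
open my $seed, '>:encoding(UTF-8)', $in_name or die "Can't write $in_name: $!";
print {$seed} "M\x{FC}nchen\n";    # U+00FC is u with diaeresis
close $seed or die "Can't close $in_name: $!";

# Decode on input, encode on output.
open my $in,  '<:encoding(UTF-8)',    $in_name  or die "Can't read $in_name: $!";
open my $out, '>:encoding(MacRoman)', $out_name or die "Can't write $out_name: $!";
while (my $line = <$in>) {
    print {$out} $line;            # characters in, MacRoman bytes out
}
close $in;
close $out or die "Can't close $out_name: $!";

# Read the result back as raw bytes to inspect what was written.
open my $raw, '<:raw', $out_name or die "Can't read $out_name: $!";
my $bytes = do { local $/; <$raw> };
close $raw;
printf "%vX\n", $bytes;            # byte values in hex, separated by dots
```

Inside the loop the program only ever sees characters; the byte representation is decided entirely by the layers given to open.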

You can change the encoding layer of an already opened filehandle with binmode; see its documentation for details.
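A minimal sketch of binmode on an already opened handle, assuming an in-memory filehandle so that no real file is needed (the variable names are illustrative only):

```perl
use strict;
use warnings;

# Write to an in-memory filehandle; $buffer collects the raw bytes.
my $buffer = '';
open my $fh, '>', \$buffer or die "Can't open in-memory handle: $!";

# Add the encoding layer after the handle has been opened.
binmode $fh, ':encoding(UTF-8)' or die "binmode failed: $!";

print {$fh} "\x{FC}";              # one character, U+00FC
close $fh or die "Can't close in-memory handle: $!";

printf "%vX\n", $buffer;           # the bytes that were written, in hex
```

The same call works on the standard handles, for example `binmode STDOUT, ':encoding(UTF-8)';`.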

You can also use the Encode module to translate a byte string into a string of Unicode characters and vice versa. See this excellent question for further information about using Unicode with Perl.
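For instance, a small sketch of a round trip with Encode, assuming the input bytes are UTF-8 and the target is MacRoman (both choices are only for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $utf8_bytes = "\xC3\xBC";                     # u with diaeresis as UTF-8 bytes
my $chars      = decode('UTF-8', $utf8_bytes);   # bytes -> characters
my $mac_bytes  = encode('MacRoman', $chars);     # characters -> MacRoman bytes

printf "characters: %d, code point: U+%04X\n", length($chars), ord($chars);
printf "MacRoman byte: 0x%02X\n", ord($mac_bytes);
```

Decoding as early as possible and encoding as late as possible keeps everything in between working on characters rather than bytes.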

If you want to, you can use the existing HTML::Entities module to handle the entity decoding and just focus on the I/O.
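A short sketch of that approach, assuming HTML::Entities is installed from CPAN (it is not a core module) and that MacRoman is the desired output encoding; the sample string is made up:

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTML::Entities qw(decode_entities);

my $markup = 'M&uuml;nchen';                  # made-up sample input
my $text   = decode_entities($markup);        # entities -> Unicode characters
my $bytes  = encode('MacRoman', $text);       # characters -> output bytes

printf "%vX\n", $bytes;                       # byte values in hex
```

decode_entities returns a character string, so the choice of output encoding stays a separate, explicit step.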

amon
  • Thanks for the tips on Unicode. I am familiar with these practices in Perl, and with HTML::Entities. But your comments don't answer the question. I am referring to the high ASCII 0x9F encoding; whether ASCII exists outside the US is beside the point. And yes, I am fully aware of the added benefits of Unicode. My question is how come the script file (.pl) substitutes these correctly, but if I write the same script into a subroutine, which is implemented in a .pm (with ISA qw(Exporter)), it doesn't return 0x9F, but the decomposition? Thanks – user1628415 Sep 03 '12 at 15:20
  • @user1628415 The character `0x9F` doesn't exist in most encodings, and I am unaware of a "high ASCII" encoding. The given code maps to `Ÿ` in Windows-1252, but I can't find any other reference. What specific encoding are you using? ASCII is only defined up to `0x7F`. – amon Sep 03 '12 at 15:33
  • The encoding is MacRoman, and, yes, 0x7F is the last code in the 0-127 range, but 0x9F is in the 128-255 range. Thanks for the help. – user1628415 Sep 03 '12 at 17:08
  • @user1628415 You can use the *MacRoman* encoding with the [`Encode::Byte`](http://search.cpan.org/~dankogai/Encode-2.47/Byte/Byte.pm) module. Although using Unicode is still preferable ;-) – amon Sep 03 '12 at 17:17