I have an HTML string in ISO-8859-1 encoding. I need to pass this string to HTML::Entities::decode_entities() to convert some of the HTML entities to their respective characters. To do so I am using HTML::Entities from HTML::Parser 3.65, but after the decode_entities() call my whole string becomes a UTF-8 string. This behavior seems consistent with the HTML::Parser documentation. Since I need the string back in ISO-8859-1 for further processing, I use Encode::encode("iso-8859-1",$str) to convert it back. The results are fine except for some characters, for which a question mark appears instead. One example is the entity for the curly single quote (’).

Can anybody tell me whether this is a limitation of the Encode module? Any other pointers that help solve the problem are also welcome. Here is a sample text containing the characters that cause the issue:

my $str = "This is a test string to test the encoding of some chars like ’ “ ” etc these are failing to encode; some of them which encode correctly are é « etc.";

Thanks

Zaid
ppant
  • Can you include the complete HTML string (in ISO-8859-1)? – Susheel Javadi Jun 03 '10 at 06:08
  • added the test string in question – ppant Jun 03 '10 at 06:49
  • Can you explain why you need the string back in ISO-8859-1 format? Outputting it as UTF-8 would seem the simplest solution - especially if it's going to a web browser. – Grant McLean Jun 04 '10 at 00:39
  • I know it makes sense to send everything in UTF-8, but my web application is designed for ISO-8859-1 (Apache settings, I/O, HTML templates, etc.), so outputting one specific page in UTF-8 creates problems elsewhere. I am working in parallel on making the various parts of my application UTF-8 compatible, but this needs to be solved now for specific reasons. – ppant Jun 04 '10 at 03:36

2 Answers

There's a third argument to encode, which controls the checking it does. The default is to use a substitution character, but you can set it to FB_CROAK to get an error message.
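As a minimal sketch of the difference between the default check and FB_CROAK, assuming a hypothetical sample string that contains U+2019 (the variable names are illustrative, and LEAVE_SRC is added only so that the source string is not modified in place):

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample: a character string containing U+2019,
# which has no ISO-8859-1 equivalent.
my $str = "quote: \x{2019}";

# Default check: unmappable characters are replaced by a substitution character.
my $lossy = Encode::encode('iso-8859-1', $str);

# FB_CROAK makes encode() die instead; LEAVE_SRC keeps $str untouched.
my $strict = eval {
    Encode::encode('iso-8859-1', $str,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $error = $@;

print "lossy: $lossy\n";
print "error: $error" unless defined $strict;
```

Wrapping the call in eval lets the program report which character failed to map rather than silently emitting a question mark.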

Snake Plissken
  • Thanks for the suggestion. I tried it and got the following error message: "\x{2019}" does not map to iso-8859-1 at /usr/lib/perl5/5.8.5/i386-linux-thread-multi/Encode.pm line 158. – ppant Jun 03 '10 at 06:59

The fundamental problem is that the characters represented by ’, “, and ” do not exist in ISO-8859-1. You'll have to decide what it is that you want to do with them.

Some possibilities:

Use cp1252, Microsoft's "extended" version of ISO-8859-1, instead of the real thing. It does include those characters.

Re-encode the characters outside the ISO-8859-1 range (plus &) as entities before converting from UTF-8 to ISO-8859-1:

my $toEncode = do { no warnings 'utf8'; "&\x{0100}-\x{10FFFF}" };
$string = HTML::Entities::encode_entities($string, $toEncode);

(The no warnings bit is there because U+10FFFF is a Unicode noncharacter, which Perl may warn about.)
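Both possibilities above can be sketched end to end. This is only an illustration: it assumes HTML::Entities is installed, and the input string and variable names are hypothetical stand-ins for the already-decoded text.

```perl
use strict;
use warnings;
use Encode ();
use HTML::Entities ();

# Hypothetical decoded input: three curly quotes (outside ISO-8859-1)
# followed by e-acute (inside ISO-8859-1).
my $str = "\x{2019} \x{201C} \x{201D} \x{E9}";

# Option 1: encode to cp1252, which assigns byte values to the curly quotes.
my $cp1252 = Encode::encode('cp1252', $str);

# Option 2: turn & and everything above U+00FF back into entities,
# then encode the remaining characters as ISO-8859-1.
my $unsafe  = do { no warnings 'utf8'; "&\x{0100}-\x{10FFFF}" };
my $escaped = HTML::Entities::encode_entities($str, $unsafe);
my $latin1  = Encode::encode('iso-8859-1', $escaped);

printf "cp1252 bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $cp1252;
print  "escaped:      $latin1\n";
```

With option 1 the bytes are only correct if the consumer treats them as windows-1252. With option 2 the output stays valid ISO-8859-1 because the problematic characters travel as entities.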

There are other possibilities. It really depends on what you're trying to accomplish.

cjm
  • cp1252 works fine and replaces the specified characters properly, but after some testing I found that the same fix doesn't work on a different machine/server. I know this is not directly related to the question, but do you have any idea why the conversion fails on some machines, or what other factors (Apache settings etc.) might be involved? – ppant Jun 04 '10 at 08:04
  • @ppant, you'll have to be more specific about what you mean by "doesn't work". Are you serving it as `charset=windows-1252`? – cjm Jun 04 '10 at 08:28
  • If I encode the string given in the question, e.g. my $EnStr = encode("cp1252",$str); then it doesn't show the proper characters on some machines. I was asking whether any other factor might affect the encoding. – ppant Jun 04 '10 at 08:54
  • When you deliver the HTML to the browser, what charset are you claiming it is? If you call it ISO-8859-1, not all browsers will handle it properly, since cp1252 is not ISO-8859-1. The official name for cp1252 is windows-1252, and you should identify it as that. – cjm Jun 04 '10 at 09:40
  • I am targeting ISO-8859-1 in browsers. I understand your point, but my issue is no longer with the browser: even a test program that uses the Encode module with Windows-1252 fails to give correct results on some machines. I am trying to find out whether some other machine-dependent or module-dependent factor is involved. – ppant Jun 04 '10 at 11:00
  • I don't know. You should probably ask another question. Include the versions of Perl, HTML::Entities, and Encode, along with the input string and output string and the exact code you're using. – cjm Jun 04 '10 at 18:29