I am changing the title because I was unaware of the broken special Windows characters that were causing my problems, which made the question look like a duplicate.

How to convert HTML entities, character references of type &#[0-9]+; and &#x[a-fA-F0-9]+;, invalid character references such as &#151;, and invalid Windows characters such as chr(151) to their UTF-8 equivalents?

Basically how to clean up some very bad text of variable encoding and save it as UTF-8?

original question below

Convert &#[0-9]+; and &#x[a-fA-F0-9]+; references to UTF-8 equivalents?

for example

&#151;
&#x97;

to —

like a browser does it, but with PHP.

edit: even the non-standard ones that Windows introduced but browsers still display.

Timo Huovinen
  • possible duplicate of [How can I convert HTML character references (&#x5E3;) to regular UTF-8?](http://stackoverflow.com/questions/3565713/how-can-i-convert-html-character-references-x5e3-to-regular-utf-8) – Gumbo Feb 09 '12 at 12:14
  • @Gumbo except the above question does not address &#[0-9]+; characters – Timo Huovinen Feb 09 '12 at 12:24
  • @Gumbo I don't know what to say, but it didn't work for me... it didn't change it at all. – Timo Huovinen Feb 09 '12 at 12:27
  • @Gumbo: Makes it related, not a dup. – Lightness Races in Orbit Feb 09 '12 at 12:27
  • You’re right, there was an error in it (needed to be `UTF-32BE` instead of `UTF-16BE`). – Gumbo Feb 09 '12 at 12:29
  • @LightnessRacesinOrbit So what? Should I post the same answer here again? – Gumbo Feb 09 '12 at 12:31
  • @Gumbo still does not work for `&#151;` or `&#x97;`, var dumping your $str after conversion returns `string(2) ""`. Also Gumbo, I made this question because I thought that a more complete solution was missing. – Timo Huovinen Feb 09 '12 at 12:35
  • @Gumbo: Maybe! But tailoring it for _this_ question. It's a bit of a hole in the close reasons IMO; the question is not a duplicate at all, though it does cover similar ground and an expansive answer such as yours can overlap several narrow questions. – Lightness Races in Orbit Feb 09 '12 at 12:36
  • @YuriKolovsky [U+0097](http://www.unicode.org/charts/PDF/U0080.pdf) is a non-visible character. It is encoded in UTF-8 with 0xC297. If you take the return value of `html_dereference` and turn it into a hexadecimal representation with `unpack('H*', $str)` you’ll get exactly that value. – Gumbo Feb 09 '12 at 12:48
  • @LightnessRacesinOrbit No, it is a duplicate. The other question asks for how to turn numeric character references into UTF-8. So does this. It doesn’t quite matter if the other question does only ask for hexadecimal representations and not for decimal representations only. – Gumbo Feb 09 '12 at 12:50
  • @Gumbo shall I make the question even more generic by renaming it to "turn numeric character references into UTF-8"? because there are `♥` `—` `U+0097` and all those other characters that should be turned into UTF-8 – Timo Huovinen Feb 09 '12 at 12:59
  • @YuriKolovsky No, it still would be a duplicate of the [other question](http://stackoverflow.com/q/3565713/53114). – Gumbo Feb 09 '12 at 13:00
  • @Gumbo I didn't mean "rename it so it's no longer a duplicate", but to have a clearer title and maybe handle other cases of numeric character conversion to unicode. – Timo Huovinen Feb 09 '12 at 13:01
  • @YuriKolovsky Then go and edit the other question if the `&#x5E3;` is too specific. – Gumbo Feb 09 '12 at 13:06
  • I think we're splitting hairs now. – Lightness Races in Orbit Feb 09 '12 at 13:12
  • @Gumbo: Check the links in my comments (particularly [this](http://www.cs.tut.fi/~jkorpela/www/windows-chars.html)). What do you think the best way to actually solve decoding these "special Windows character HTML entities" is? – thirtydot Feb 09 '12 at 13:16
  • @YuriKolovsky Sorry, I think I don’t get this. `&#151;` refers to the ISO 10646 (UCS) character U+0097 *APPLICATION PROGRAM COMMAND* that is a control character and thus not visible. In UTF-8, U+0097 is encoded with the byte sequence 0xC297. My solution does turn `&#151;` into the byte sequence 0xC297 too. So what’s the point with this would be not working? – Gumbo Feb 09 '12 at 13:22
  • @thirtydot Now I get what you’re talking about. Didn’t know that. And I’m quite astounded that browser vendors did adopt this weird behavior. In that case I wouldn’t consider this question a duplicate. – Gumbo Feb 09 '12 at 13:44

2 Answers

Answering my own question with the solution that I used in the end

The problem:

I needed to replace HTML entities and decimal and hexadecimal character references that looked like &#130;, &#x82; and &mdash; with their UTF-8 equivalents, like a normal browser would, and convert the text into UTF-8.

The problem was that there were often references in the range 130-159 (x82-x9F). As thirtydot found out, these are invalid references to Windows (Word) characters that people use in single-byte text for special characters like em dashes, and PHP's html_entity_decode does not decode them the way browsers do.

You would think that these invalid characters would not work in browsers, but it looks like browsers made a silent undocumented agreement to fix these characters and display them properly anyway.

While trying to fix these references I also found out that the actual characters like <?php echo chr(151);?> were being used as well. They were probably copied directly from Word and would cause all sorts of problems, so I needed them to be fixed too.

What most answers that I found regarding encodings fail to mention is that the solution to encoding related problems often largely depends on the encoding used. Here is an example:

The invalid Windows character chr(151) will work with "ISO-8859-1" encoded text, and Josh B mentions, as per Jukka Korpela's suggestion, that you should fix it like this:

$str = str_replace(chr(151),'--',$str);

This replaces the Windows character with a safe ASCII alternative, but knowing that the text would be stored in UTF-8, I did not want to lose the original characters. Changing them like this was not an option either, because chr() only produces a single byte and cannot return a Unicode character such as U+2014 (8212):

$str = str_replace(chr(151),chr(8212),$str);

So what I did instead was to first replace the character with its HTML character reference (while $str was still "ISO-8859-1" encoded):

$str = str_replace(chr(151),'&#8212;',$str);

Then I change the encoding

$str = iconv('ISO-8859-1', 'UTF-8//IGNORE', $str);//convert to UTF-8

And finally I turn all the entities and character references into pure UTF-8 with my "html_character_reference_decode" function. It is largely based on Gumbo's solution and also fixes the bad Windows references, but it uses preg_replace_callback only for the bad Windows references and leaves everything else to html_entity_decode.

//Map a numeric reference in the 130-159 (x82-x9F) range to the reference
//of the character that browsers display for it.
function fix_char_mapping($match){
    if (strtolower($match[1][0]) === "x") {
        $codepoint = intval(substr($match[1], 1), 16);//hexadecimal reference
    } else {
        $codepoint = intval($match[1], 10);//decimal reference
    }
    //index 0 is code 130; 141-144, 157 and 158 map to themselves
    $mapping = array(8218,402,8222,8230,8224,8225,710,8240,352,8249,338,141,142,143,144,8216,8217,8220,8221,8226,8211,8212,732,8482,353,8250,339,157,158,376);
    $codepoint = $mapping[$codepoint-130];
    return '&#'.$codepoint.';';
}
function html_character_reference_decode($string, $encoding='UTF-8', $fixMappingBug=true){
    if($fixMappingBug){
        //only references in the 130-159 / x82-x9F range reach the callback
        $string = preg_replace_callback('/&#(1[3-5][0-9]|x8[2-9a-f]|x9[0-9a-f]);/i','fix_char_mapping',$string);
    }
    return html_entity_decode($string, ENT_QUOTES, $encoding);
}
header('Content-Type: text/plain; charset=UTF-8');
echo html_character_reference_decode('dash &#151; and another dash &#x97; text &#x5D5; and more tests &#x5E0;&#x5D5;&#x5E3; ');

So if your text is "ISO-8859-1" encoded, the complete solution looks like this:

<?php
//requires html_character_reference_decode() and fix_char_mapping() from above
header('Content-Type: text/plain; charset=utf-8');
ini_set("default_charset", 'utf-8');
error_reporting(-1);
$encoding = 'ISO-8859-1';//put encoding here
$str = '&#x9F; &#x9C; bad&#150;string: '.chr(151);//single-byte text with a raw Windows em dash
if($encoding==='ISO-8859-1'){
    //fix bad windows characters: raw byte => numeric reference
    $badchars = array(
        '&#130;'=>chr(130),//',' baseline single quote
        '&#131;'=>chr(131),//'NLG' florin
        '&#132;'=>chr(132),//'"' baseline double quote
        '&#133;'=>chr(133),//'...' ellipsis
        '&#134;'=>chr(134),//'**' dagger (a second footnote)
        '&#135;'=>chr(135),//'***' double dagger (a third footnote)
        '&#136;'=>chr(136),//'^' circumflex accent
        '&#137;'=>chr(137),//'o/oo' permile
        '&#138;'=>chr(138),//'Sh' S Hacek
        '&#139;'=>chr(139),//'<' left single guillemet
        '&#140;'=>chr(140),//'OE' OE ligature
        '&#145;'=>chr(145),//"'" left single quote
        '&#146;'=>chr(146),//"'" right single quote
        '&#147;'=>chr(147),//'"' left double quote
        '&#148;'=>chr(148),//'"' right double quote
        '&#149;'=>chr(149),//'-' bullet
        '&#150;'=>chr(150),//'-' endash
        '&#151;'=>chr(151),//'--' emdash
        '&#152;'=>chr(152),//'~' tilde accent
        '&#153;'=>chr(153),//'(TM)' trademark ligature
        '&#154;'=>chr(154),//'sh' s Hacek
        '&#155;'=>chr(155),//'>' right single guillemet
        '&#156;'=>chr(156),//'oe' oe ligature
        '&#159;'=>chr(159),//'Y' Y Dieresis
    );
    $str = str_replace(array_values($badchars),array_keys($badchars),$str);
    $str = iconv('ISO-8859-1', 'UTF-8//IGNORE', $str);//convert to UTF-8
    $str = html_character_reference_decode($str);//fixes bad entities above
    echo $str;
}

It was tested with a wide range of situations and looks like it works.
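As an aside, when the raw bytes really come from Windows code page 1252 rather than true ISO-8859-1, naming that encoding as the source may save the str_replace step for the raw bytes. This is a minimal sketch, not the method used above, and it assumes the iconv build on your platform knows the name 'Windows-1252' and that the input contains no bytes that are undefined in that code page. It does nothing for numeric references like &#151;, which still need the mapping.

```php
<?php
// Sketch: the same raw byte converted from two different source encodings.
$raw = 'dash ' . chr(151) . ' end';

$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $raw);   // 0x97 becomes U+0097 (control character)
$asCp1252 = iconv('Windows-1252', 'UTF-8', $raw); // 0x97 becomes U+2014 (em dash)

// 'dash ' is 5 bytes long, so the converted character starts at offset 5.
echo bin2hex(substr($asLatin1, 5, 2)), "\n"; // c297
echo bin2hex(substr($asCp1252, 5, 3)), "\n"; // e28094
```

The first result is the invisible control character Gumbo describes in the comments; the second is the em dash a browser would show.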

Let's look at the same situation with UTF-8 encoded text that contains bad Windows characters.

One reliable way to test for the presence of bad characters or "badly formed UTF-8" was to use iconv. It is slow, but it was more reliable than preg_match in my tests:

$cleaned = iconv('UTF-8','UTF-8//IGNORE',$str);
if ($cleaned!==$str){
    //contains bad characters, use cleaned version where the bad characters were stripped
    $str = $cleaned;
}
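If you only need to know whether the string is well formed, and not a stripped copy of it, a check along these lines may also be worth trying. It is a sketch that assumes the mbstring extension is loaded; I have not compared its speed or reliability with the iconv approach above.

```php
<?php
// Sketch: detect malformed UTF-8 without producing a converted copy.
$good = "em dash \xE2\x80\x94";   // well-formed: U+2014 as three bytes
$bad  = "em dash " . chr(151);    // a lone 0x97 byte is not valid UTF-8

var_dump(mb_check_encoding($good, 'UTF-8')); // bool(true)
var_dump(mb_check_encoding($bad, 'UTF-8'));  // bool(false)
```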

This was pretty much the best I could think of, as I found no reasonable way to find and replace the bad Windows characters in UTF-8 text. Let me explain why.

Let's take a string with a perfectly valid Unicode em dash followed by a bad Windows em dash: $str = "—".chr(151);

I don't know which bad Windows characters might be present in the UTF-8 string, only that they might be present.

Using str_replace to try to fix the bad Windows character chr(148) (right double quote) in the above string, which does not even contain any double quotes, results in a scrambled character. At first I thought that str_replace might not be multibyte safe and tried mb_eregi_replace, but the problem was the same.

The comments on the php website and stackoverflow mention that str_replace is binary safe, and works fine with well formed UTF-8 text, because of the way that UTF-8 was designed.

Why it breaks

It turns out that the bad Windows character chr(148) is made up of the bits "10010100", while the [em dash character](http://www.fileformat.info/info/unicode/char/2014/index.htm), according to the fileformat website, is made up of 3 bytes: "11100010:10000000:10010100".

Notice that the last byte of the perfectly valid UTF-8 character matches the bad Windows right double quote, so str_replace replaces just that last byte, breaking the UTF-8 character. This problem happens with lots of Unicode characters, and would scramble lots of characters in Russian text, for example.

This can't happen with single-byte encodings such as ISO-8859-1 because each character is always made up of a single byte.
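The byte collision described above can be reproduced in a few lines. This is only an illustration; the replacement character '"' is an arbitrary choice and mb_check_encoding assumes the mbstring extension.

```php
<?php
// Sketch: a byte-level replace damages a valid multi-byte character.
$emDash = "\xE2\x80\x94";          // U+2014 EM DASH encoded as UTF-8
echo bin2hex($emDash), "\n";       // e28094
echo bin2hex(chr(148)), "\n";      // 94

// The string holds no right double quote, yet its last byte equals chr(148).
$broken = str_replace(chr(148), '"', $emDash);
echo bin2hex($broken), "\n";       // e28022

var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
```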

So when you get a UTF-8 string that contains any number of multibyte characters, you can no longer safely fix the bad Windows characters with a plain byte replacement, and the only solution I found was to strip them with iconv:

$str = iconv('UTF-8', 'UTF-8//IGNORE', $str);

The only solution that I can think of

You can, however, replace the valid Unicode characters that contain a byte matching one of the bad characters with their character references, then replace the bad characters, and then decode the good characters again, thus keeping everything :)

like this:

  1. replace 11100010:10000000:10010100 with its character reference &#8212;
  2. then replace the stray byte 10010100, chr(148), with the reference of the proper right double quote, &#8221;
  3. then decode the references back to UTF-8, so &#8212; becomes 11100010:10000000:10010100 again

But you have to write down every multibyte character that contains bytes that match the bad characters to achieve this.
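A different technique that avoids writing down every character is to let a regular expression consume each well-formed multi-byte sequence as one unit and only rewrite high bytes that are left over. The following is a sketch of that idea, not the steps above and not something I used in production. repair_stray_bytes() is a hypothetical helper name, mb_chr() assumes PHP 7.2+ with mbstring, and stray bytes outside 130-159 are simply treated as ISO-8859-1. Stray bytes that happen to form a valid UTF-8 sequence together cannot be told apart from real characters and are kept as they are.

```php
<?php
// Sketch: repair stray Windows bytes inside otherwise valid UTF-8.
// repair_stray_bytes() is a hypothetical helper, not a PHP built-in.
function repair_stray_bytes(string $s): string
{
    // Code points for bytes 130..159, same table as in fix_char_mapping().
    $map = [8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249, 338,
            141, 142, 143, 144, 8216, 8217, 8220, 8221, 8226, 8211, 8212,
            732, 8482, 353, 8250, 339, 157, 158, 376];

    // First seven alternatives: one well-formed multi-byte UTF-8 sequence.
    // Last alternative: a single high byte that is part of no such sequence.
    $pattern = '/[\xC2-\xDF][\x80-\xBF]
        |\xE0[\xA0-\xBF][\x80-\xBF]
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        |\xED[\x80-\x9F][\x80-\xBF]
        |\xF0[\x90-\xBF][\x80-\xBF]{2}
        |[\xF1-\xF3][\x80-\xBF]{3}
        |\xF4[\x80-\x8F][\x80-\xBF]{2}
        |[\x80-\xFF]/x';

    return preg_replace_callback($pattern, function (array $m) use ($map): string {
        if (strlen($m[0]) > 1) {
            return $m[0];              // valid sequence: keep untouched
        }
        $byte = ord($m[0]);            // stray single byte
        $cp = ($byte >= 130 && $byte <= 159) ? $map[$byte - 130] : $byte;
        return mb_chr($cp, 'UTF-8');   // encode the code point as UTF-8
    }, $s);
}

// A real em dash followed by a stray right double quote and a stray em dash.
$mixed = "\xE2\x80\x94" . chr(148) . chr(151);
echo bin2hex(repair_stray_bytes($mixed)), "\n"; // e28094e2809de28094
```

The valid em dash survives because its last byte is consumed as part of the three-byte match and never reaches the single-byte alternative.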

Related: What is the difference between EM Dash #151; and #8212;?

Timo Huovinen

This is much more complicated than I thought it was when I wrote my answer.

Gumbo has updated his answer to a very similar question, so just read that:

How can I convert HTML character references (&#x5E3;) to regular UTF-8?

thirtydot
  • html_entity_decode(): only works for characters in get_html_translation_table() which is a very small list of references and characters. – Timo Huovinen Feb 09 '12 at 12:28
  • No, it also works for numeric entities. Take a look at the PHP source. [This](http://svn.php.net/viewvc/php/php-src/branches/PHP_5_3/ext/standard/html.c?view=markup#l1382) leads to [this](http://svn.php.net/viewvc/php/php-src/branches/PHP_5_3/ext/standard/html.c?view=markup#l916) which contains [this](http://svn.php.net/viewvc/php/php-src/branches/PHP_5_3/ext/standard/html.c?view=markup#l1007). I'm not sure why it doesn't work for `&#151;` or `&#x97;`, but it *does* work for `&#8212;`. All three of those mean `—`. None of the solutions I've seen linked manage to decode `&#151;` or `&#x97;`. – thirtydot Feb 09 '12 at 12:50
  • @thirtydot none of them do, which is exactly why I posted this question. – Timo Huovinen Feb 09 '12 at 13:03
  • I'm reading this: http://www.cs.tut.fi/~jkorpela/www/windows-chars.html. It's starting to make sense. PHP is evidently not handling this weirdness in the same way browsers do. – thirtydot Feb 09 '12 at 13:07
  • thanks for the link, it does explain why PHP did not handle it, but my problem is that I still need it handled :S – Timo Huovinen Feb 09 '12 at 13:11
  • @thirtydot Found a list of conversions here by searching "Jukka Korpela" http://php.net/manual/fr/function.chr.php search – Timo Huovinen Feb 09 '12 at 13:30
  • @thirtydot - The character override mapping is formally specified in the HTML5 spec here: http://dev.w3.org/html5/spec/tokenization.html#table-charref-overrides – Alohci Feb 09 '12 at 14:22
  • I tried using Gumbos solution again, but still does not work, just converts the `—` into `?`, then I noticed your comment about the $encoding, which I fixed in his example, and now his function returns `—`, which is still not the — (em dash) that I'm looking for. – Timo Huovinen Feb 09 '12 at 17:52
  • **edit** Never mind, the problem disappeared on its own, must have been some caching issue. Works now! – Timo Huovinen Feb 09 '12 at 18:01
  • Gumbo just fixes the mapping using `$mapping = array(8218,402,8222,8230,8224,8225,710,8240,352,8249,338,141,142,143,144,8216,8217,8220,8221,8226,8211,8212,732,8482,353,8250,339,157,158,376); $codepoint = $mapping[$codepoint-130];`, which is sufficient to get things working. – Timo Huovinen Feb 09 '12 at 18:10