
I have a problem converting some encoded strings to UTF-8.

I have a list of strings which, according to the documentation, are Unicode strings encoded using numeric HTML entities. Some of them are:

$str = 'W&#xC3;&#x96;GER'; // seems to be WÖGER
$str = 'J&#xC3;&#xBC;rgen'; // seems to be Jürgen
$str = 'PO&#xC3;&#x9F;NITZ'; // seems to be POßNITZ
$str = 'SCHL&#xC3;&#x84;GER'; // seems to be SCHLÄGER

I want to decode them and convert them to UTF-8.

I tried both mb_convert_encoding() with the HTML-ENTITIES parameter and html_entity_decode(). Unexpectedly, my best result was with:

html_entity_decode($str, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1');

and that decoded Jürgen successfully. However, I had no luck decoding the other strings from this list. I looked at the ISO-8859-1 encoding table, and the HTML codes for umlauts there differ from what I have in my list.
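In hindsight (see the update below), what seems to happen here is that the ISO-8859-1 target turns each entity back into a single byte, so the output happens to already be a valid UTF-8 byte sequence. A minimal sketch, assuming the entities encode individual UTF-8 bytes:

$str = 'J&#xC3;&#xBC;rgen';
$decoded = html_entity_decode($str, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1');
// &#xC3; and &#xBC; become the raw bytes 0xC3 0xBC, i.e. the UTF-8 encoding of ü
var_dump($decoded, mb_check_encoding($decoded, 'UTF-8')); // string(7) "Jürgen", bool(true)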

My question is: am I missing some obvious decoding step, or is there something wrong with the source strings?

Update (2016-06-27): The original strings were indeed incorrectly encoded. They are the result of reading UTF-8 values in a Latin-1 context and then encoding the individual 1-byte chars as hex entities, so the German umlaut ü became ü and was then encoded as 2 separate chars. The accepted answer decodes them straight into UTF-8 successfully.
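For illustration, here is a small sketch of how such strings could be produced (the real producing code is unknown; this just reconstructs the scheme described above):

$name   = 'Jürgen';                        // proper UTF-8: bytes 4A C3 BC 72 67 65 6E
$broken = '';
foreach (str_split($name) as $byte) {      // str_split walks raw bytes, not characters
    $broken .= ord($byte) > 127
        ? sprintf('&#x%02X;', ord($byte))  // each non-ASCII byte becomes its own hex entity
        : $byte;
}
echo $broken;                              // J&#xC3;&#xBC;rgen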

Ruslan Bes
  • I thought you wanted *UTF-8* strings, not ISO-8859. Why are you decoding to ISO-8859? – deceze Jun 22 '16 at 15:04
  • Where do these strings come from, _really_? It looks like you read a file and then encode the content to HTML entities? @nj_ is right that "_unicode characters should be represented by their codepoint, and not by encoding individual UTF-8 bytes_", but I'm afraid he is wrong in trying to repair it his way. – JosefZ Jun 22 '16 at 15:25
  • @deceze I do want UTF-8; that's why I said it was unexpected for me to get a valid UTF-8 result (the eval.in page is in UTF-8) using a different encoding, and, even more unexpectedly, changing it to UTF-8 returns incorrect diacritics: [link to eval.in](https://eval.in/593768). – Ruslan Bes Jun 22 '16 at 20:16
  • @JosefZ they come from a third-party web service that puts them in HTTP request headers using this encoding scheme (numeric HTML entities). The idea is that HTTP headers cannot hold Unicode chars, so one has to encode them somehow, and this is the chosen scheme. The `ENT_XML1` flag does seem to work, though. – Ruslan Bes Jun 22 '16 at 20:25

1 Answer


My understanding is, though I might be wrong, that Unicode characters should be represented by their codepoint, and not by encoding individual UTF-8 bytes, which is what you have. So, Ö would be better represented using &#xD6; or, in the named form, &Ouml;.
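For instance (a hypothetical one-liner, just to illustrate the point), a codepoint-based reference decodes directly when the target charset is UTF-8:

echo html_entity_decode('W&#xD6;GER', ENT_QUOTES | ENT_HTML5, 'UTF-8'); // WÖGER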

The ENT_XML1 flag to html_entity_decode() does seem to make this work, though I'm not entirely sure what it does under the hood. If you want something more explicit:

// Every &#xNN; here encodes one UTF-8 byte, so turn each two-digit hex
// entity back into its raw byte; the result is a plain UTF-8 string.
$utf8 = preg_replace_callback('/&#x([A-Fa-f0-9]{2});/', function ($m) {
    return chr(hexdec($m[1]));
}, $str);
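And a sketch of the ENT_XML1 route mentioned above (my assumption: ISO-8859-1 has to stay the target charset so that each numeric reference collapses to a single byte; ENT_HTML401 appears to skip references to C1 control codepoints such as &#x96;, which is presumably why only Jürgen decoded for you):

$utf8 = html_entity_decode('W&#xC3;&#x96;GER', ENT_QUOTES | ENT_XML1, 'ISO-8859-1');
echo $utf8; // expected: WÖGER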
nj_
  • That's a very broken encoding the OP has there, an explicit `preg_replace` is likely the only real way to deal with it. – deceze Jun 22 '16 at 15:07
  • Both solutions work, thanks! Agreed; I'm checking now whether this is a mistake in the encoding process. – Ruslan Bes Jun 22 '16 at 20:31
  • @deceze I agree with that. I'm also currently checking whether this is an error in the source strings, though this solution works. – Ruslan Bes Jun 22 '16 at 20:32
  • Here is the [answer about ENT_XML1](https://stackoverflow.com/questions/13745353/what-do-the-ent-html5-ent-html401-modifiers-on-html-entity-decode-do) by the way. – Ruslan Bes Jun 22 '16 at 20:45
  • @deceze I updated the post explaining the origin of the strings. You were right about the broken encoding. – Ruslan Bes Jun 27 '16 at 14:21