I have a problem converting some encoded strings to UTF-8.
I have a list of strings which, according to the documentation, are Unicode strings encoded using numeric HTML entities. Some of them are:
$str = 'W&#xC3;&#x96;GER'; // seems to be WÖGER
$str = 'J&#xC3;&#xBC;rgen'; // seems to be Jürgen
$str = 'PO&#xC3;&#x9F;NITZ'; // seems to be POßNITZ
$str = 'SCHL&#xC3;&#x84;GER'; // seems to be SCHLÄGER
I want to decode them and convert them to UTF-8.
I tried both mb_convert_encoding() with the HTML-ENTITIES parameter and html_entity_decode(). Unexpectedly, my best result was with:
html_entity_decode($str, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1');
which decoded Jürgen successfully. However, I had no luck decoding the other strings in the list. I looked at the ISO-8859-1 encoding table, and the HTML codes for umlauts there differ from the ones in my list.
My question is: am I missing some obvious decoding step or is there something wrong with the source strings?
Update (2016-06-27): The original strings were indeed incorrectly encoded. They are the result of reading UTF-8 values in a Latin-1 context and then encoding each individual 1-byte character as a hex entity, so the German umlaut ü became Ã¼ and was encoded as two separate characters. The accepted answer decodes them straight into UTF-8 successfully.
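For anyone hitting the same thing, here is a minimal sketch of the mechanism described in the update. It is not necessarily what the accepted answer does. Both function names are hypothetical helpers made up for illustration, and the sketch assumes every broken character shows up as two-digit hex entities (one per byte) and that the mbstring extension is available.

```php
<?php
// Hypothetical helper: reproduces the mangling described in the update.
// Each non-ASCII byte of a UTF-8 string is emitted as its own hex entity.
function mangleUtf8AsByteEntities(string $utf8): string
{
    $out = '';
    $len = strlen($utf8);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($utf8[$i]);
        // ASCII bytes pass through unchanged; bytes >= 0x80 become entities.
        $out .= $byte < 0x80 ? chr($byte) : sprintf('&#x%02X;', $byte);
    }
    return $out;
}

// Hypothetical helper: reverses the mangling by turning each two-digit hex
// entity back into a raw byte, then checking that the result is valid UTF-8.
function byteEntitiesToUtf8(string $s): string
{
    $bytes = preg_replace_callback(
        '/&#x([0-9A-Fa-f]{2});/',
        static function (array $m): string {
            return chr(hexdec($m[1]));
        },
        $s
    );
    if ($bytes === null) {
        return $s; // regex failure: leave the input untouched
    }
    // If the bytes do not form valid UTF-8, the input was probably not
    // mangled this way, so return it unchanged rather than emit garbage.
    return mb_check_encoding($bytes, 'UTF-8') ? $bytes : $s;
}

// Example usage with one of the strings from the question.
echo byteEntitiesToUtf8('J&#xC3;&#xBC;rgen'), PHP_EOL;
```

The point of going through raw bytes is that the entities never represented code points in the first place: each one is a single byte of the original UTF-8 sequence, so decoding them as characters in any charset gives the wrong result for some inputs.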