35

I'm doing some data cleansing on messy data that is being imported into MySQL.

The data contains 'pseudo' Unicode characters, which are embedded in the strings as literal text such as 'u00e9'.

So one field might be 'Jalostotitlu00e1n'. I need to rip out that clumsy 'u00e1' and replace it with the corresponding UTF-8 character.

I could do this in MySQL, maybe using SUBSTRING and CHAR, but I'm preprocessing the data via PHP, so I could do it there as well.

I already know how to configure MySQL and PHP to work with UTF-8 data. The problem is really just in the source data I'm importing.

Thanks

dreftymac
carpii
  • 3
    There is no such thing as "a UTF-8 character". Perhaps you meant "the UTF-8 encoding of the Unicode character with that codepoint". – Ignacio Vazquez-Abrams Aug 15 '11 at 03:12
  • @Ignacio Indeed, but I would define a "UTF-8 character" as "a sequence of one, two, three or four bytes that encode a Unicode character". Would that be a valid definition? – deceze Aug 15 '11 at 03:18
  • 2
    @deceze: Technically that's called a "UTF-8 sequence". – Ignacio Vazquez-Abrams Aug 15 '11 at 03:25
  • As this question is coming up on searches related to `'\u00xx' in string problems`: Please make sure to read about [php's multibyte string functions](https://www.php.net/manual/en/ref.mbstring.php) - functions like strpos are not multibyte safe, therefore mb_strpos etc exist! – til Jun 02 '23 at 15:36

4 Answers

34

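/* PHP function that replaces literal \u00XX escape text with the matching accented character */

// Note: PHP only interprets "\u{...}" (with braces), so each "\u00c0"-style key below
// is the literal six-character text: backslash, 'u', four hex digits.
function Utf8_ansi($valor = '') {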

    $utf8_ansi2 = array(
    "\u00c0" =>"À",
    "\u00c1" =>"Á",
    "\u00c2" =>"Â",
    "\u00c3" =>"Ã",
    "\u00c4" =>"Ä",
    "\u00c5" =>"Å",
    "\u00c6" =>"Æ",
    "\u00c7" =>"Ç",
    "\u00c8" =>"È",
    "\u00c9" =>"É",
    "\u00ca" =>"Ê",
    "\u00cb" =>"Ë",
    "\u00cc" =>"Ì",
    "\u00cd" =>"Í",
    "\u00ce" =>"Î",
    "\u00cf" =>"Ï",
    "\u00d1" =>"Ñ",
    "\u00d2" =>"Ò",
    "\u00d3" =>"Ó",
    "\u00d4" =>"Ô",
    "\u00d5" =>"Õ",
    "\u00d6" =>"Ö",
    "\u00d8" =>"Ø",
    "\u00d9" =>"Ù",
    "\u00da" =>"Ú",
    "\u00db" =>"Û",
    "\u00dc" =>"Ü",
    "\u00dd" =>"Ý",
    "\u00df" =>"ß",
    "\u00e0" =>"à",
    "\u00e1" =>"á",
    "\u00e2" =>"â",
    "\u00e3" =>"ã",
    "\u00e4" =>"ä",
    "\u00e5" =>"å",
    "\u00e6" =>"æ",
    "\u00e7" =>"ç",
    "\u00e8" =>"è",
    "\u00e9" =>"é",
    "\u00ea" =>"ê",
    "\u00eb" =>"ë",
    "\u00ec" =>"ì",
    "\u00ed" =>"í",
    "\u00ee" =>"î",
    "\u00ef" =>"ï",
    "\u00f0" =>"ð",
    "\u00f1" =>"ñ",
    "\u00f2" =>"ò",
    "\u00f3" =>"ó",
    "\u00f4" =>"ô",
    "\u00f5" =>"õ",
    "\u00f6" =>"ö",
    "\u00f8" =>"ø",
    "\u00f9" =>"ù",
    "\u00fa" =>"ú",
    "\u00fb" =>"û",
    "\u00fc" =>"ü",
    "\u00fd" =>"ý",
    "\u00ff" =>"ÿ");

    return strtr($valor, $utf8_ansi2);      

}
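A fixed lookup table only covers the code points it lists. As a more general sketch, assuming the escapes in the input keep their leading backslash (`\u00e9`) and that the mbstring extension is available (`mb_chr` needs PHP 7.2 or later), every four-digit escape can be decoded in one pass. The function name `decodeUnicodeEscapes` is made up for this illustration and is not part of the answer above:

```php
<?php
/**
 * Decode literal \uXXXX escapes (backslash, 'u', four hex digits) to UTF-8.
 * Hypothetical helper. Surrogate halves (\ud800 to \udfff) are left as-is,
 * so characters outside the Basic Multilingual Plane are NOT reassembled.
 */
function decodeUnicodeEscapes(string $text): string
{
    $result = preg_replace_callback(
        // Four backslashes in a single-quoted PHP string reach PCRE as \\,
        // which matches exactly one literal backslash.
        '/\\\\u([0-9a-fA-F]{4})/',
        static function (array $m): string {
            $codePoint = hexdec($m[1]);
            // A lone surrogate is not a valid character: keep the escape text.
            if ($codePoint >= 0xD800 && $codePoint <= 0xDFFF) {
                return $m[0];
            }
            $char = mb_chr($codePoint, 'UTF-8');
            return $char === false ? $m[0] : $char;
        },
        $text
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

echo decodeUnicodeEscapes('Jalostotitl\u00e1n'), PHP_EOL; // Jalostotitlán
```

Because the pattern requires the backslash, ordinary words such as "persuaded" are left alone. Data that has already lost the backslash, as in the question, needs a looser and riskier pattern.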
28

There's a way: replace every uXXXX with its HTML numeric entity and run the result through html_entity_decode().

I.e. echo html_entity_decode("Jalostotitl&#x00e1;n");

Every character written in the form u1234 can be expressed in HTML as &#x1234;. But doing the replacement reliably is quite hard, because there can be many false positives when no other character marks the beginning of an escape sequence. A simple regex could be

preg_replace('/u([\da-fA-F]{4})/', '&#x\1;', $str)
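To make the replace-then-decode idea concrete, here is a small sketch. It assumes, purely for illustration, that the damaged data only contains characters from the Latin-1 range, so the pattern is tightened to `u00XX`. That does not eliminate false positives, but ordinary words rarely contain `u00`. The function name `fixPseudoUnicode` is hypothetical.

```php
<?php
/**
 * Replace backslash-less u00XX escape text with the real UTF-8 character.
 * Assumption: only code points U+0000 to U+00FF occur in the damaged data.
 */
function fixPseudoUnicode(string $str): string
{
    // Step 1: turn u00XX into the numeric entity &#xXX;
    $entities = preg_replace('/u00([0-9a-fA-F]{2})/', '&#x$1;', $str);

    // Step 2: decode entities to UTF-8. Note that this also decodes any
    // entities that were already in the data (for example &amp;).
    return html_entity_decode($entities ?? $str, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo fixPseudoUnicode('Jalostotitlu00e1n'), PHP_EOL; // Jalostotitlán
echo fixPseudoUnicode('persuaded'), PHP_EOL;         // persuaded
```

The narrower pattern is a trade-off: it leaves words like "persuaded" untouched, but it would also miss any escape outside the `00XX` range.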

rabudde
  • Thanks, nice simple solution which I hadn't thought of. I think it will be safe to use this, because the data I am trying to fix should not have any numerics in it. The only reason they do is because of the messed up UTF, so these should be easy to identify – carpii Aug 16 '11 at 11:25
  • 1
    Be careful! You can't reliably recover from a data mangling as bad as this unless your data are really restricted. Taking any u-hex-hex-hex sequence as a mangled Unicode escape would, for example, turn the word “persuaded” into “pers귭”... – bobince Aug 16 '11 at 20:30
  • @bobince correct, that's why I wrote it's not quite easy, because of "false positives". So important to have an identifier. – rabudde Aug 17 '11 at 04:59
  • My twitter timeline script return the special character like é into \u00e9 so I can use the backslash as an identifier then right? – Theo Nov 24 '13 at 21:28
  • 3
    yes, it's much better, than having no identifier. so you would modify the regex to `preg_replace('/\\u([\da-fA-F]{4})/', '&#x\1;', $str)` (notice that the backslash is escaped) – rabudde Nov 25 '13 at 06:00
  • just like in Tina Turner song you are simply the best :) – Oleg Popov Mar 05 '15 at 15:42
  • Actually you have to escape the \ twice `preg_replace('/\\\\u([\da-fA-F]{4})/', '&#x\1;', $str)` to make it work – oniramarf Mar 01 '22 at 09:54
  • For me, this doesn't work. The `é` becomes `\u00e9` in my database, when I run `html_entity_decode()` on it, nothing happens. When I replace `\u` with `&#x` as suggested, and run it through `html_entity_decode()` I get this character: `ຜ` - I'll keep looking for a solution for this :( – AutoBaker Mar 06 '23 at 16:32
3

My Twitter timeline script returns special characters like é as \u00e9, so I stripped the backslash and used @rabudde's preg_replace.

// Convert uXXXX escape text to HTML numeric entities
$text = "De #Haarstichting is h\u00e9t medium voor alles";
$str = str_replace('\u', 'u', $text);
$str_replaced = preg_replace('/u([\da-fA-F]{4})/', '&#x\1;', $str);

echo $str_replaced;

It works for me. It turns: De #Haarstichting is h\u00e9t medium voor alles into: De #Haarstichting is hét medium voor alles
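When the backslash is still present, it can serve as the marker for the escape instead of being stripped. The following is a minimal sketch of that variant, using the doubly escaped pattern and decoding the entities in PHP instead of leaving that to the browser. The function name `decodeBackslashU` is an assumption made for this example.

```php
<?php
/**
 * Convert \uXXXX escape text to UTF-8 via HTML numeric entities.
 * Hypothetical helper; the backslash is kept as the marker for an escape.
 */
function decodeBackslashU(string $text): string
{
    // '\\\\' in a single-quoted PHP string is two backslashes, which PCRE
    // reads as one literal backslash. A single-escaped '\\u' fails to compile.
    $entities = preg_replace('/\\\\u([\da-fA-F]{4})/', '&#x$1;', $text);

    return html_entity_decode($entities ?? $text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$text = 'De #Haarstichting is h\u00e9t medium voor alles';
echo decodeBackslashU($text), PHP_EOL; // De #Haarstichting is hét medium voor alles
```

Since `html_entity_decode()` also decodes entities that were already in the text, this sketch is best applied before any HTML escaping of the output.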

sampoh
Theo
  • 7
    no! don't strip the backslash from `\u`, because it could be used as identifier. use a modified regex `preg_replace('/\\u([\da-fA-F]{4})/', '&#x\1;', $str)` instead – rabudde Nov 25 '13 at 06:02
  • Right, that's what I need. Of course my stripping is wrong, it strips the only identifier I had. Thank you @rabbude I am testing this tonight and will update this answer with your preg_replace. – Theo Nov 25 '13 at 13:29
  • 1
    Right @rabbude, now I remember why I didn't use the \\u myself: `Warning: preg_replace() [function.preg-replace]: Compilation failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 1` – Theo Nov 25 '13 at 22:08
  • 7
    Sorry, this could be my fault, try to double escape it: `preg_replace('/\\\\u([\da-fA-F]{4})/', '&#x\1;', $str)` – rabudde Nov 26 '13 at 18:43
0

Although it's late to respond after so many years, for the next time I need it, I'll remember that this function worked nicely for me:

mb_convert_encoding($string, 'ISO-8859-1', 'UTF-8');

That's exactly the same as utf8_decode, but that one is deprecated as of PHP 8.2.0.
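As a quick illustration of what this call does, the sketch below converts a string that already contains a real UTF-8 `é` and compares byte lengths. Note the assumption it makes visible: the conversion only re-encodes real characters and leaves literal escape text such as `\u00e9` untouched. The variable names are illustrative only.

```php
<?php
// "\u{00e9}" (with braces, PHP 7+) is a real é, stored as two bytes in UTF-8.
$utf8   = "caf\u{00e9}";
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo strlen($utf8), ' bytes as UTF-8', PHP_EOL;        // 5 bytes as UTF-8
echo strlen($latin1), ' bytes as ISO-8859-1', PHP_EOL; // 4 bytes as ISO-8859-1
echo bin2hex($latin1), PHP_EOL;                        // 636166e9

// Literal escape text is plain ASCII, so it passes through unchanged.
$escaped = 'caf\u00e9';
var_dump(mb_convert_encoding($escaped, 'ISO-8859-1', 'UTF-8') === $escaped); // bool(true)
```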

Vladan