331

I'm reading out lots of texts from various RSS feeds and inserting them into my database.

Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO 8859-1.

Unfortunately, there are sometimes problems with the encodings of the texts. Example:

  1. The "ß" in "Fußball" should look like this in my database: "ÃŸ". If it is a "ÃŸ", it is displayed correctly.

  2. Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃƒÅ¸". Then it is displayed wrongly, of course.

  3. In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.

What can I do to avoid cases 2 and 3?

How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is but when must I use the functions?) and when must I do nothing with the input?

How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:

  1. How do I find out what encoding the text uses?
  2. How do I convert it to UTF-8 - whatever the old encoding is?

Would a function like this work?

function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}

I've tested it, but it doesn't work. What's wrong with it?

Peter Mortensen
caw
  • 41
    "The "ß" in "Fußball" should look like this in my database: "ÃŸ".". No it should look like ß. Make sure your collation and connection are set up correctly. Otherwise sorting and searching will be broken for you. – Rich Bradshaw Mar 06 '13 at 12:20
  • 5
    Your database is badly set up. If you want to store Unicode content, just configure it for that. So instead of trying to work around the issue in your PHP code, you should first fix the database. – dolmen Jun 09 '14 at 23:55
  • 2
    USE: $from=mb_detect_encoding($text); $text=mb_convert_encoding($text,'UTF-8',$from); – Informate.it Jul 30 '14 at 17:25

26 Answers

385

If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.
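To see the effect on the question's example, here is a minimal sketch. It assumes mb_convert_encoding() with ISO-8859-1 as the source encoding as a stand-in for utf8_encode(), since both perform the same ISO-8859-1 to UTF-8 conversion; the variable names are only illustrative.

```php
<?php
// "Fußball" as UTF-8 bytes: the "ß" is the two bytes 0xC3 0x9F.
$utf8 = "Fu\xC3\x9Fball";

// Treating those bytes as ISO-8859-1 and converting them again
// encodes each of the two bytes separately, giving four bytes.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";          // 4675c39f62616c6c
echo bin2hex($doubleEncoded), "\n"; // 4675c383c29f62616c6c
```

The second string is still valid UTF-8, which is why the damage is easy to miss: it simply spells different characters.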

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.
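The general idea of handling such mixed strings can be sketched as follows. This is not the code of the ForceUTF8 class, only a rough illustration: the function name mixedToUtf8() is hypothetical, and treating every stray byte as Windows-1252 is an assumption.

```php
<?php
// Hypothetical helper, not part of the ForceUTF8 library.
function mixedToUtf8(string $text): string
{
    // First alternative: a run of ASCII bytes and well-formed UTF-8 sequences.
    // Second alternative (group 1): any single byte that did not fit the first.
    $pattern = '/(?:[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}'
        . ')+|(.)/s';

    $result = preg_replace_callback($pattern, function (array $m): string {
        if (isset($m[1])) {
            // Stray byte: assume Windows-1252 and convert it on its own.
            return mb_convert_encoding($m[1], 'UTF-8', 'Windows-1252');
        }
        return $m[0]; // already valid UTF-8, keep it unchanged
    }, $text);

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

echo mixedToUtf8("F\xC3\xA9d\xE9ration"), "\n"; // one "é" is UTF-8, the other Latin1
```

A byte pair that happens to form a valid UTF-8 sequence is always kept as UTF-8, so this kind of approach can misread some Windows-1252 input.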

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8

I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃ©dération Camerounaise de Football");
echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÂÂÂ©dÃÂÂÂÂ©ration Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().

Peter Mortensen
Sebastián Grignoli
  • Thank you very much, this is exactly what I was looking for :) But it would be best to have only one single function which does everything. So forceUTF8() should include fixUTF8()'s skills. – caw Aug 26 '10 at 15:49
  • 1
    Well, if you look at the code, fixUTF8 simply calls forceUTF8 once and again until the string is returned unchanged. One call to fixUTF8() takes at least twice the time of a call to forceUTF8(), so it's a lot less performant. I made fixUTF8() just to create a command line program that would fix "encode-corrupted" files, but in a live environment is rarely needed. – Sebastián Grignoli Aug 27 '10 at 03:33
  • 3
    How does this convert non-UTF8 characters to UTF8, without knowing what encoding the invalid characters are in to begin with? – philfreo Sep 15 '10 at 05:13
  • 4
    It assumes ISO-8859-1, the answer already says this. The only difference between forceUTF8() and utf8_encode() is that forceUTF8() recognizes UTF8 characters and keeps them unchanged. – Sebastián Grignoli Sep 15 '10 at 20:29
  • i had to add $value = str_ireplace("�", "à", $value); before using fixUTF8 – max4ever May 13 '13 at 16:29
  • If you get a code 500 error it means that your php doesn't support namespaces. You can safely remove it in that case. (line 41) – bicycle Jun 19 '13 at 12:57
  • @SebastiánGrignoli would be nice if you could integrate fixUTF8 and toUTF8 into a single (additional?) function. Also an array_walk function with this would be nice :) – bicycle Jun 19 '13 at 12:59
  • These functions already walk arrays recursively if you provide them instead of strings. fixUTF8 is not really intended for production environments. See the second comment on this answer. – Sebastián Grignoli Jun 19 '13 at 13:56
  • 31
    *"You dont need to know what the encoding of your strings is."* - I very much disagree. Guessing and trying may work, but you'll always sooner or later encounter edge cases where it doesn't. – deceze Jun 27 '13 at 19:26
  • 4
    I totally agree. In fact, I didn't mean to state that as a general rule, just explain that this class might help you if that's the situation you happen to find yourself in. – Sebastián Grignoli Jun 27 '13 at 22:26
  • By the way, if the encoding of your string is one of those that I listed, it will always work except for the cases that are mentioned in the comments of the class. Also, fixUTF8() -the second one- comes with a warning: don't use it in production. It will "fix" double encoded strings, but sometimes you want them unfixed, just like in my answer depicting them. – Sebastián Grignoli Jun 27 '13 at 22:31
  • My experience tells me that your code is probably slow. Checking UTF-8 with a regex like in [this answer of mine](http://stackoverflow.com/questions/20025030/convert-all-types-of-smart-quotes-with-php/21491305#21491305) (halfway down) is probably much faster – Walter Tross Feb 04 '14 at 22:07
  • I don't completely understand your regex, but what I wanted to achieve was to be sure that any UTF-8+Win1252/Latin1 mixed encoding strings would always be converted to UTF8, and it does that well. These are sanitization functions, not intended for the frontend tier. – Sebastián Grignoli Feb 06 '14 at 04:14
  • If you sanitize the Win1252 string "…Gruß…" ("\x85Gru\xDF\x85") your way, you end up with "…Gru߅" ("\xE2\x80\xA6Gru\xDF\x85") – Walter Tross Feb 07 '14 at 08:54
  • 1
    My regex checks that a string consists of UTF-8 characters from start to end. This is different from what you do, but I don't think that allowing mixed encodings is a good idea. – Walter Tross Feb 07 '14 at 09:01
  • It depends on your needs. It's a tradeoff and it's fine as long as you know what's going on. The edge case you mention is noted in the comments in the source code of the class. – Sebastián Grignoli Feb 08 '14 at 10:03
  • The original version did not support Win1252, just Latin1+UTF-8. There were less probable misses then. Latin1 does not have an ellipsis where Win1252 does. – Sebastián Grignoli Feb 12 '14 at 04:19
  • I've used your function fix a hacky problem. But I want to know, what exactly does your toWin1252 function do? Why is your toWin1252, toISO8859 and toLatin1 do all the same thing? – CMCDragonkai Nov 07 '14 at 15:06
  • 1
    They are all aliases. Latin1 is a nickname for the ISO8859-1 encoding. Win1252 is almost the same encoding, but with some added characters. At first my function did not recognize those extra characters, but all software that claims to support Latin1 are in fact using Win1252, so It's better to support it here, I guess. – Sebastián Grignoli Nov 09 '14 at 03:03
  • 1
    working with f*** polish letters and Encoding::toUTF8 doesnt work... i receive "?" everywhere. One file is in windows-1250, other one is mixed with something - both fails – xoxn-- 1'w3k4n Jun 05 '17 at 13:42
  • `require_once('Encoding.php');` and `use \ForceUTF8\Encoding;` need to use before declare class – Intacto Nov 28 '17 at 16:38
  • @SebastiánGrignoli fixUTF8() has problems with german umlauts. Lowercase chars are converted correctly `ä => ä`, `ö => ö` but uppercase don't work `Ä => Ã?` which has to be `Ä`. Also `ß` does not get converted to `ß` Is there a way to extend this list in source code? – rabudde Jan 17 '18 at 07:02
  • This looks amazing. Does anyone know how to turn it into a script that can look at all (txt, md, php, css, js, html, htm, ...) files in a directory and sub-directories and run @SebastiánGrignoli above script on them ? Or can I somehow add it to a file explorer (like xyPlorer for windows), or the windows context menu to apply to an entire folder ? – SherylHohman Jan 30 '19 at 18:04
  • Here you go: https://gist.github.com/neitanod/a5eff5bc5b7b49449ea4c952e2a02d28 Replace every `force` with `fix` and `::toUTF8(` with `::fixUTF8(` to use with FIX function instead of FORCE. Always backup your files first! – Sebastián Grignoli Feb 01 '19 at 17:26
  • I used it in a php script with around 1000s of emails using toUTF8() in a loop. And the script crashes. Then i used if condition as suggested by Christian and harpax. And the combination brought faster results, no crashes. – Sunil Kumar May 22 '19 at 11:55
79

You first have to detect what encoding has been used. As you’re parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML declaration. If that’s missing too, use UTF-8 as defined in the specification.


Here is what I probably would do:

I’d use cURL to send the request and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to parse the HTTP response and split it into header and body. The header should then contain the Content-Type header field with the MIME type and (hopefully) the charset parameter with the encoding. If not, we’ll check the XML declaration for the encoding attribute and take the encoding from there. If that’s also missing, the XML specification says to use UTF-8.

$url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';

$accept = array(
    'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
    'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
);
$header = array(
    'Accept: '.implode(', ', $accept['type']),
    'Accept-Charset: '.implode(', ', $accept['charset']),
);
$encoding = null;
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, true);   // include the response header in the result
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
$response = curl_exec($curl);
if (!$response) {
    // error fetching the response
} else {
    // split the raw response into header and body
    $offset = strpos($response, "\r\n\r\n");
    $header = substr($response, 0, $offset);
    $body = substr($response, $offset + 4);
    // stop both captures at the end of the line so they cannot swallow "\r" or later headers
    if (!$header || !preg_match('/^Content-Type:\s*([^;\r\n]+)(?:;\s*charset=([^;\r\n]*))?/im', $header, $match)) {
        // error parsing the response
    } else {
        if (!in_array(strtolower(trim($match[1])), array_map('strtolower', $accept['type']))) {
            // type not accepted
        }
        // the charset parameter is optional, so $match[2] may be missing
        if (isset($match[2])) {
            $encoding = strtolower(trim($match[2], "\"' \t"));
        }
    }
    if (!$encoding) {
        // fall back to the encoding named in the XML declaration
        if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
            $encoding = strtolower(trim($match[1], '"\''));
        }
    }
    if (!$encoding) {
        $encoding = 'utf-8';
    } elseif (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
        // encoding not accepted: do not pass it to mb_convert_encoding()
    } elseif ($encoding != 'utf-8') {
        $body = mb_convert_encoding($body, 'utf-8', $encoding);
        // the XML declaration has to name the new encoding, or the parser would decode the body a second time
        $body = preg_replace('/^(<\?xml[^>]*?encoding=)(?:"[^"]*"|\'[^\']*\')/', '$1"utf-8"', $body, 1);
    }
    $simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
    if (!$simpleXML) {
        // parse error
    } else {
        echo $simpleXML->asXML();
    }
}
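The fallback order used above (HTTP header first, then the XML declaration, then UTF-8) can also be isolated into a small function that works on strings only, which makes it easy to test without any network access. This is just a sketch: the name declaredEncoding() and its two parameters are hypothetical, and the regular expressions are simplified.

```php
<?php
// Hypothetical helper: returns the declared encoding in lower case.
function declaredEncoding(string $headers, string $body): string
{
    // 1. charset parameter of the Content-Type header field
    if (preg_match('/^Content-Type:[^\r\n]*?;\s*charset=["\']?([^"\';\s]+)/im', $headers, $m)) {
        return strtolower($m[1]);
    }
    // 2. encoding attribute of the XML declaration at the start of the body
    if (preg_match('/^<\?xml[^>]*?\sencoding=["\']([^"\']+)["\']/', $body, $m)) {
        return strtolower($m[1]);
    }
    // 3. default when nothing is declared
    return 'utf-8';
}

echo declaredEncoding(
    "HTTP/1.1 200 OK\r\nContent-Type: text/xml; charset=ISO-8859-1\r\n",
    '<?xml version="1.0"?><rss/>'
), "\n"; // iso-8859-1
```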
Peter Mortensen
Gumbo
  • Thanks. This would be easy. But would it really work? There are often wrong encodings given in the HTTP headers or in the attributes of XML. – caw May 27 '09 at 15:47
  • 26
    Again: That’s not your problem. Standards were established to avoid such troubles. If others don’t follow them, it’s their problem, not yours. – Gumbo May 27 '09 at 16:01
  • Thanks for the code. But why not simply use this? http://paste.bradleygill.com/index.php?paste_id=9651 Your code is much more complex, what's better with it? – caw May 29 '09 at 20:33
  • Well, firstly you’re making two requests, one for the HTTP header and one for the data. Secondly, you’re looking for any appearance of `charset=` and `encoding=` and not just at the appropriate positions. And thirdly, you’re not checking if the declared encoding is accepted. – Gumbo May 29 '09 at 20:44
  • You’re not sending any encoding information. Thus the default in HTML (ISO 8859-1) is used. – Gumbo May 30 '09 at 11:01
  • No, that's not the cause. In line 26 of your code there is an error: undefined offset 2: $encoding = trim($match[2], '"\''); Sometimes the characters are correct (ö instead of ö), sometimes they aren't (À instead of ä). So there must be something wrong in your code or in the feed I want to parse. – caw May 30 '09 at 15:10
  • Well then add a line to check if `$match[2]` exists before using it. – Gumbo May 30 '09 at 15:54
  • If $match[2] is set, it's clear that everything is going on as normal. But what to do if $match[2] is not set? Return false? – caw May 31 '09 at 12:34
  • No, just do nothing. If there is no encoding declared in the HTTP header, the encoding in the XML declaration is used. And if that’s missing too, the default encoding is used. – Gumbo May 31 '09 at 12:41
  • Yes, logical. :) My very last question: Why is the following line there? if (!in_array($encoding, array_map('strtolower', $accept['charset']))) { // encoding not accepted } Can't I just let it out? – caw May 31 '09 at 13:02
  • That piece of code was intended to accept just the charsets/encodings `mb_convert_encoding` accepts (see `mb_list_encodings`). Otherwise `mb_convert_encoding` will probably throw an error. – Gumbo May 31 '09 at 13:12
  • But it doesn't prevent block wrong encodings/charsets since the following line is no elseif but a normal if, right? So the line can be deleted without changing something, can't it? – caw May 31 '09 at 14:31
  • Your code also gives this error message: Warning: mb_convert_encoding() [function.mb-convert-encoding]: Illegal character encoding specified – caw Jun 04 '09 at 15:19
  • Then try to find out the cause of this error. It took me just ten minutes to write that code and didn’t tested it well. It might have some errors more than this. – Gumbo Jun 04 '09 at 15:39
44

Detecting the encoding is hard.

mb_detect_encoding works by guessing, based on a number of candidates that you pass it. In some encodings, certain byte sequences are invalid, and therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings where the same bytes are valid (but mean different things). In these cases, there is no way to determine the encoding, though you can implement your own logic to make guesses. For example, data coming from a Japanese site is more likely to use a Japanese encoding.

As long as you only deal with Western European languages, the three major encodings to consider are UTF-8, ISO-8859-1 and CP-1252. Since these are the defaults for many platforms, they are also the ones most likely to be declared wrongly. People who use any other encoding are likely to declare it honestly, since their software would otherwise break very often. Therefore, a good strategy is to trust the provider unless the encoding is reported as one of those three. You should still double-check that the input is indeed valid, using mb_check_encoding (note that being valid in an encoding is not the same as being in that encoding; the same input may be valid in many encodings). If it is one of those three, you can then use mb_detect_encoding to distinguish between them. Luckily that is fairly deterministic; you just need to use the proper detect sequence, which is UTF-8,ISO-8859-1,WINDOWS-1252.

Once you've detected the encoding, you need to convert it to your internal representation (UTF-8 is the only sane choice). The function utf8_encode transforms ISO-8859-1 to UTF-8, so it can only be used for that particular input type. For other encodings, use mb_convert_encoding.
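A minimal sketch of this strategy could look like the following. The function name convertDeclared() is hypothetical, and two simplifications are assumptions of this sketch: the three suspicious encodings are told apart with mb_check_encoding() alone, and anything that is not valid UTF-8 is treated as Windows-1252.

```php
<?php
// Hypothetical helper: $declared is the encoding the provider reported.
function convertDeclared(string $text, string $declared): string
{
    $suspect = ['utf-8', 'iso-8859-1', 'windows-1252', 'cp1252'];

    if (!in_array(strtolower($declared), $suspect, true)) {
        // Any other encoding is trusted. Note that mb_convert_encoding()
        // rejects encoding names that mbstring does not know.
        return mb_convert_encoding($text, 'UTF-8', $declared);
    }
    // Of the three suspects, only UTF-8 can fail validation, so test it first.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Assumption: otherwise treat the input as Windows-1252, which agrees
    // with ISO-8859-1 for the bytes 0xA0 to 0xFF.
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// Declared as ISO-8859-1 but actually UTF-8: returned unchanged.
echo convertDeclared("Fu\xC3\x9Fball", 'ISO-8859-1'), "\n";
```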

troelskn
  • Thank you very much! What's better: mb-convert-encoding() or iconv()? I don't know what the differences are. Yes, I will only have to parse Western European languages, especially English, German and French. – caw May 26 '09 at 14:42
  • 8
    I've just seen: mb-detect-encoding() ist useless. It only supports UTF-8, UTF-7, ASCII, EUC-JP,SJIS, eucJP-win, SJIS-win, JIS and ISO-2022-JP. The most important ones for me, ISO-8859-1 and WINDOWS-1252, aren't supported. So I can't use mb-detect-encoding(). – caw May 26 '09 at 18:49
  • 1
    My, you're right. It's been a while since I've used it. You'll have to write your own detection-code then, or use an external utility. UTF-8 can be fairly reliably determined, because its escape sequences are quite characteristic. wp-1252 and iso-8859-1 can be distinguished because wp-1252 may contain bytes that are illegal in iso-8859-1. Use Wikipedia to get the details, or look in the comments-section of php.net, under various charset-related functions. – troelskn May 26 '09 at 19:03
  • I think you can distinguish the different encodings when you look at the forms which the special sings emerge in: The German "ß" emerges in different forms: Sometimes "Ÿ", sometimes "ß" and sometimes "ß". Why? – caw May 26 '09 at 19:47
  • Yes, but then you need to know the contents of the string before comparing it, and that kind of defeats the purpose in the first place. The German ß appears differently because it has different values in different encodings. Somce characters happen to be represented in the same way in different encodings (eg. all characters in the ascii charset are encoded in the same way in utf-8, iso-8859-* and wp-1252), so as long as you use just those characters, they all look the same. That's why they are some times called ascii-compatible. – troelskn May 26 '09 at 20:36
  • Ok, then it's quite easy, isn't it? Can't I just look for "Ã" in the texts? This only emerges if the text is double UTF-8 encoded, so too often encoded. So I must only decode it one time, right? The "Ã" wouldn't appear if the text is correct since the "Â" doesn't appear in German or English texts normally. Would this be a good approach? How could I code this in PHP? Would it work? – caw May 27 '09 at 15:40
  • You cannot always tell just from looking for such oddities if some data is not proper encoded. There always might be the possibility that they are intended. Take your own question as an example. – Gumbo May 27 '09 at 16:03
  • Yes, they might be intended. But I would be fine for me if 99% of the texts are displayed correctly and only 1% is displayed wrongly because the "strange" characters were intended. If there was a possibility to achieve this, I would like to use it. – caw May 27 '09 at 16:31
  • @marco92w: Well then I’d suggest to try the standards way. I’d say the error rate is not much higher than with your guessing method. But even if it’s higher you would support the standards. – Gumbo May 27 '09 at 16:36
  • Thank you for you help! You've definitely convinced me to use the standards way. Is this script correct? http://paste.bradleygill.com/index.php?paste_id=9651 (Sorry for posting it several times as a comment but you shouldn't overlook it. One answer is enough for me. :) – caw May 27 '09 at 17:04
  • looks like ISO-8859-* and Windows-1252 are supported by mb_detect_encoding http://www.php.net/manual/en/mbstring.supported-encodings.php – chim Dec 08 '11 at 16:13
  • Unless you know better, test if your input is valid UTF-8 string and if not, blindly convert from Windows-1252 to UTF-8. This usually works for Western European Languages because if the input happens to be ISO-8859-1, it's a subset of Windows-1252 and the conversion will be correct. The only really problematic issue is ISO-8859-15 which as EUR sign ("€") in position 0xA4 whereas Windows-1252 has generic currency sign ("¤") in the same position. You can apply some heuristics to decide between ISO-8859-15 and Windows-1252 but you can never be sure. – Mikko Rantalainen Dec 11 '15 at 12:13
  • @MikkoRantalainen windows-1252 is not a subset of iso-8859-1 though. They are almost identical except for a few code points (Notably some quote characters). – troelskn Dec 12 '15 at 13:28
14

This cheatsheet lists some common caveats related to UTF-8 handling in PHP: http://developer.loftdigital.com/blog/php-utf-8-cheatsheet

This function detecting multibyte characters in a string might also prove helpful (source):


function detectUTF8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )+%xs', 
    $string);
}
miek
11

A little heads up. You said that the "ß" should be displayed as "Ÿ" in your database.

This is probably because you're using a database with Latin-1 character encoding, or possibly because your PHP-MySQL connection is set up wrongly. That is, PHP believes your MySQL is set to use UTF-8, so it sends data as UTF-8, but MySQL believes PHP is sending data encoded as ISO 8859-1, so it may once again encode the data you sent as UTF-8, causing this kind of trouble.

Take a look at mysql_set_charset. It may help you.

Peter Mortensen
Krynble
6

Your encoding looks like you encoded into UTF-8 twice; that is, from some other encoding, into UTF-8, and again into UTF-8. As if you had ISO 8859-1, converted from ISO 8859-1 to UTF-8, and treated the new string as ISO 8859-1 for another conversion into UTF-8.

Here's some pseudocode of what you did:

$inputstring = getFromUser();
$utf8string = iconv($current_encoding, 'utf-8', $inputstring);
$flawedstring = iconv($current_encoding, 'utf-8', $utf8string);

You should try:

  1. detect encoding using mb_detect_encoding() or whatever you like to use
  2. if it's UTF-8, convert into ISO 8859-1, and repeat step 1
  3. finally, convert back into UTF-8

That is presuming that you used ISO 8859-1 in the "middle" conversion. If you used Windows-1252, then convert into Windows-1252 (latin1) instead. The original source encoding is not important; the one you used in the flawed second conversion is.

This is my guess at what happened; there's very little else you could have done to get four bytes in place of one extended ASCII byte.

The German language also uses ISO 8859-2 and Windows-1250 (Latin-2).
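The repair described in the steps above can be sketched like this. The function name undoDoubleUtf8() and its $middle parameter are hypothetical, and ISO-8859-1 as the default "middle" encoding is an assumption. The sketch undoes one layer only, and only if doing so loses nothing and still leaves valid UTF-8.

```php
<?php
// Hypothetical helper: undo one accidental "encode as UTF-8 again" step.
function undoDoubleUtf8(string $text, string $middle = 'ISO-8859-1'): string
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        return $text; // not UTF-8 at all, nothing to undo
    }
    // Reverse the flawed second conversion.
    $candidate = mb_convert_encoding($text, $middle, 'UTF-8');

    // Accept the result only if the step is reversible (no "?" substitutions)
    // and what remains is itself valid UTF-8.
    $lossless = mb_convert_encoding($candidate, 'UTF-8', $middle) === $text;
    if ($lossless && $candidate !== $text && mb_check_encoding($candidate, 'UTF-8')) {
        return $candidate;
    }
    return $text;
}

$double = "Fu\xC3\x83\xC2\x9Fball";          // "Fußball" encoded into UTF-8 twice
echo bin2hex(undoDoubleUtf8($double)), "\n"; // 4675c39f62616c6c
```

Text that really is meant to contain a sequence such as "Ã©" would be "repaired" as well, so a check like this is better suited to a one-off cleanup than to every request.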

Peter Mortensen
Ivan Vučica
5

A really nice way to implement an isUTF8-function can be found on php.net:

function isUTF8($string) {
    return (utf8_encode(utf8_decode($string)) == $string);
}
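Because the round trip goes through ISO 8859-1, it can only succeed for characters that exist in that encoding. The sketch below compares it with mb_check_encoding(). The round trip is emulated with mb_convert_encoding() instead of utf8_encode()/utf8_decode(), which is an assumption made so that the example runs without deprecation notices on PHP 8.2 and later; both function names are hypothetical.

```php
<?php
// Emulates utf8_encode(utf8_decode($s)) == $s from the function above.
function roundTripLooksUtf8(string $s): bool
{
    $latin1 = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1') === $s;
}

// Validates the byte sequences directly, for any Unicode character.
function isValidUtf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

$euro = "\xE2\x82\xAC"; // "€" in UTF-8; it has no ISO 8859-1 equivalent

var_dump(roundTripLooksUtf8($euro)); // bool(false)
var_dump(isValidUtf8($euro));        // bool(true)
```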
harpax
  • 24
    Unfortunately, this only works when the string only consists of characters that are included in ISO-8859-1. But this could work: @iconv('utf-8', 'utf-8//IGNORE', $str) == $str – Christian Davén Aug 17 '11 at 12:20
  • @Christian: Indeed, that's what the authors of High Performance MySQL recommend too. – Alix Axel Dec 19 '11 at 07:39
  • 1
    Its doesn't work correctly: echo (int)isUTF8(' z'); # 1 echo (int)isUTF8(NULL); # 1 – Yousha Aleayoub Aug 02 '12 at 15:47
  • 1
    Though not perfect, I think this is a nice way to implement a sketchy UTF-8 check. – Mateng Apr 02 '13 at 14:13
  • 2
    `mb_check_encoding($string, 'UTF-8')` – deceze Jun 27 '13 at 19:29
  • 6
    Just to put into context how badly this will work: there are exactly 191 printable characters in ISO 8859-1; Unicode 13 defines about 140000. So if you pick a random Unicode character, encode it correctly as UTF-8, and pass it to this function, there is a more than 99% chance of this function incorrectly returning false. In case you think those are obscure characters, note that ISO 8859-1 has no Euro symbol, so `isUTF8('€')` will be among that 99%. – IMSoP Mar 23 '21 at 19:33
4

The interesting thing about mb_detect_encoding and mb_convert_encoding is that the order of the encodings you suggest does matter:

// $input is actually UTF-8

mb_detect_encoding($input, "ISO-8859-9, UTF-8");
// ISO-8859-9 (WRONG!)

mb_detect_encoding($input, "UTF-8, ISO-8859-9");
// UTF-8 (OK)

So you might want to use a specific order when specifying expected encodings. Still, keep in mind that this is not foolproof.
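The reason the order matters can be made visible with a small "first candidate that validates wins" loop. This is only an illustration built on mb_check_encoding(), not how mb_detect_encoding() is implemented; the function name firstValidEncoding() is hypothetical.

```php
<?php
// Hypothetical helper: return the first candidate the input is valid in.
function firstValidEncoding(string $input, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($input, $encoding)) {
            return $encoding;
        }
    }
    return null; // no candidate accepted the input
}

$input = "Fu\xC3\x9Fball"; // actually UTF-8

// ISO-8859-9 assigns a meaning to every byte, so it accepts any input.
var_dump(firstValidEncoding($input, ['ISO-8859-9', 'UTF-8']));
var_dump(firstValidEncoding($input, ['UTF-8', 'ISO-8859-9']));
```

Putting the encodings that can reject input (such as UTF-8) before the ones that accept everything gives the stricter test a chance to run.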

Halil Özgür
  • 2
    This happens because ISO-8859-9 will in practice accept any binary input. The same goes for Windows-1252 and friends. You have to first test for encodings that can fail to accept the input. – Mikko Rantalainen Dec 11 '15 at 12:19
  • @MikkoRantalainen, yeah, I guess this part of the docs says something similar: http://php.net/manual/en/function.mb-detect-order.php#example-2985 – Halil Özgür Dec 11 '15 at 14:37
  • Considering that WHATWG HTML spec defines Windows 1252 as the default encoding, it should be pretty safe to assume `if ($input_is_not_UTF8) $input_is_windows1252 = true;`. See also: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding – Mikko Rantalainen Sep 18 '18 at 06:30
2

You need to test the character set on input since responses can come coded with different encodings.

I force all incoming content into UTF-8 by detecting its encoding and converting it with the following function:

function fixRequestCharset()
{
  $ref = array(&$_GET, &$_POST, &$_REQUEST);
  foreach ($ref as &$var)
  {
    foreach ($var as $key => $val)
    {
      // nested arrays (e.g. ?a[]=1) are skipped; only plain strings are converted
      if (!is_string($val))
        continue;
      $encoding = mb_detect_encoding($val, mb_detect_order(), true);
      if (!$encoding)
        continue;
      if (strcasecmp($encoding, 'UTF-8') != 0)
      {
        $converted = iconv($encoding, 'UTF-8', $val);
        if ($converted === false)
          continue;
        $var[$key] = $converted;
      }
    }
  }
}

That routine converts all PHP request variables that come from the remote host into UTF-8, or leaves a value untouched if its encoding could not be detected or converted.

You can customize it to your needs. Just invoke it before using the variables.

Peter Mortensen
cavila
  • what is the purpose of using mb_detect_order() without a passed in encoding list? – giorgio79 Dec 20 '14 at 16:28
  • The purpose is to return the system configured ordered array of encodings defined in php.ini used. This is required by mb_detect_encoding to fill third parameter. – cavila Jan 10 '15 at 13:38
2

Working out the character encoding of RSS feeds seems to be complicated. Even normal web pages often omit, or lie about, their encoding.

So you could try to use the correct way to detect the encoding and then fall back to some form of auto-detection (guessing).

Kevin ORourke
  • I don't want to read out the encoding from the feed information. So it's equal if the feed information are wrong. I would like to detect the encoding from the text. – caw May 26 '09 at 18:45
  • @marco92w: It’s not your problem if the declared encoding is wrong. Standards have not been established for fun. – Gumbo May 26 '09 at 20:14
  • 1
    @Gumbo: but if you're working in the real world you have to be able to deal with things like incorrect declared encodings. The problem is that it's very difficult to guess (correctly) the encoding just from some text. Standards are wonderful, but many (most?) of the pages/feeds out there doesn't comply with them. – Kevin ORourke May 27 '09 at 12:22
  • @Kevin ORourke: Exactly, right. That's my problem. @Gumbo: Yes, it's my problem. I want to read out the feeds and aggregate them. So I must correct the wrong encodings. – caw May 27 '09 at 15:37
  • @marco92w: But you cannot correct the encoding if you don’t know the correct encoding and the current encoding. And that’s what the `charset`/`encoding` declaration if for: describe the encoding the data is encoded in. – Gumbo May 27 '09 at 16:20
  • Oh, now I've understood it. I thought it would be possible because I can surely say that "Ã" can't appear but "Ÿ" does. Another method I had imagined was to utf8_decode() it and then look whether it is a normal text. If there is any "Ã" after utf8_decode() then it must be wrong. – caw May 27 '09 at 16:29
  • @marco92w: Again, the character that’s shown to you depends on the character encoding/set that was used to interpret the data. If you interpret UTF-8 encoded with something other than UTF-8 you will probably get some oddities (excet you’re just using ASCII characters). – Gumbo May 27 '09 at 16:33
  • Thank you for you help! You've definitely convinced me to use the standards way. Is this script correct? http://paste.bradleygill.com/index.php?paste_id=9651 (Sorry for posting it several times as a comment but you shouldn't overlook it. One answer is enough for me. :) – caw May 27 '09 at 17:04
2

mb_detect_encoding:

echo mb_detect_encoding($str, "auto");

Or

echo mb_detect_encoding($str, "UTF-8, ASCII, ISO-8859-1");

I really don't know what the results are, but I'd suggest you just take some of your feeds with different encodings and try if mb_detect_encoding works or not.

auto is short for "ASCII,JIS,UTF-8,EUC-JP,SJIS". It returns the detected charset, which you can use to convert the string to UTF-8 with iconv.

<?php
function convertToUTF8($str) {
    $enc = mb_detect_encoding($str);

    if ($enc && $enc != 'UTF-8') {
        return iconv($enc, 'UTF-8', $str);
    } else {
        return $str;
    }
}
?>

I haven't tested it, so no guarantee. And maybe there's a simpler way.

Peter Mortensen
stefs
  • Thank you. What's the difference between 'auto' and 'UTF-8, ASCII, ISO-8859-1' as the second argument? Does 'auto' feature more encodings? Then it would be better to use 'auto', wouldn't it? If it really works without any bugs then I must only change "ASCII" or "ISO-8859-1" to "UTF-8". How? – caw May 26 '09 at 14:14
  • 2
    Your function doesn't work well in all cases. Sometimes I get an error: Notice: iconv(): Detected an illegal character in input string in ... – caw May 26 '09 at 19:50
1

I know this is an older question, but I figure a useful answer never hurts. I was having issues with my encoding between a desktop application, SQLite, and GET/POST variables. Some would be in UTF-8, some would be in ASCII, and basically everything would get screwed up when foreign characters got involved.

Here is my solution. It scrubs your GET/POST/REQUEST (I omitted cookies, but you could add them if desired) on each page load before processing. It works well in a header. PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with @'s.

//Convert everything in our vars to UTF-8 for playing nice with the database...
//Use some auto detection here to help us not double-encode...
//Suppress possible warnings with @'s for when encoding cannot be detected
try
{
    $process = array(&$_GET, &$_POST, &$_REQUEST);
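    // each() was removed in PHP 8; an index loop also reaches the entries appended below
    for ($key = 0; isset($process[$key]); $key++) {
        $val = $process[$key];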
        foreach ($val as $k => $v) {
            unset($process[$key][$k]);
            if (is_array($v)) {
                $process[$key][@mb_convert_encoding($k,'UTF-8','auto')] = $v;
                $process[] = &$process[$key][@mb_convert_encoding($k,'UTF-8','auto')];
            } else {
                $process[$key][@mb_convert_encoding($k,'UTF-8','auto')] = @mb_convert_encoding($v,'UTF-8','auto');
            }
        }
    }
    unset($process);
}
catch(Exception $ex){}
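The same idea can also be written recursively, which may be easier to follow than the reference juggling above. This is only a sketch under assumptions: scrubToUtf8() is a made-up name, and instead of 'auto' detection it converts anything that is not valid UTF-8 from an assumed source encoding ($from).

```php
<?php
// Hypothetical recursive variant; $from is an assumed source encoding.
function scrubToUtf8(array $data, string $from = 'ISO-8859-1'): array
{
    $clean = [];
    foreach ($data as $key => $value) {
        // Convert string keys that are not valid UTF-8
        if (is_string($key) && !mb_check_encoding($key, 'UTF-8')) {
            $key = mb_convert_encoding($key, 'UTF-8', $from);
        }
        if (is_array($value)) {
            $value = scrubToUtf8($value, $from); // descend into nested arrays
        } elseif (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $value = mb_convert_encoding($value, 'UTF-8', $from);
        }
        $clean[$key] = $value;
    }
    return $clean;
}
```

It would be used as `$_GET = scrubToUtf8($_GET);` and likewise for `$_POST` and `$_REQUEST`.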
jocull
  • 20,008
  • 22
  • 105
  • 149
  • Thanks for the answer, jocull. The function mb_convert_encoding() is what we've already had here, right? ;) So the only new thing in your answer is the loops to change encoding in all variables. – caw May 23 '10 at 21:54
1

harpax' answer worked for me. In my case, this is good enough:

if (isUTF8($str)) {
    echo $str;
}
else
{
    echo iconv("ISO-8859-1", "UTF-8//TRANSLIT", $str);
}
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
PJ Brunet
  • 3,615
  • 40
  • 37
1

I had been looking for a solution to encoding problems for ages, and this page is probably the conclusion of years of searching! I tested some of the suggestions mentioned here, and these are my notes:

This is my test string:

this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special chàrs to see thèm, convertèd by fùnctìon!! & that's it!

I do an INSERT to save this string in a database, in a field whose collation is utf8_general_ci.

The character set of my page is UTF-8.

If I do an INSERT just like that, in my database, I have some characters probably coming from Mars...

So I need to convert them into some "sane" UTF-8. I tried utf8_encode(), but alien characters were still invading my database...

So I tried to use the function forceUTF8 posted on number 8, but in the database the string saved looks like this:

this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special chà rs to see thèm, convertèd by fùnctìon!! & that's it!

So, collecting some more information from this page and merging it with information from other pages, I solved my problem with this:

$finallyIDidIt = mb_convert_encoding(
  $string,
  mysql_client_encoding($resourceID),
  mb_detect_encoding($string)
);

Now in my database I have my string with correct encoding.

NOTE:

The only thing to watch out for is the mysql_client_encoding function: you need to be connected to the database already, because it expects a resource ID as a parameter.

But well, I just do that re-encoding before my INSERT so for me it is not a problem.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Mauro
  • 19
  • 1
  • 2
    Why do you not just use `UTF-8` client encoding for mysql in the first place? Would not need manual conversion this way – Esailija Jul 31 '12 at 07:14
1

It's simple: when you get something that's not UTF-8, you must encode it into UTF-8.

So, when you're fetching a feed that's ISO 8859-1, pass it through utf8_encode().

However, if you're fetching a UTF-8 feed, you don't need to do anything.
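To decide which case applies, one option is to read the encoding the feed itself declares in its XML prolog. The following is a minimal sketch, not a full XML parser: declaredXmlEncoding() and feedToUtf8() are hypothetical names, and it assumes the declaration is truthful and names an encoding that mbstring knows.

```php
<?php
// Hypothetical helper: read the encoding declared in the XML prolog of a feed.
function declaredXmlEncoding(string $xml, string $default = 'UTF-8'): string
{
    $pattern = '/^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/i';
    if (preg_match($pattern, $xml, $m)) {
        return strtoupper($m[1]);
    }
    return $default; // no declaration found: assume UTF-8
}

// Hypothetical helper: convert the raw feed to UTF-8 based on its declaration.
function feedToUtf8(string $xml): string
{
    $enc = declaredXmlEncoding($xml);
    return $enc === 'UTF-8' ? $xml : mb_convert_encoding($xml, 'UTF-8', $enc);
}
```

Note that the converted text still carries the old encoding in its prolog, so this is meant for text you extract and store, not for feeding the result back into an XML parser unchanged.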

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Seb
  • 24,920
  • 5
  • 67
  • 85
  • Thanks! OK, I can find out how the feed is encoded by using mb_detect_encoding(), right? But what can I do if the feed is ASCII? utf8_encode() is just for ISO-8859-1 to UTF-8, isn't it? – caw May 26 '09 at 13:58
  • ASCII is a subset of ISO-8859-1 AND UTF-8, so using utf8_encode() should not make a change - IF it's actually just ASCII – Michael Borgwardt May 26 '09 at 14:12
  • So I can always use utf8_encode if it's not UTF-8? This would be really easy. The text which was ASCII according to mb_detect_encoding() contained "&auml;". Is this an ASCII character? Or is it HTML? – caw May 26 '09 at 15:06
  • That's HTML. Actually, that's encoded so that it shows OK when you print it in a given page. If you want, you can first utf8_encode() and then html_entity_decode(). – Seb May 26 '09 at 16:23
  • Yes, html_entity_decode() works in this case. But: The German "ß" emerges in different forms: Sometimes "ÃŸ", sometimes "ÃƒÂŸ" and sometimes "ß". Why? – caw May 26 '09 at 19:46
  • 2
    The character ß is encoded in UTF-8 with the byte sequence 0xC39F. Interpreted with Windows-1252, that sequence represents the two characters Ã (0xC3) and Ÿ (0x9F). And if you encode this byte sequence again with UTF-8, you’ll get 0xC383 0xC29F, which represents ÃƒÂŸ in Windows-1252. So your mistake is to handle this UTF-8 encoded data as something with an encoding other than UTF-8. That this byte sequence is presented as the characters you’re seeing is just a matter of interpretation. If you use another encoding/charset, you’ll probably see other characters. – Gumbo May 26 '09 at 20:12
  • Thank you. First, I want to say that all UTF-8 characters are shown as interpreted with Windows-1252 in my PHPMyAdmin. I don't handle them wrong. "ÃŸ" is displayed correctly as "ß". I do the same things with all RSS feeds but some feeds are parsed as "ÃŸ" and some are parsed as "ÃƒÂŸ". That's the problem. Can't I do the following: Look for "Ã" in the text. If it is in the text, then it must be double UTF-8 encoded. So I simply decode it one time and everything is fine. Would this work? How could I code this? – caw May 27 '09 at 15:36
  • That’s why you should take the declared encoding into account. Not all data is encoded with the same encoding using the same character set. There are plenty of different character sets. Just by looking at the byte sequences, you cannot determine which character set was used. Take the ISO 8859 character set family as an example: 15 different character sets all use the same encoding. – Gumbo May 27 '09 at 16:15
  • Thank you for your help! You've definitely convinced me to use the standards way. Is this script correct? http://paste.bradleygill.com/index.php?paste_id=9651 (Sorry for posting it several times as a comment but you shouldn't overlook it. One answer is enough for me. :) – caw May 27 '09 at 17:04
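The byte sequences Gumbo describes in the comments above can be reproduced with mbstring. This is just an illustrative sketch; the variable names are made up, and it shows both the ISO-8859-1 and the Windows-1252 misreading, since the two differ for the byte 0x9F.

```php
<?php
$utf8 = "\xC3\x9F"; // "ß" encoded as UTF-8

// Misread those bytes as a single-byte charset and encode them again:
$viaLatin1 = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
$viaCp1252 = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

echo bin2hex($utf8), PHP_EOL;      // c39f
echo bin2hex($viaLatin1), PHP_EOL; // c383c29f
echo bin2hex($viaCp1252), PHP_EOL; // c383c5b8

// Undoing one round of the Windows-1252 misreading restores the original bytes
$repaired = mb_convert_encoding($viaCp1252, 'Windows-1252', 'UTF-8');
echo bin2hex($repaired), PHP_EOL;  // c39f
```

Comparing hex dumps like this is usually more reliable than looking at the characters a tool chooses to display.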
0

Get the encoding from the HTTP headers and convert the content to UTF-8.

$post_url = 'http://website.domain';

/// Get headers ///////////////////////////////////////////////
function get_headers_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL,            $url);
    curl_setopt($ch, CURLOPT_HEADER,         true);
    curl_setopt($ch, CURLOPT_NOBODY,         true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT,        15);

    $r = curl_exec($ch);
    return $r;
}

$the_header = get_headers_curl($post_url);

/// Check for redirect ////////////////////////////////////////
// Match case-insensitively and capture the target, so "location:" works too
if (preg_match('/^Location:\s*(\S+)/mi', $the_header, $m)) {
    $the_header = get_headers_curl($m[1]);
}

/// Get charset ///////////////////////////////////////////////
$charset = null; // stays null when the headers declare no charset
if (preg_match('/charset=([^\s;"\']+)/i', $the_header, $m)) {
    $charset = $m[1];
}

///////////////////////////////////////////////////////////////////
// echo $charset;

// $html is assumed to hold the page body, fetched separately
if ($charset && strcasecmp($charset, 'UTF-8') !== 0) {
    $html = iconv($charset, "UTF-8", $html);
}
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Arsen
  • 3
  • 2
0

ÃŸ is Mojibake for ß. In your database, you may have one of the following hex values (use SELECT HEX(col)... to find out):

  • DF if the column is "latin1",
  • C39F if the column is utf8 -- OR -- it is latin1, but "double-encoded"
  • C383C5B8 if double-encoded into a utf8 column

You should not use any encoding/decoding functions in PHP; instead, you should set up the database and the connection to it correctly.

If MySQL is involved, see: Trouble with UTF-8 characters; what I see is not what I stored

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • What do you mean by *"you may have hex"*? Arbitrary binary data? Or something else? Please respond by [editing (changing) your answer](https://stackoverflow.com/posts/39045901/edit), not here in comments (***without*** "Edit:", "Update:", or similar - the answer should appear as if it was written today). – Peter Mortensen Apr 21 '22 at 10:13
  • @PeterMortensen - Yeah, my wording was rather cryptic. I hope my clarification helps. Do a `SELECT HEX(col)...` to see what is in the table. – Rick James Apr 21 '22 at 20:25
0
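if (!mb_check_encoding($str, 'UTF-8')) {
    // Not valid UTF-8, so assume Windows-1251 (Cyrillic); change this to match your source
    $str = iconv("windows-1251", "UTF-8", $str);
}

This worked for me.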

Mike S
  • 192
  • 3
  • 9
0

After sorting out your PHP scripts, don't forget to tell MySQL what charset you are passing and would like to receive.

Example: set the character set to UTF-8

Passing UTF-8 data to a Latin 1 table in a Latin 1 I/O session gives those nasty bird's feet. I see this every other day in OsCommerce shops. Going back and forth, it might seem right, but phpMyAdmin will show the truth. By telling MySQL what charset you are passing, it will handle the conversion of MySQL data for you.

How to recover existing scrambled MySQL data is another question. :)
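As a rough sketch of what "telling MySQL" can look like with PDO: the charset can be declared in the DSN. The helper name mysqlUtf8Dsn() and the host and database names are made up for illustration, and utf8mb4 is my assumption about the charset you want for full UTF-8 support.

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that declares the connection charset.
function mysqlUtf8Dsn(string $host, string $dbname): string
{
    // utf8mb4 is MySQL's full UTF-8 character set
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

// With real credentials this would be used as:
// $pdo = new PDO(mysqlUtf8Dsn('localhost', 'feeds'), $user, $password);
echo mysqlUtf8Dsn('localhost', 'feeds'), PHP_EOL;
```

With mysqli, the equivalent step would be calling set_charset('utf8mb4') on the connection right after connecting.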

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
tim
  • 2,530
  • 3
  • 26
  • 45
-1

I had the same issue with phpQuery (ISO-8859-1 instead of UTF-8) and this hack helped me:

$html = '<?xml version="1.0" encoding="UTF-8" ?>' . $html;

mb_internal_encoding('UTF-8'), phpQuery::newDocumentHTML($html, 'utf-8'), mbstring.internal_encoding and other manipulations had no effect.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
-1

I found a solution at http://deer.org.ua/2009/10/06/1/:

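class Encoding
{
    /**
     * http://deer.org.ua/2009/10/06/1/
     * @param string $string
     * @return string|null The first encoding that round-trips unchanged, or null
     */
    public static function detect_encoding($string)
    {
        static $list = ['utf-8', 'windows-1251'];

        foreach ($list as $item) {
            try {
                // iconv() returns false and raises a notice on invalid input; the
                // catch only fires if an error handler turns notices into exceptions
                $sample = iconv($item, $item, $string);
            } catch (\Exception $e) {
                continue;
            }
            if ($sample !== false && md5($sample) == md5($string)) {
                return $item;
            }
        }
        return null;
    }
}

$content = file_get_contents($file['tmp_name']);
$encoding = Encoding::detect_encoding($content);
if ($encoding !== null && $encoding != 'utf-8') {
    $result = iconv($encoding, 'utf-8', $content);
} else {
    // Already UTF-8, or the encoding could not be detected
    $result = $content;
}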

I think that @ is a bad decision and made some changes to the solution from deer.org.ua.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
outdead
  • 448
  • 6
  • 15
-1

Chinese text is commonly encoded in GBK. In addition, when I tested it, the most-voted answer didn't work. Here is a simple fix that falls back to GBK:

function toUTF8($raw) {
    try{
        return mb_convert_encoding($raw, "UTF-8", "auto"); 
    }catch(\Exception $e){
        return mb_convert_encoding($raw, "UTF-8", "GBK"); 
    }
}

Remark: This solution was written in 2017 and fixed the problem for the PHP versions of that time. I have not tested whether the latest PHP versions handle auto correctly.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
ch271828n
  • 15,854
  • 5
  • 53
  • 88
  • 1
    Do you have any insight why, or how your files were different? What parts didn't work for you? For example: Uppercase German characters didn't convert correctly. Curious, what is "GBK" ? – SherylHohman Jan 30 '19 at 17:58
  • In what way doesn't the most voted answer work? – Peter Mortensen Apr 20 '22 at 23:42
  • An explanation would be in order. E.g., what is the idea/gist? From [the Help Center](https://stackoverflow.com/help/promotion): *"...always explain why the solution you're presenting is appropriate and how it works"*. Please respond by [editing (changing) your answer](https://stackoverflow.com/posts/44816067/edit), not here in comments (***without*** "Edit:", "Update:", or similar - the answer should appear as if it was written today). – Peter Mortensen Apr 20 '22 at 23:46
-1

Try without 'auto'

That is:

mb_detect_encoding($text)

instead of:

mb_detect_encoding($text, 'auto')

More information can be found here: mb_detect_encoding

YakovL
  • 7,557
  • 12
  • 62
  • 102
tkartas
  • 51
  • 1
  • An explanation would be in order. E.g., what is the idea/gist? What kind of input was it tested on? From [the Help Center](https://stackoverflow.com/help/promotion): *"...always explain why the solution you're presenting is appropriate and how it works"*. Please respond by [editing (changing) your answer](https://stackoverflow.com/posts/45252648/edit), not here in comments (***without*** "Edit:", "Update:", or similar - the answer should appear as if it was written today). – Peter Mortensen Apr 20 '22 at 23:47
-1

Try this: any text that is not valid UTF-8 will be converted.

function is_utf8($str) {
    return (bool) preg_match('//u', $str);
}

$myString = "Fußball";

if(!is_utf8($myString)){
    $myString = utf8_encode($myString);
}

// or 1 line version ;) 
$myString = !is_utf8($myString) ? utf8_encode($myString) : trim($myString);
MMJ
  • 555
  • 4
  • 6
-1

When you try to handle multiple languages, such as Japanese and Korean, you might run into trouble.

mb_convert_encoding with the 'auto' parameter doesn't work well. Setting mb_detect_order('ASCII,UTF-8,JIS,EUC-JP,SJIS,EUC-KR,UHC') doesn't help since it will detect EUC-* wrongly.

I concluded that, as long as the input strings come from HTML, you should use the 'charset' declared in a meta element. I use Simple HTML DOM Parser because it supports invalid HTML.

The below snippet extracts the title element from a web page. If you would like to convert the entire page, then you may want to remove some lines.

<?php
require_once 'simple_html_dom.php';

echo convert_title_to_utf8(file_get_contents($argv[1])), PHP_EOL;

function convert_title_to_utf8($contents)
{
    $dom = str_get_html($contents);
    $title = $dom->find('title', 0);
    if (empty($title)) {
        return null;
    }
    $title = $title->plaintext;
    $metas = $dom->find('meta');
    $charset = 'auto';
    foreach ($metas as $meta) {
        if (!empty($meta->charset)) { // HTML5
            $charset = $meta->charset;
        } else if (preg_match('@charset=(.+)@', $meta->content, $match)) {
            $charset = $match[1];
        }
    }
    if (!in_array(strtolower($charset), array_map('strtolower', mb_list_encodings()))) {
        $charset = 'auto';
    }
    return mb_convert_encoding($title, 'UTF-8', $charset);
}
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Nobu
  • 9,965
  • 4
  • 40
  • 47
-1

This version is for the German language, but you can modify the $CHARSETS and the $TESTCHARS.

class CharsetDetector
{
    private static $CHARSETS = array(
        "ISO_8859-1",
        "ISO_8859-15",
        "CP850"
    );

    private static $TESTCHARS = array(
        "€",
        "ä",
        "Ä",
        "ö",
        "Ö",
        "ü",
        "Ü",
        "ß"
    );

    public static function convert($string)
    {
        return self::__iconv($string, self::getCharset($string));
    }

    public static function getCharset($string)
    {
        $normalized = self::__normalize($string);
        if(!strlen($normalized))
            return "UTF-8";
        $best = "UTF-8";
        $charcountbest = 0;
        foreach (self::$CHARSETS as $charset)
        {
            $str = self::__iconv($normalized, $charset);
            $charcount = 0;
            $stop = mb_strlen($str, "UTF-8");

            for($idx = 0; $idx < $stop; $idx++)
            {
                $char = mb_substr($str, $idx, 1, "UTF-8");
                foreach (self::$TESTCHARS as $testchar)
                {
                    if($char == $testchar)
                    {
                        $charcount++;
                        break;
                    }
                }
            }

            if($charcount > $charcountbest)
            {
                $charcountbest = $charcount;
                $best = $charset;
            }
            //echo $text . "<br />";
        }
        return $best;
    }

    private static function __normalize($str)
    {
        $len = strlen($str);
        $ret = "";
        for($i = 0; $i < $len; $i++)
        {
            $c = ord($str[$i]);
            if ($c > 128) {
                $bytes = 1; // default, so $bytes is never read while undefined
                if (($c > 247))
                    $ret .= $str[$i];
                elseif
                    ($c > 239) $bytes = 4;
                elseif
                    ($c > 223) $bytes = 3;
                elseif
                    ($c > 191) $bytes = 2;
                else
                    $ret .= $str[$i];

                if (($i + $bytes) > $len)
                    $ret .= $str[$i];
                $ret2 = $str[$i];
                while ($bytes > 1)
                {
                    $i++;
                    $b = ord($str[$i]);
                    if ($b < 128 || $b > 191)
                    {
                        $ret .= $ret2;
                        $ret2 = "";
                        $i += $bytes-1;
                        $bytes = 1;
                        break;
                    }
                    else
                        $ret2 .= $str[$i];
                    $bytes--;
                }
            }
        }
        return $ret;
    }

    private static function __iconv($string, $charset)
    {
        return iconv ($charset, "UTF-8", $string);
    }
}
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131