How do I detect whether I have to apply UTF-8 decode or encode to a string?

Question

I have a feed taken from third-party sites, and sometimes I have to apply utf8_decode and other times utf8_encode to get the desired visible output.

If the same function is applied twice by mistake, or the wrong one is used, the output gets even uglier. This is what I want to fix.

How can I detect which of the two, if either, has to be applied to a given string?

The feed is actually declared as UTF-8, but some parts of its content are not.

Should we assume that the feed declares certain charset but uses another one? — Álvaro González, Dec 10 '10 at 10:31

score 64 · Accepted Answer · edited Jul 08 '19 at 13:38

I can't say I can rely on mb_detect_encoding(). I had some freaky false positives a while back.

The most universal way I found to work well in every case was:

if (preg_match('!!u', $string))
{
   // This is UTF-8
}
else
{
   // Definitely not UTF-8
}
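
A comment on this answer mentions wrapping the check in a helper that converts a string only when it is not already UTF-8. A minimal sketch of that idea follows. The function name `ensure_utf8` and the ISO-8859-1 fallback are assumptions made for illustration, not part of the answer, and the conversion relies on the mbstring extension.

```php
<?php
// Hypothetical helper: returns $string unchanged if it is already valid UTF-8,
// otherwise converts it from an assumed fallback encoding.
function ensure_utf8(string $string, string $fallback = 'ISO-8859-1'): string
{
    // With the u modifier, preg_match() returns false when the subject is
    // not valid UTF-8, and 1 when the empty pattern matches.
    if (preg_match('!!u', $string) === 1) {
        return $string;
    }

    // Assumption: anything that is not valid UTF-8 is in $fallback.
    return mb_convert_encoding($string, 'UTF-8', $fallback);
}

var_dump(ensure_utf8("caf\xC3\xA9") === "caf\xC3\xA9"); // already UTF-8, left alone
var_dump(ensure_utf8("caf\xE9") === "caf\xC3\xA9");     // Latin-1 byte converted
```

The check only says whether the bytes form valid UTF-8. It cannot tell which legacy encoding the other strings use, so the fallback has to come from knowledge of the feed.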

edited Jul 08 '19 at 13:38

Peter Mortensen

answered Dec 10 '10 at 10:42

bisko

3

+1 Implemented a utf8_validate() that uses your solution to convert a string to utf8 if it isn't, works as a charm! – Max Kielland Feb 03 '11 at 22:44
4

Thank you! That's a very clever trick ;-) Since I had absolutely no friggin' clue how it worked, I delved into the PHP documentation to find [this](http://us2.php.net/manual/en/reference.pcre.pattern.modifiers.php): `u (PCRE8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5. ` Anyway, thanks a bunch! – Edward Mar 02 '11 at 17:32
2

that dot in regexp is not even needed `preg_match('!!u', $str)` works ok – rsk82 Jul 17 '11 at 12:39
1

That dot will make it even return 0 (equals false) for empty strings. But empty strings are valid UTF-8 ;). – hakre Jul 17 '11 at 14:01
5

"It's just an empty regular expression. ! is the delimiter and u is the modifier." The solution is indeed clever, but needed some more verbose explanation, so I asked about it - http://stackoverflow.com/questions/10855682/explain-this-utf-8-detection-regex – starlocke Jun 01 '12 at 19:29
Yupp, had a false positive right now on a pretty obvious CSV file. This solution works. – soger Jun 16 '22 at 15:30

score 7 · Answer 2 · answered May 13 '15 at 07:01

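// Undo one level of UTF-8 double encoding, if present.
// Note: utf8_decode() is deprecated as of PHP 8.2.
function str_to_utf8($str) {
    // Map the UTF-8 code points back to single ISO-8859-1 bytes.
    $decoded = utf8_decode($str);
    // If the result is not valid UTF-8, the input was not double-encoded.
    if (mb_detect_encoding($decoded, 'UTF-8', true) === false) {
        return $str;
    }
    return $decoded;
}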

var_dump(str_to_utf8("« Chrétiens d'Orient » : la RATP fait marche arrière"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
var_dump(str_to_utf8("Â« ChrÃ©tiens d'Orient Â» : la RATP fait marche arriÃ¨re"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
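
One caveat with the function above: utf8_decode() replaces any character that has no ISO-8859-1 equivalent with `?`, so valid UTF-8 text outside that range (Japanese, for example) would come back as question marks. The sketch below is one possible way to guard against that. The name `fix_double_utf8` is hypothetical, it uses mb_convert_encoding() from the mbstring extension in place of utf8_decode(), and like the original it assumes the double encoding went through ISO-8859-1 rather than Windows-1252.

```php
<?php
// Hypothetical variant: only undo double encoding when it is safe to do so.
function fix_double_utf8(string $str): string
{
    // Input that is not valid UTF-8 cannot be double-encoded UTF-8.
    if (!mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }

    // A Latin-1 -> UTF-8 mis-encode can only produce code points up to
    // U+00FF. Anything above that would be lost when decoding, so bail out.
    if (preg_match('/^[\x{00}-\x{FF}]*$/u', $str) !== 1) {
        return $str;
    }

    // Map each code point back to a single byte.
    $decoded = mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');

    // Keep the decoded form only if those bytes are themselves valid UTF-8.
    return mb_check_encoding($decoded, 'UTF-8') ? $decoded : $str;
}

var_dump(fix_double_utf8("Chr\xC3\x83\xC2\xA9tiens") === "Chr\xC3\xA9tiens"); // double-encoded, repaired
var_dump(fix_double_utf8("Chr\xC3\xA9tiens") === "Chr\xC3\xA9tiens");         // correct, untouched
```

This remains a heuristic: a string that legitimately contains a sequence such as `Ã©` would be "repaired" as well.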

score 4 · Answer 3 · edited Jul 08 '19 at 13:39

You can use mb_detect_encoding(), which detects the character encoding of a string.

The character set might also be available in the HTTP response headers or in the response data itself.

Example:

var_dump(
    mb_detect_encoding(
        file_get_contents('http://stackoverflow.com/questions/4407854')
    ),
    $http_response_header
);

Output (codepad):

string(5) "UTF-8"
array(9) {
  [0]=>
  string(15) "HTTP/1.1 200 OK"
  [1]=>
  string(33) "Cache-Control: public, max-age=11"
  [2]=>
  string(38) "Content-Type: text/html; charset=utf-8"
  [3]=>
  string(38) "Expires: Fri, 10 Dec 2010 10:40:07 GMT"
  [4]=>
  string(44) "Last-Modified: Fri, 10 Dec 2010 10:39:07 GMT"
  [5]=>
  string(7) "Vary: *"
  [6]=>
  string(35) "Date: Fri, 10 Dec 2010 10:39:55 GMT"
  [7]=>
  string(17) "Connection: close"
  [8]=>
  string(21) "Content-Length: 34119"
}
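
To act on the declared charset rather than just dump it, the header lines can be scanned for the `charset` parameter. This is a rough sketch: the function name `charset_from_headers` is made up for illustration, and it assumes the headers arrive as an array of strings like `$http_response_header` above.

```php
<?php
// Hypothetical helper: extract the charset from a list of raw header lines.
// Returns null when no Content-Type header declares one.
function charset_from_headers(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $header) {
        // Matches e.g. "Content-Type: text/html; charset=utf-8"
        // and the quoted form charset="utf-8".
        if (preg_match('/^Content-Type:.*?\bcharset\s*=\s*"?([^";\s]+)/i', $header, $m)) {
            // Keep the last match, in case several responses are listed.
            $charset = strtolower($m[1]);
        }
    }
    return $charset;
}

$headers = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
];
var_dump(charset_from_headers($headers)); // string(5) "utf-8"
```

As the question notes, a declared charset is only a claim by the server, so it is worth validating the body against it before trusting it.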

score 0 · Answer 4 · edited Jul 08 '19 at 13:39

The feed (I guess you mean some kind of XML-based feed) should have an attribute in the header telling you what the encoding is. If not, you are out of luck as you don't have a reliable means of identifying the encoding.
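
For an XML feed, that attribute is the `encoding` pseudo-attribute of the XML declaration on the first line. A minimal sketch of reading it follows. The function name `xml_declared_encoding` is hypothetical, and the regular expression only handles ASCII-compatible encodings (it would not find a declaration in a UTF-16 document).

```php
<?php
// Hypothetical helper: read the encoding named in the XML declaration.
// Returns null when there is no declaration or it names no encoding.
function xml_declared_encoding(string $xml): ?string
{
    // The declaration must be the very first thing in the document,
    // optionally preceded by a UTF-8 byte order mark.
    $pattern = '/^(?:\xEF\xBB\xBF)?<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']/';

    if (preg_match($pattern, $xml, $m) === 1) {
        return $m[1];
    }
    return null;
}

var_dump(xml_declared_encoding('<?xml version="1.0" encoding="ISO-8859-1"?><rss/>'));
// string(10) "ISO-8859-1"
```

When the declaration is missing, XML parsers assume UTF-8 or UTF-16, so a null result here does not by itself mean the encoding is unknown.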

edited Jul 08 '19 at 13:39

Peter Mortensen

answered Dec 10 '10 at 10:30

Femaref

score 0 · Answer 5 · answered Dec 10 '10 at 10:34

Encoding autodetection is not bulletproof, but you can try mb_detect_encoding(). See also mb_check_encoding().
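
The two functions answer different questions: mb_check_encoding() validates a string against one named encoding, while mb_detect_encoding() picks from a list of candidates. A short sketch of both is below. The sample strings and the candidate list are chosen only for illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
$utf8   = "caf\xC3\xA9"; // "café" encoded as UTF-8
$latin1 = "caf\xE9";     // "café" encoded as ISO-8859-1

// Validation: is this byte sequence well-formed in the given encoding?
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// Detection: strict mode with an explicit candidate list. UTF-8 is ruled
// out for $latin1 because a lone 0xE9 byte is not valid UTF-8.
$detected = mb_detect_encoding($latin1, ['UTF-8', 'ISO-8859-1'], true);
var_dump($detected);
```

Validation gives a definite yes or no, whereas detection is a guess among the candidates you supply, since almost any byte sequence is valid in a single-byte encoding such as ISO-8859-1.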

answered Dec 10 '10 at 10:34

Álvaro González
