
I found a useful function in another answer and I wonder if someone could explain what it is doing and whether it is reliable. I was using mb_detect_encoding(), but it gave incorrect results when reading an ISO 8859-1 file on Linux.

This function seems to work in all cases I tested.

Here is the question: Get file encoding

Here is the function:

function isUTF8($string){
    return preg_match('%(?:
    [\xC2-\xDF][\x80-\xBF]              # Non-overlong 2-byte
    |\xE0[\xA0-\xBF][\x80-\xBF]         # Excluding overlongs
    |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # Straight 3-byte
    |\xED[\x80-\x9F][\x80-\xBF]         # Excluding surrogates
    |\xF0[\x90-\xBF][\x80-\xBF]{2}      # Planes 1-3
    |[\xF1-\xF3][\x80-\xBF]{3}          # Planes 4-15
    |\xF4[\x80-\x8F][\x80-\xBF]{2}      # Plane 16
    )+%xs', $string);
}

Is this a reliable way of detecting UTF-8 strings? What exactly is it doing? Can it be made more robust?

Peter Mortensen
Gary Willoughby
  • Why not use something like `mb_detect_encoding` (http://php.net/manual/en/function.mb-detect-encoding.php)? – summea Mar 14 '12 at 23:10
  • Just want to mention that this function thinks the string "1" is not UTF-8, while it is (to be clear, it is just ASCII, but ASCII should still count as UTF-8) – zerkms Mar 14 '12 at 23:11
  • @summea did you read the question at all? – Gary Willoughby Mar 14 '12 at 23:23
  • @GaryWilloughby Did read the question, but I don't recall seeing that you were using `mb_detect_encoding` earlier; sorry about that. I still think it's worth using `mb_detect_encoding` here, though... even if it's wrapped in something else. Check out this comment by Greg Tisza as well, if you have the chance, about using the "strict mode" (http://www.php.net/manual/en/function.mb-detect-encoding.php#102510) – summea Mar 14 '12 at 23:55
  • It should be noted that the function posted in the question does NOT actually detect if an arbitrary string is valid UTF-8. It only detects if the string CONTAINS "non-ascii multibyte sequences in the UTF-8 range". So a plain ascii string like "hello world" would fail the test. See my answer below for a more detailed explanation of where that function came from. – jnrbsn Mar 23 '12 at 19:31
  • Turns out PHP's `strlen` returns the byte length of a string, exactly like `count(str_split($string))`. If the string is UTF-8 encoded (containing accents or other multibyte characters), this byte length will be greater than the character count, so `$isUtf = (mb_strlen($string) != strlen($string))` performs the trick and in particular avoids UTF-8-encoding a string twice – Jack Aug 13 '21 at 07:03
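
The heuristic in the last comment can be sketched as follows; note that it only flags strings that actually contain multibyte sequences (mb_strlen counts characters, strlen counts bytes), so plain ASCII is never flagged:

```php
<?php
// Byte length vs. character length: they differ only when the string
// contains multibyte (non-ASCII) UTF-8 sequences.
$ascii = "hello";
$utf8  = "caf\xC3\xA9"; // "café" encoded as UTF-8

var_dump(mb_strlen($ascii, 'UTF-8') != strlen($ascii)); // bool(false)
var_dump(mb_strlen($utf8, 'UTF-8') != strlen($utf8));   // bool(true)
```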

6 Answers


If you do not know the encoding of a string, it is impossible to guess it with any degree of accuracy. That's why mb_detect_encoding simply does not work. If, however, you know what encoding a string should be in, you can check whether it is valid in that encoding using mb_check_encoding. It does more or less what your regex does, probably a little more comprehensively. It can answer the question "Is this sequence of bytes valid in UTF-8?" with a clear yes or no. That doesn't necessarily mean the string actually is encoded in that encoding, just that it may be. For example, it's impossible to distinguish any single-byte encoding using all 8 bits from any other such encoding. UTF-8, though, should be rather distinguishable, even though you can produce, for instance, Latin-1 encoded strings that also happen to be valid UTF-8 byte sequences.

In short, there's no way to know for sure. If you expect UTF-8, check if the byte sequence you received is valid in UTF-8, then you can treat the string safely as UTF-8. Beyond that there's hardly anything you can do.
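
A minimal sketch of that check (the byte strings are only illustrative):

```php
<?php
// "é" is 0xC3 0xA9 in UTF-8, but a lone 0xE9 in ISO-8859-1.
$utf8   = "caf\xC3\xA9";
$latin1 = "caf\xE9";

var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// The ambiguity in action: the Latin-1 bytes are also perfectly valid
// Latin-1 — validity in an encoding is not proof of intent.
var_dump(mb_check_encoding($latin1, 'ISO-8859-1')); // bool(true)
```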

deceze
  • I can test if the string is a valid UTF-8 string, but "Hello World" will also pass that test even if it is ASCII. Can you validate in a way that tells you whether it is valid Latin-1 or ASCII, and NOT UTF-8? – A.Grandt Jan 20 '13 at 13:33
  • "Hello World" is valid ASCII *and* Latin-1 *and* UTF-8! – deceze Jan 20 '13 at 14:08
  • I got around it by testing whether it is UTF-8; if it is, and it then fails the test for valid ASCII, I assume it's ASCII, else it's UTF-8. I needed this because I need to set a flag in the Zip file header structure if it IS UTF-8, but should not do so if it is not. – A.Grandt Jan 21 '13 at 17:21
  • "Hello World" is valid in **almost any encoding out there!!** If `mb_check_encoding` says it's valid in a certain encoding, then it is! They're all equally valid! Take your pick! Testing for ASCII and, when that fails, assuming it's ASCII nonetheless makes no sense whatsoever. – deceze Jan 22 '13 at 03:21
  • [What you really need to know about encodings before working with them](http://kunststube.net/encoding). – deceze Jan 22 '13 at 03:22
  • Thanks for the link. Btw, can you even make a string that tests as invalid ASCII? – A.Grandt Jan 23 '13 at 12:00

Well, it only checks whether the string contains byte sequences that happen to correspond to valid UTF-8 code points. However, it won't match any bytes in the range 0x00–0x7F, which is the ASCII-compatible subset of UTF-8.

EDIT: Incidentally, I am guessing the reason you thought mb_detect_encoding() "didn't work properly" is that your Latin-1 encoded file only used the ASCII-compatible subset, which is also valid in UTF-8. It's no wonder that mb_detect_encoding() would flag that as UTF-8, and it is "correct": if the data is plain ASCII, then the answer UTF-8 is just as good as Latin-1, ASCII, or any of the myriad extended-ASCII encodings.
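
For illustration, here is that ambiguity in code (using mb_detect_encoding's candidate-list and strict-mode parameters):

```php
<?php
// Pure ASCII bytes are simultaneously valid ASCII, Latin-1 and UTF-8,
// so detection simply reports the first candidate that matches.
var_dump(mb_detect_encoding("hello world", ['UTF-8', 'ISO-8859-1'], true)); // string "UTF-8"
var_dump(mb_detect_encoding("hello world", ['ASCII', 'UTF-8'], true));      // string "ASCII"
```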

user268396
  • The problem is I need to do further encoding, so I need to know exactly what it is before I encode it again. And yes, it's problems with the extended ASCII set. – Gary Willoughby Mar 14 '12 at 23:25

That will only detect whether part of the string is a formally valid UTF-8 sequence, ignoring characters encoded as a single code unit (i.e., code points in the ASCII range). For that function to return true, it suffices that there is one character that looks like a non-ASCII UTF-8 encoded character.
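
To see this concretely, here is the question's function (repeated so the snippet runs standalone) applied to a few byte strings; the invalid byte 0xFF is just an illustration:

```php
<?php
// The isUTF8() function from the question, unchanged.
function isUTF8($string){
    return preg_match('%(?:
    [\xC2-\xDF][\x80-\xBF]              # Non-overlong 2-byte
    |\xE0[\xA0-\xBF][\x80-\xBF]         # Excluding overlongs
    |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # Straight 3-byte
    |\xED[\x80-\x9F][\x80-\xBF]         # Excluding surrogates
    |\xF0[\x90-\xBF][\x80-\xBF]{2}      # Planes 1-3
    |[\xF1-\xF3][\x80-\xBF]{3}          # Planes 4-15
    |\xF4[\x80-\x8F][\x80-\xBF]{2}      # Plane 16
    )+%xs', $string);
}

var_dump(isUTF8("hello world"));       // int(0): pure ASCII never matches
var_dump(isUTF8("caf\xC3\xA9"));       // int(1): one multibyte sequence is enough
var_dump(isUTF8("\xFF caf\xC3\xA9"));  // int(1): invalid bytes elsewhere are ignored
```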

Artefacto

This may not be the answer to your question (maybe it is; see the update below), but it could be the answer to your problem. Check out my Encoding class, which has methods to convert strings to UTF8 no matter whether they are encoded in Latin1, Win1252, or UTF8 already, or a mix of them:

Encoding::toUTF8($text_or_array);
Encoding::toWin1252($text_or_array);
Encoding::toISO8859($text_or_array);

// fixes UTF8 strings converted to UTF8 repeatedly: 
//  "FÃÂédÃÂération" to "Fédération"
Encoding::fixUTF8($text_or_array);  

https://stackoverflow.com/a/3479832/290221

The function works byte by byte and figures out whether each one needs conversion.

Update:

Thinking a little bit more about it, this could in fact be the answer to your question:

require_once('Encoding.php');

function validUTF8($string){
    return Encoding::toUTF8($string) == $string;
}

And here is the Encoding class: https://github.com/neitanod/forceutf8

Sebastián Grignoli
  • That can only ever work on a best-guess basis. What if I *meant* to write "FÃÂédÃÂération"? Like on this very page, to demonstrate encoding problems. If you have screwed-up text due to encoding mis-treatments, you need to fix those mis-treatments, not the text. – deceze Mar 15 '12 at 02:14
  • Exactly. That's why that last functionality is separated from toUTF8(), into its own function. I made fixUTF8() to fix some files with a command line program. It's not intended for live websites. Nonetheless, I DO use toUTF8() on live websites. – Sebastián Grignoli Mar 15 '12 at 02:44
  • I also don't really understand the need to check each byte individually. Do you expect mixed-encoded strings? If so, your problem is elsewhere. Encodings are of such a nature that you really need to *know* what you're dealing with. You can't treat a string as a black box, you can only convert *from* one encoding *to* another. Without knowing what this *from* is you cannot reliably get the result you're looking for. While your class is quite a piece of work, I'd really recommend against using it. – deceze Mar 15 '12 at 03:23
  • Sorry for trying to drive this point home, but you're doing essentially exactly that, right? You try to confirm whether the string is valid in UTF-8; if it isn't, you *assume* it's Latin-1 or possibly Windows-1252. You could get largely the same effect using `if (!mb_check_encoding($str, 'UTF-8')) $str = iconv('ISO-8859-1', 'UTF-8', $str)`. Add a `preg_match` check for typical Windows-1252 byte sequences to *attempt* to differentiate between ISO-8859-1 and Windows-1252, which will never be 100% accurate. Instead of doing all that, you'd be better off knowing your encodings. – deceze Mar 15 '12 at 03:40
  • First of all, I was dealing with mixed encodings as you guessed. I agree with you about the importance of knowing your input, but there are scenarios in which, sadly, you will never know for sure. I've seen files informing an encoding and using another. – Sebastián Grignoli Mar 15 '12 at 03:53
  • Your approach does similar work, but it does not solve the mixed-encoding problem. At the time I developed the original ForceUTF8 function, I was receiving a feed of data compiled from several sources, and the compiler (I mean, the provider) was doing a lousy job normalizing the encoding. – Sebastián Grignoli Mar 15 '12 at 03:57
  • "attempt to differentiate between ISO-8859-1 and Windows-1252, which will never be 100% accurate". I disagree with that part. Differentiating between ISO-8859-1 and Windows-1252 is so straightforward that HTML5 defines that files labeled as ISO-8859-1 should simply be treated as Windows-1252. No problem there. – Sebastián Grignoli Mar 15 '12 at 04:00
  • Then please clearly emphasize that this class is primarily meant to fix broken documents. The naming `Encoding::toUTF8` has the same problem as `utf8_encode`: It suggests something which is not true, which is that you do not need to think about encodings. I'd always rather reject invalid encodings and try to get the providers to fix them than trying to work with broken documents. Attempting auto-detection and best-guess conversion is a last resort, not a normal modus operandi. – deceze Mar 15 '12 at 04:02
  • Sometimes the users are the providers, and you cannot educate them, no matter how hard you try. Some other times the data providers are your clients, and making them change anything is asking them to spend money and resources. They might rather look for another service provider than spend time and money talking to you. Believe me, what you're calling a last resort is a much better solution than rejecting half the data you have to process or using it without any conversion at all. – Sebastián Grignoli Mar 15 '12 at 04:25

Basically, no.

  • Any UTF8 string is a valid 8-bit encoding string (even if it produces gibberish).
  • On the other hand, most 8-bit encoded strings with extended (128+) characters are not valid UTF8, but, like any other random byte sequence, they might happen to be.
  • And, of course, any ASCII text is valid UTF8, so mb_detect_encoding is, in fact, correct in saying so. And no, you won't have any problems using ASCII text as UTF8; it's the reason UTF8 works in the first place.

As far as I understand, the function you supplied does not check the validity of the string; it only checks that it contains some sequences that happen to look like UTF8 ones, so it can misfire badly. You may want to use both this function and mb_detect_encoding in strict mode and hope that they cancel out each other's false positives.
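
One way to sketch that combination (the helper name here is hypothetical, not from the answer): accept only strings that validate as UTF-8 in strict mode and also contain at least one multibyte sequence:

```php
<?php
// Hypothetical helper: true only for strings that are valid UTF-8 AND
// contain at least one non-ASCII (multibyte) sequence.
function isMultibyteUTF8($string) {
    return mb_detect_encoding($string, 'UTF-8', true) === 'UTF-8'
        && preg_match('/[\xC2-\xF4][\x80-\xBF]/', $string) === 1;
}

var_dump(isMultibyteUTF8("hello"));        // bool(false): valid, but plain ASCII
var_dump(isMultibyteUTF8("caf\xC3\xA9"));  // bool(true)
var_dump(isMultibyteUTF8("caf\xE9"));      // bool(false): not valid UTF-8
```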

If the text is written in a non-Latin alphabet, a "smart" way to detect a multibyte encoding is to look for sequences of equally sized chunks of bytes starting with the same bits. For example, the Russian word "привет" looks like this:

11010000 10111111
11010001 10000000
11010000 10111000
11010000 10110010
11010000 10110101
11010001 10000010

This, however, won't work for Latin-based alphabets (and probably not for Chinese).
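
That byte pattern can be reproduced directly; this sketch assumes the source file itself is saved as UTF-8:

```php
<?php
// Each Cyrillic letter in "привет" is 2 bytes in UTF-8; dump them in
// binary to see the repeating 110xxxxx 10xxxxxx lead/continuation pattern.
foreach (str_split("привет") as $i => $byte) {
    printf("%08b%s", ord($byte), $i % 2 ? "\n" : " ");
}
// First line printed: 11010000 10111111
```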

a sad dude

The function in question (the one user pilif posted in the linked question) appears to have been taken from this comment on the mb_detect_encoding() page in the PHP manual:

As the author states, the function is only meant to "check if a string contains UTF-8 characters", and it only looks for "non-ascii multibyte sequences in the UTF-8 range". Therefore, the function returns false (zero, actually) if your string contains only plain ASCII characters (like English text), which is probably not what you want.

His function was based on another function in this previous comment on that same page which is, in fact, meant to check if a string is UTF-8 and was based on this regular expression created by someone at W3C.

Here is the original, correctly working function (I've tested it) that will tell you whether a string is UTF-8:

// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {

    // From http://w3.org/International/questions/qa-forms-utf-8.html
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs', $string);

} // function is_utf8
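
To contrast with the unanchored function from the question, a quick check (the is_utf8() definition is repeated so the snippet runs standalone):

```php
<?php
// is_utf8() as above: anchored with ^...$ and explicitly allowing ASCII.
function is_utf8($string) {
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs', $string);
}

var_dump(is_utf8("hello world")); // int(1): plain ASCII now passes
var_dump(is_utf8("caf\xE9"));     // int(0): a lone Latin-1 byte fails
```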
jnrbsn
  • By the way, one problem with using `mb_detect_encoding()` is that it does not support the "Mac OS Roman" (or "macintosh") character set, which is still somewhat commonly used on OS X. It will incorrectly identify it as UTF-8. – jnrbsn Mar 23 '12 at 16:34