3

After converting my site to use UTF-8, I'm now faced with the prospect of validating all incoming UTF-8 data to ensure it's valid and coherent.

There seem to be various regexps and PHP APIs for detecting whether a string is UTF-8, but the ones I've seen seem incomplete (for example, regexps that validate UTF-8 but still allow invalid third bytes).

I'm also concerned about detecting (and preventing) overlong encodings, such as ASCII characters encoded as multibyte UTF-8 sequences.
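
To make the overlong concern concrete, here is a minimal sketch of what such a sequence looks like at the byte level. The helper name `overlong2` is made up for illustration; it packs a 7-bit code point into the two-byte layout `110xxxxx 10yyyyyy`, which a conforming UTF-8 decoder is supposed to reject.

```php
<?php
// Hypothetical helper (not part of any library): builds the INVALID
// two-byte overlong form of a code point in the range 0x00..0x7F.
function overlong2(int $codePoint): string
{
    // Lead byte 110xxxxx carries the top bits, continuation byte
    // 10yyyyyy carries the low six bits.
    return chr(0xC0 | ($codePoint >> 6)) . chr(0x80 | ($codePoint & 0x3F));
}

// "A" is U+0041 and should only ever be encoded as the single byte 0x41.
echo bin2hex(overlong2(0x41)), "\n";
```

Any validator worth using should flag the bytes this helper produces as invalid.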

Any suggestions or links welcome!

Eddy Freddy
carpii
  • It seems it's illegal to encode an ASCII character as a multibyte sequence in UTF-8. Trying to decode the overlong two-byte form of 'A' with `(chr(0b11000001) + chr(0b10000001)).decode('utf-8')` makes Python complain. – millimoose Oct 23 '11 at 21:56
  • @Kerrek, OK, thanks for the pointer. I'm still finding my way around Stack Overflow. – carpii Oct 23 '11 at 22:53

2 Answers

8

mb_check_encoding() is designed for this purpose:

mb_check_encoding($string, 'UTF-8');
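
As a rough illustration of how this behaves, the sketch below runs the check over a few hand-picked byte strings. The sample inputs and labels are my own assumptions, not part of the answer; only `mb_check_encoding()` itself comes from it, and the `mbstring` extension must be loaded.

```php
<?php
// Hypothetical sample inputs chosen for illustration.
$samples = [
    'plain ASCII'      => 'hello',
    'valid 2-byte'     => "\xC3\xA9",  // U+00E9
    'overlong "A"'     => "\xC1\x81",  // ASCII "A" padded out to two bytes
    'truncated 3-byte' => "\xE2\x82",  // final continuation byte missing
];

foreach ($samples as $label => $bytes) {
    // Returns true only if the whole string is well-formed UTF-8.
    $ok = mb_check_encoding($bytes, 'UTF-8');
    printf("%-18s %s\n", $label, $ok ? 'valid' : 'invalid');
}
```

In practice you would call it once per incoming field and reject the request (or the field) when it returns false.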
BenMorel
  • +1, this is the better solution. Should have looked into that before starting tinkering with `iconv`. – Jon Oct 23 '11 at 21:55
  • Dependent on PHP version (check the link in the answer). – Jared Farrish Oct 23 '11 at 21:56
  • If needed, a while ago I wrote a pure PHP version, which you can find [here](http://www.php.net/manual/en/function.utf8-encode.php#39986) (there's room for improvement, but it works). – BenMorel Oct 23 '11 at 21:59
  • Thanks Benjamin, doing some more testing but it does seem mb_check_encoding is handling everything I can throw at it (including overlong sequences). A PHP API without caveats, can it really be?! :) – carpii Oct 23 '11 at 23:20
1

iconv offers a couple of ways to tell whether a byte sequence is valid UTF-8.

Ask it to convert from UTF-8 to UTF-8 and compare the result with the input:

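$str = "\xfe\x20"; // Invalid UTF-8
// iconv() returns false on illegal input and emits a notice, hence the @.
$conv = @iconv('UTF-8', 'UTF-8', $str);
// Strict comparison, so a false return is never loosely equal to the input.
if ($conv !== $str) {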
    print("Input was not a valid UTF-8 sequence.\n");
}

Asking for the length of the string in characters:

$str = "\xfe\x20"; // Invalid UTF-8
if (@iconv_strlen($str, 'UTF-8') === false) {
    print("Input was not a valid UTF-8 sequence.\n");
}
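
If you go this route, the second check can be wrapped in a small predicate. This is only a sketch: the function name `isValidUtf8ViaIconv` is hypothetical, it assumes the `iconv` extension is available, and the exact set of sequences rejected depends on the underlying iconv implementation.

```php
<?php
// Hypothetical wrapper around the iconv_strlen() check shown above.
function isValidUtf8ViaIconv(string $bytes): bool
{
    // iconv_strlen() returns false (and emits a notice, suppressed by @)
    // when it meets an illegal sequence. Compare strictly, because a
    // valid empty string yields 0, which is loosely equal to false.
    return @iconv_strlen($bytes, 'UTF-8') !== false;
}

var_dump(isValidUtf8ViaIconv('hello'));
var_dump(isValidUtf8ViaIconv("\xfe\x20"));
```

The strict `!== false` comparison matters here; a loose check would misreport the empty string as invalid.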
Jon
  • @JaredFarrish: Because it emits notices on encountering the invalid sequence. – Jon Oct 23 '11 at 21:54
  • Oh the horror! A legitimate use of the `@` suppressor? I must be dreaming. `;)` – Jared Farrish Oct 23 '11 at 21:57
  • @JaredFarrish: This is small change. Read [this one](http://stackoverflow.com/questions/2702744/cheking-and-error-on-a-php-function/2702909#2702909) and tell me... ;) – Jon Oct 23 '11 at 22:00
  • What do you mean by a *small change*? (Nice answer in the linked question, btw. Very interesting, just not sure what I was supposed to infer.) – Jared Farrish Oct 23 '11 at 22:09
  • @JaredFarrish: If I'm not mistaken, "small change" means "not a great deal" (English is not my native language). That other answer was what I'd call a "legitimate use of the @ operator". – Jon Oct 23 '11 at 22:10
  • Yes, you're right; that's what it means. I'm just not sure what you mean by its use here, in the context of what many tend to say about this operator ("Absolutely not! Never! DON'T USE UNDER ANY CIRCUMSTANCES!"). This may be a "small thing" here, but it's useful, and doesn't appear to have any ill or unwarranted effects that I can tell. – Jared Farrish Oct 23 '11 at 22:13
  • So I was being [facetious](http://dictionary.reference.com/browse/facetious) about those who by reflex denounce any and all use of the `@` operator. `:)` – Jared Farrish Oct 23 '11 at 22:15
  • Is there a reason I would choose to use iconv over mb_check_encoding? That is, are there any edge cases where iconv would reject a bad sequence that mb_check_encoding would consider valid? – carpii Oct 23 '11 at 23:19
  • @carpii: None that I know of, and none that would make sense. Go with `mb_check_encoding`. – Jon Oct 23 '11 at 23:21