I'm trying to test my code to make sure it only accepts UTF-8 characters. The user can send a name as input, and I want to make sure this name is not anything other than UTF-8.

I know that you can build a non-UTF-8 character in binary form, but as far as I know the user can't send input in binary form. What is an example of a character that a user can type in that is not supported in UTF-8?

BTW, I'm writing my code in PHP and the default encoding is UTF-8.

Ali_IT
  • You can find more about UTF-8 test data here: http://stackoverflow.com/questions/1319022/really-good-bad-utf-8-example-test-data – Hardy Mathew Apr 16 '15 at 15:48
  • Any single byte with the high bit set, e.g. `0xF?`, would be invalid UTF-8. – Marc B Apr 16 '15 at 15:49
  • @MarcB If this is passed as the input, wouldn't the input count each one of these as a different character? Like treating 0xF as the three characters '0', 'x' and 'F'? – Ali_IT Apr 16 '15 at 15:52
  • That is just a notation for bytes. How else do you represent a byte in writing? He means any standalone byte (without a following matching sequence) in the range `0xF0` to `0xFF`. – deceze Apr 16 '15 at 15:55
  • You should also test UTF-8 sequences that aren't valid characters. – Ignacio Vazquez-Abrams Apr 16 '15 at 16:18
  • @deceze I know that 0xF is just a representation, but how else can I pass these bytes as part of an input to my program? My program needs an input; how can you pass this non-UTF-8 character as the input? – Ali_IT Apr 16 '15 at 16:46
  • If you're writing PHP, `$str = "\xF0"` will produce such a byte. – deceze Apr 16 '15 at 18:12

1 Answer

If you're looking for strings to test against, here are a few:

  • İnanç Esasları
  • hello wörld
  • pythön!

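A simple check for whether your string contains multi-byte (non-ASCII) characters:

if (mb_strlen($var, 'UTF-8') != strlen($var)) {
  // $var contains at least one multi-byte character
}

Note that this comparison does not detect invalid byte sequences; `mb_check_encoding($var, 'UTF-8')` is the function that validates the encoding itself.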
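
For the validity test the question asks about, a minimal sketch using the mbstring function `mb_check_encoding` might look like the following. The helper name `isValidUtf8` and the sample inputs are illustrative assumptions, not part of the original answer; the invalid samples are built with `\x` byte escapes, as suggested in the comments.

```php
<?php
// Hypothetical helper: returns true only if $input is a well-formed UTF-8 byte sequence.
// Requires the mbstring extension.
function isValidUtf8(string $input): bool
{
    return mb_check_encoding($input, 'UTF-8');
}

// Sample inputs (assumed for illustration). Byte escapes keep the script
// independent of the encoding the source file is saved in.
$samples = [
    'plain ASCII'       => 'hello',
    'valid multi-byte'  => "hello w\xC3\xB6rld", // "ö" encoded as C3 B6
    'lone lead byte'    => "\xF0",               // starts a 4-byte sequence that never continues
    'truncated 2-byte'  => "caf\xC3",            // lead byte with no continuation byte
    'byte never used'   => "\xFF",               // 0xFF cannot appear anywhere in UTF-8
];

foreach ($samples as $label => $value) {
    printf("%-18s %s\n", $label, isValidUtf8($value) ? 'valid' : 'invalid');
}
```

Outside a test script, a client can deliver such bytes without typing them, for example by percent-encoding them in a request (`?name=%F0`), which PHP decodes to the raw byte before your code sees it.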
fmitchell