2

I am currently working on a website which accepts input in English, Russian and Ukrainian.

Users often submit forms with characters such as the trademark sign (™), Japanese characters (の) and German letters (Ö).

That's fine, but sometimes, when they copy-paste these characters from somewhere, they submit input like (0xD800 0xDC00), � (0xFFFD), (0x17), ¿ (0xBF), ½ (0xBD), and ï (0xEF). (By the way, there is a Ukrainian letter 'ї' whose code point is 0x457.)

Later, when that input is converted to UTF-8 XML, the conversion fails with this error: "Input is not proper UTF-8, indicate encoding ! Bytes: 0x17 0xEF 0xBF 0xBD, line 13330, column 27".

Is there a way to validate these 'broken' characters in user input?

I was thinking about converting every character of the input string to its hex value and comparing it against an array of all the illegal hex values. The problem with this approach is that I don't know all the possible codes for 'broken' characters. I know that 0xEF 0xBF 0xBD appears often, but I don't know how many more there are.

Any suggestions?

Roman
  • Possible duplicate of [Remove non-utf8 characters from string](http://stackoverflow.com/questions/1401317/remove-non-utf8-characters-from-string) – iainn Aug 30 '16 at 14:13

2 Answers

2

If the web page containing the form is encoded as UTF-8, every modern browser should submit form fields encoded as valid UTF-8. (You should still verify that on the server though.) I think what's happening here is something different. The byte sequence

0x17 0xEF 0xBF 0xBD

is valid UTF-8: U+0017 END OF TRANSMISSION BLOCK followed by U+FFFD REPLACEMENT CHARACTER. But you mentioned XML processing, and U+0017 is invalid in XML 1.0. XML 1.0 only allows

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

(XML 1.1 lifts this restriction partially.) I'd suggest replacing the ASCII control characters that aren't allowed in XML with the replacement character before passing the value to XML processing functions:

$value = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', "\xEF\xBF\xBD", $value);

Or, also including U+FFFE and U+FFFF:

$value = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x{FFFE}\x{FFFF}]/u', "\xEF\xBF\xBD", $value);
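
The two steps discussed in this answer (checking on the server that the submitted bytes are well-formed UTF-8, then replacing characters that XML 1.0 does not allow) could be combined roughly as follows. This is only a sketch: the function names `is_valid_utf8` and `sanitize_for_xml` are made up for illustration, and returning `null` for ill-formed input is an assumption about how you want to handle rejection. It relies on the fact that with the `/u` modifier, `preg_match()` returns `false` for a subject that is not valid UTF-8.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: true when $value is well-formed UTF-8.
 * With the /u modifier PCRE refuses ill-formed subjects, so
 * preg_match() returns false instead of 1 for the empty pattern.
 */
function is_valid_utf8(string $value): bool
{
    return preg_match('//u', $value) === 1;
}

/**
 * Hypothetical helper: replaces every character outside the XML 1.0
 * Char production with U+FFFD. Returns null when the input is not
 * well-formed UTF-8, so the caller can reject the form field.
 */
function sanitize_for_xml(string $value): ?string
{
    if (!is_valid_utf8($value)) {
        return null;
    }

    // Negated class built from the XML 1.0 rule quoted above:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    return preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        "\u{FFFD}",
        $value
    );
}
```

For example, `sanitize_for_xml("a\x17b")` yields the string `a`, U+FFFD, `b`, while a truncated sequence such as `"\xEF\xBF"` yields `null`. Using a negated class means you don't have to enumerate the 'broken' code points one by one; anything outside the allowed ranges is replaced.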
nwellnhof
-2

Maybe ISO-8859-1 works.

I don't know if this is the answer, but you can try it.

Vincent Toonen