I am currently working on a website that accepts input in English, Russian, and Ukrainian.
Users often submit form input containing characters like the trademark sign (™), Japanese characters (の), and German letters (Ö).
That's fine, but sometimes, when they copy and paste these characters from somewhere, they submit input like (0xD800 0xDC00), � (0xFFFD), (0x17), ¿ (0xBF), ½ (0xBD), and ï (0xEF). (By the way, there is a Ukrainian letter 'ї' whose value is 0x457.)
Later, when that input is converted into a UTF-8 XML document, it throws this error: "Input is not proper UTF-8, indicate encoding ! Bytes: 0x17 0xEF 0xBF 0xBD, line 13330, column 27".
Is there a way to validate user input and catch these 'broken' characters?
I was thinking about converting every character of the input string to its hex value and then comparing it against an array of all the illegal hex values. The problem with this approach is that I don't know all the possible codes for 'broken' characters. I know that 0xEF 0xBF 0xBD appears often, but I don't know how many more there are. A rough sketch of what I had in mind is below.
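To make the idea concrete, here is a minimal sketch of the blacklist approach (Python is used purely for illustration; the site itself may run on something else, and the `ILLEGAL` set and sample strings are just placeholders for the few values I have actually seen):

```python
# Rough sketch of the blacklist idea (Python used purely for illustration).
# The obvious flaw: ILLEGAL only lists the few code points I've personally
# seen break the XML export, so the set is incomplete by definition.
ILLEGAL = {0x17, 0xFFFD, 0xD800, 0xDC00}

def has_broken_chars(text: str) -> bool:
    """Return True if any character's code point is in the blacklist."""
    return any(ord(ch) in ILLEGAL for ch in text)

print(has_broken_chars('Привіт \x17 світ'))  # True  - contains the 0x17 control char
print(has_broken_chars('Привіт, світ'))      # False - ordinary Ukrainian text passes
```

Even this toy version shows the problem: the blacklist has to be maintained by hand, and I have no way of knowing whether it is complete.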
Any suggestions?