We have a CGI program that processes POST-ed forms. Some of the POST-ed text can contain non-ASCII characters, and browsers already helpfully convert these to UTF-8.
I need to "harden" the program to reject invalid strings, i.e. non-ASCII strings that are not valid UTF-8 either.
I thought I'd rely on mbstowcs():
setlocale(LC_CTYPE, "en_US.UTF-8");
/* With a NULL destination, mbstowcs() only scans and counts:
   it returns (size_t)-1 if foo is not a valid multibyte
   (here: UTF-8) string in the current locale. */
unilen = mbstowcs(NULL, foo, 0);
if (unilen == (size_t)-1) {
    ... report an error ...
}
However, I am having a hard time validating the method — it accepts valid strings alright, but I can't come up with an invalid one for it to reject...
Could someone please confirm that this is a proper approach and/or suggest an alternative?
Note that I don't care about the actual result of the conversion: once I'm confident the string is valid UTF-8, I copy it into an e-mail (with UTF-8 charset) and let the recipient's e-mail program deal with it. The only reason I bother with the validation is to ensure the form is not used to propagate arbitrary binaries (such as viruses).
Thanks!