
We have a CGI program that processes POSTed forms. Some of the posted text can contain non-ASCII characters, and browsers already helpfully convert these to UTF-8.

I need to "harden" the program to reject invalid strings — where a non-ASCII string is not a valid UTF-8 string either.

I thought I'd rely on mbstowcs():

#include <locale.h>   /* setlocale() */
#include <stdlib.h>   /* mbstowcs() */

setlocale(LC_CTYPE, "en_US.UTF-8");
/* With a NULL destination, mbstowcs() only scans foo and returns the
   length in wide characters, or (size_t)-1 on an invalid sequence. */
size_t unilen = mbstowcs(NULL, foo, 0);
if (unilen == (size_t)-1) {
    ... report an error ...
}

However, I am having a hard time validating the method: it accepts valid strings all right, but I can't come up with an invalid one for it to reject...

Could someone please confirm that this is a proper way, and/or suggest an alternative?

Note that I don't care about the actual result of the conversion; once I'm confident the string is valid UTF-8, I copy it into an e-mail (with UTF-8 charset) and let the recipient's e-mail program deal with it. The only reason I bother with the validation is to ensure the form is not used to propagate arbitrary binaries (such as viruses).

Thanks!

Mikhail T.

1 Answer


The function documentation says:

"If an invalid multibyte character is encountered, a (size_t)-1 value is returned."

So I believe your validation is pretty much fine; in my experience this value is indeed returned for invalid data. To be certain, you might submit an arbitrary hex sequence of even length and check that it is rejected.
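For example, a quick test along these lines (a minimal standalone sketch; the byte sequences are just sample malformed inputs, and the behaviour assumes a C library whose UTF-8 locale rejects lone continuation bytes, truncated sequences and overlong encodings):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    setlocale(LC_CTYPE, "en_US.UTF-8");

    /* Sample malformed inputs: a lone continuation byte, a lead byte
       with its continuation missing, and an overlong encoding of '/'. */
    const char *bad[] = { "\x80", "\xC3", "\xC0\xAF" };

    for (size_t i = 0; i < sizeof bad / sizeof bad[0]; i++) {
        size_t unilen = mbstowcs(NULL, bad[i], 0);
        printf("case %zu: %s\n", i,
               unilen == (size_t)-1 ? "rejected" : "accepted");
    }
    return 0;
}

With glibc in an en_US.UTF-8 locale, all three cases should print "rejected".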

If you are doubtful and need further validation, GNU iconv is a good alternative (see the sketch after the link below).

See also: UTF-8 validation on SO
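
A minimal sketch of the iconv route, for illustration only: converting from UTF-8 to UTF-8 makes iconv act as a pure validator, failing with EILSEQ (or EINVAL for a truncated tail) on malformed input. The helper name is_valid_utf8 is made up for this example.

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if buf is valid UTF-8, 0 if not, -1 if iconv is unavailable. */
static int is_valid_utf8(const char *buf, size_t len)
{
    /* UTF-8 -> UTF-8 "conversion" does nothing but validate the input. */
    iconv_t cd = iconv_open("UTF-8", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char *in = (char *)buf;
    size_t inleft = len;
    char out[256];

    while (inleft > 0) {
        char *outp = out;
        size_t outleft = sizeof out;
        if (iconv(cd, &in, &inleft, &outp, &outleft) == (size_t)-1) {
            if (errno == E2BIG)
                continue;           /* scratch buffer full: keep going */
            iconv_close(cd);        /* EILSEQ or EINVAL: invalid UTF-8 */
            return 0;
        }
    }
    iconv_close(cd);
    return 1;
}

int main(void)
{
    const char *ok  = "h\xC3\xA9llo";   /* valid UTF-8 for "héllo" */
    const char *bad = "\xC0\xAF";       /* overlong encoding */
    printf("%d %d\n", is_valid_utf8(ok, strlen(ok)),
                      is_valid_utf8(bad, strlen(bad)));
    return 0;
}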

fkl