I'm using .NET 4.5 and I'm trying to parse a URI query string into a NameValueCollection. The right way seems to be to use HttpUtility.ParseQueryString(string query), which takes the string obtained from Uri.Query and returns a NameValueCollection. Uri.Query returns a string that is escaped according to RFC 2396, and HttpUtility.ParseQueryString(string query) expects a string that is URL-encoded. Assuming RFC 2396 escaping and URL-encoding are the same thing, this should work fine.

However, the documentation for ParseQueryString claims that it "uses UTF8 format to parse the query string". There is also an overloaded method which takes a System.Text.Encoding and then uses that instead of UTF8.

My question is: what does it mean to use UTF8 as the encoding? The input is a string, which by definition (in C#) is UTF-16. How is that interpreted as UTF-8? What is the difference between using UTF8 and UTF16 as the encoding in this case? My concern is that since I'm accepting arbitrary user input, there might be some security risk if I botch the encoding (e.g. the user might be able to slip through some script exploit).

There is a previous question on this topic (How to parse a query string into a NameValueCollection in .NET), but it doesn't specifically address the encoding problem.

Sten L

1 Answer

When parsing percent-encoded values, ParseQueryString interprets the decoded bytes as UTF-8. Take the character ¢, for example. Its UTF-8 encoding is the two bytes C2 A2, so in a query string it would be encoded as %C2%A2.
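
To make the ¢ example concrete, here is a minimal sketch that prints the UTF-8 bytes of the character and its percent-encoded form. The class and method names (CentSignEncoding, Utf8Hex) are made up for illustration; Encoding.UTF8.GetBytes, BitConverter.ToString and Uri.EscapeDataString are standard .NET calls.

```csharp
using System;
using System.Text;

static class CentSignEncoding
{
    // Hypothetical helper: hex dump of the UTF-8 bytes of a string.
    public static string Utf8Hex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s));
    }

    static void Main()
    {
        string cent = "\u00A2"; // the character ¢ (a single UTF-16 code unit)

        // The two bytes that represent ¢ in UTF-8.
        Console.WriteLine(Utf8Hex(cent));              // C2-A2

        // Uri.EscapeDataString percent-encodes those UTF-8 bytes.
        Console.WriteLine(Uri.EscapeDataString(cent)); // %C2%A2
    }
}
```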

Now, when ParseQueryString is decoding, it needs to know which encoding to use to turn those bytes back into characters. The default is UTF-8, meaning that the character would be decoded correctly. But perhaps the user was using Microsoft's Cyrillic code page (Windows-1251), where C2 and A2 are two separate characters (В and ў). In that case, interpreting the bytes as UTF-8 would be an error.
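
The difference can be seen by decoding the same query string with both overloads' encodings. This is only a sketch, assuming .NET Framework with a reference to System.Web, as in the question; the class name, the DecodeValue helper and the key "q" are invented for the example.

```csharp
using System;
using System.Collections.Specialized;
using System.Text;
using System.Web; // HttpUtility; assumes a reference to System.Web.dll

static class QueryDecodingDemo
{
    // Hypothetical helper: parse the query with the given encoding
    // and return the value stored under the given key.
    public static string DecodeValue(string query, string key, Encoding encoding)
    {
        NameValueCollection values = HttpUtility.ParseQueryString(query, encoding);
        return values[key];
    }

    static void Main()
    {
        string query = "q=%C2%A2";

        // Bytes C2 A2 read as UTF-8: the single character ¢ (U+00A2).
        string asUtf8 = DecodeValue(query, "q", Encoding.UTF8);

        // The same bytes read as Windows-1251: the two characters В (U+0412) and ў (U+045E).
        // Assumption: on .NET Core / .NET 5+ this code page is only available after
        // registering CodePagesEncodingProvider.
        string asCyrillic = DecodeValue(query, "q", Encoding.GetEncoding(1251));

        Console.WriteLine(asUtf8 == "\u00A2");           // True
        Console.WriteLine(asCyrillic == "\u0412\u045E"); // True
    }
}
```

The comparisons are printed as booleans so the result does not depend on the console's own output encoding.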

If this is a user interface application (i.e. the user is entering data directly), then you probably want to use whatever encoding is defined for the current UI culture. If you're getting this information from Web pages, then you'll want to use whatever encoding the page uses. And if you're writing a Web service, then you can tell the users that their input has to be UTF-8 encoded.

Tim Schmelter
Jim Mischel
  • Perfect answer. Thanks. I now found that this is explained in the relevant RFC, section 2.1: ["URI and non-ASCII characters"](http://www.ietf.org/rfc/rfc2396.txt). To summarize, there is no way of knowing what encoding the client (user) intended, and no way for the server to communicate what encoding is expected in a URI. This information has to be provided out-of-band. – Sten L Apr 20 '12 at 07:48