
I do NOT want to check if a string in Python is in ASCII. :)

There is an interesting requirement in the HTTP Specification and I was wondering how it could be implemented and tested.

Recipients MUST parse an HTTP message as a sequence of octets in an encoding that is a superset of US-ASCII [USASCII].

Parsing an HTTP message as a stream of Unicode characters, without regard for the specific encoding, creates security vulnerabilities due to the varying ways that string processing libraries handle invalid multibyte character sequences that contain the octet LF (%x0A).

In another Stack Overflow answer, there is an example of a character set that is not a superset of US-ASCII. But I am more interested in testing that requirement, or at least in some kind of testing. The requirement just means that the parser has to pick a superset of ASCII when reading the data, but I was wondering about the case where you want to check beforehand whether there are any strange characters inside the message.

Let's say we have a message MSG.

def is_ascii_superset(MSG):
    """Take any string, and return True or False."""
    # test() is a placeholder for the check this question is asking about.
    return bool(test(MSG))

Any idea whether there is a list of all character sets that are supersets of ASCII?

UPDATE:

People seem to misunderstand the question. I'm not talking about checking whether a string is pure ASCII. That is trivial.

  • ISO-8859-1, UTF-8, etc. are supersets of ASCII.
  • JIS X 0208 is NOT a superset of ASCII.
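The distinction in the two bullets above can be checked per encoding rather than per message. The following is a minimal sketch, assuming Python's built-in codec registry; `codec_is_ascii_superset` is a hypothetical helper name, not something from the HTTP specification. It only checks that octets 0x00-0x7F round-trip as the same ASCII characters. It does not detect encodings that also reuse those octets inside multibyte sequences (Shift_JIS trail bytes, for example), so a True result is a necessary condition, not a full proof.

```python
def codec_is_ascii_superset(codec_name):
    """Return True if the codec maps octets 0x00-0x7F to the same ASCII characters.

    Hypothetical helper for illustration only.
    """
    ascii_bytes = bytes(range(128))
    ascii_text = ascii_bytes.decode("ascii")
    try:
        # Both directions must agree with plain ASCII.
        return (ascii_bytes.decode(codec_name) == ascii_text
                and ascii_text.encode(codec_name) == ascii_bytes)
    except (UnicodeError, LookupError):
        # Undecodable input or unknown codec name: treat as "not a superset".
        return False


for name in ("utf-8", "iso-8859-1", "utf-16", "cp037"):
    print(name, codec_is_ascii_superset(name))
```

Here `utf-16` and `cp037` (an EBCDIC code page) stand in for encodings that are not ASCII-compatible, since JIS X 0208 on its own is not available as a Python codec.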
karlcow

1 Answer

You don't have to test for it; you just have to treat everything as if it were in a superset of ASCII: always treat %x0A as LF, assume octets from %x00 through %x7F are ASCII, and don't try to parse multibyte sequences. A superset of ASCII may use every value of a byte, so there are no "strange" characters.
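A minimal sketch of that approach in Python, assuming the raw message is already available as `bytes`; `split_header_lines` is a hypothetical helper, not a complete HTTP parser. Line boundaries are found on the octets themselves, and each line is decoded only afterwards with ISO-8859-1, which maps every octet to exactly one character and so cannot fail or merge octets.

```python
def split_header_lines(raw):
    """Split the head of an HTTP message into lines, working on octets only.

    Hypothetical helper for illustration; it ignores the body and does no
    validation of the start line or header fields.
    """
    # Find the end of the header section on the raw octets.
    head, _, _body = raw.partition(b"\r\n\r\n")
    # Split on CRLF first, then decode each line one octet per character.
    return [line.decode("iso-8859-1") for line in head.split(b"\r\n")]


# The last header ends with a truncated UTF-8 sequence; it stays inside its
# own line because nothing here interprets multibyte sequences.
raw = b"GET / HTTP/1.1\r\nHost: example.org\r\nX-Test: \xe3\x81\r\n\r\nbody"
print(split_header_lines(raw))
```

Decoding the whole message as UTF-8 first and splitting afterwards is the pattern the specification warns about, because the handling of the invalid sequence would then depend on the string library.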

Pavel Anossov
  • This is the requirement from the specification for the parser. I get that. :) What I want to know is: is it testable? If yes, how. Maybe it's just not testable and it's ok. – karlcow Mar 11 '13 at 22:16
  • It's not testable in general. – Pavel Anossov Mar 11 '13 at 22:19
  • That is what I came to conclude when reading your answer, because an encoding could just treat a specific code point as another character. It makes sense. Thanks Pavel. – karlcow Mar 11 '13 at 22:24