61

After studying the HTTP/1.1 standard, specifically page 31 and the related sections, I came to the conclusion that any 8-bit octet can be present in an HTTP header value, i.e. any character with a code in the [0,255] range.

And yet the HTTP servers I tried refuse to accept anything with a code > 127 (or most US-ASCII non-printable characters).

Here is a condensed excerpt of the grammar used in the standard:

message-header = field-name ":" [ field-value ]
field-name     = token
field-value    = *( field-content | LWS )
field-content  = <the OCTETs making up the field-value and consisting of
                  either *TEXT or combinations of token, separators, and
                  quoted-string>

CR             = <US-ASCII CR, carriage return (13)>
LF             = <US-ASCII LF, linefeed (10)>
SP             = <US-ASCII SP, space (32)>
HT             = <US-ASCII HT, horizontal-tab (9)>
CRLF           = CR LF
LWS            = [CRLF] 1*( SP | HT )
OCTET          = <any 8-bit sequence of data>
CHAR           = <any US-ASCII character (octets 0 - 127)>
CTL            = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
TEXT           = <any OCTET except CTLs, but including LWS>

token          = 1*<any CHAR except CTLs or separators>
separators     = "(" | ")" | "<" | ">" | "@" | "," | ";" | ":" | "\"
               | <"> | "/" | "[" | "]" | "?" | "=" | "{" | "}" | SP | HT

quoted-string  = ( <"> *(qdtext | quoted-pair ) <"> )
qdtext         = <any TEXT except <">>
quoted-pair    = "\" CHAR

As you can see, field-content can be a quoted-string, which is a quoted sequence of TEXT (i.e. any 8-bit octet except " and values from the [0-8, 11-12, 14-31, 127] range) or quoted-pair (\ followed by any value from the [0, 127] range). In other words, any 8-bit character sequence can be passed by quoting it and prefixing special symbols with \.

(Note that the standard doesn't treat the NUL (0x00) character in any special way.)
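To make that reading concrete, here is a minimal Python sketch of quoting arbitrary octets as an RFC 2616 quoted-string. The helper name `quote_rfc2616` is hypothetical, and the choice to backslash-escape every CTL except HT is my own assumption about how to apply the grammar above, not something the standard prescribes:

```python
def quote_rfc2616(octets: bytes) -> bytes:
    """Wrap raw octets in a quoted-string per the RFC 2616 grammar above.

    Hypothetical helper: '"', '\\' and CTLs (except HT) become quoted-pairs;
    octets 128-255 are passed through unchanged as TEXT.
    """
    out = bytearray(b'"')
    for b in octets:
        is_ctl = b < 32 or b == 127
        if b in (0x22, 0x5C) or (is_ctl and b != 0x09):
            out += b"\\"  # quoted-pair = "\" CHAR, CHAR being 0-127
        out.append(b)
    out += b'"'
    return bytes(out)


if __name__ == "__main__":
    # UTF-8 bytes above 127 are left as-is; only the inner quotes are escaped.
    print(quote_rfc2616('naïve "x"'.encode("utf-8")))
```

This only shows what the grammar appears to permit; it says nothing about whether a given server accepts the result.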

But obviously either all the servers I tried are non-conforming, or the standard has changed since 1999, or I am misreading it.

So... which characters are allowed in HTTP header values and why?

P.S. The reason behind all of this: I am looking for a way to pass a UTF-8-encoded sequence in an HTTP header value (without additional encoding, if possible).

C.M.
  • Looks like no one really took this part of the standard seriously. I ended up simply [url-encoding](https://en.wikipedia.org/wiki/Percent-encoding) header values. – C.M. Dec 08 '17 at 03:01
  • Note that `separators` in `field-name` need to be encoded too. Also, if you use WinHTTP, you'll have to encode the single-quote character in `field-name`, or the request will fail. – C.M. Dec 18 '17 at 23:04
  • Hint: RFC 2616 is entirely irrelevant. Please see RFC 7230. – Julian Reschke Jan 07 '18 at 15:29
  • @JulianReschke I didn't know about RFC 7230. Is it an official HTTP/1.1 standard? (asking because greenbytes link points to a document marked "PROPOSED STANDARD") – C.M. Jan 08 '18 at 00:09
  • 4
    RFC 7230 did not rewrite RFC 2616 - it clarified it _thankfully_. [§3.2](https://tools.ietf.org/html/rfc7230#section-3.2) uses the token VCHAR to specify the allowable field-contents; VCHAR is defined in [§1.2](https://tools.ietf.org/html/rfc7230#section-1.2) as any visible US-ASCII character. This clarified token removes the need to spend time culling out non-visible characters like RFC 2616 did, but **does not expand** the 1999/1982 definition to include 128-255. The OP's question is "which characters are allowed in HTTP header values and why". I have answered that, with references. – Geek Stocks Jan 08 '18 at 02:21
  • 1
    @C.M. - yes, see https://www.rfc-editor.org/info/rfc2616 – Julian Reschke Jan 08 '18 at 14:16
  • Does this answer your question? [What character encoding should I use for a HTTP header?](https://stackoverflow.com/questions/4400678/what-character-encoding-should-i-use-for-a-http-header) – miken32 Dec 09 '20 at 23:34
  • @miken32 No, not really -- I don't care how an HTTP header is encoded; my question was about the characters allowed in a certain part of an HTTP header. – C.M. Dec 10 '20 at 01:47
  • I am trying to accept a dynamic header value in Python Flask. With SQL, a user-supplied value can be abused so that an attacker adds new clauses to the query; for example, with `SELECT admin from users where username=input_value;` an attacker can set input_value to `'; SELECT * from users WHERE name='hacker' --`. Can the same thing happen with a header value, i.e. can an attacker set a new header key and value just by supplying a dynamic value for one specific key? Thanks. – Mahmoud Magdy Aug 14 '23 at 07:49
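
On the header-injection question in the last comment: the usual concern is a CR or LF inside the value, which would start a new header line. Below is a hedged Python sketch of a check based on the RFC 7230 §3.2 `field-value` grammar (VCHAR, obs-text, and SP/HTAB between visible characters), with obsolete line folding deliberately rejected. The names `_FIELD_VALUE` and `is_safe_field_value` are hypothetical, and this is not how any particular framework validates headers:

```python
import re

# field-vchar = VCHAR (0x21-0x7E) / obs-text (0x80-0xFF);
# SP and HTAB may appear only between two field-vchars.
# obs-fold (CRLF followed by whitespace) is deliberately not accepted.
_FIELD_VALUE = re.compile(
    rb"[\x21-\x7e\x80-\xff](?:[ \t\x21-\x7e\x80-\xff]*[\x21-\x7e\x80-\xff])?"
)


def is_safe_field_value(value: bytes) -> bool:
    """Return True if value has no CR, LF, NUL or other control octets."""
    return value == b"" or _FIELD_VALUE.fullmatch(value) is not None


if __name__ == "__main__":
    print(is_safe_field_value(b"hacker"))                   # plain token
    print(is_safe_field_value(b"x\r\nSet-Cookie: a=b"))     # injection attempt
```

A value that fails this check should be rejected or encoded before it is placed in a response header.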

3 Answers

18

RFC 2616 is obsolete; the relevant part has been replaced by RFC 7230.

The NUL octet is no longer allowed in comment and quoted-string text, and handling of backslash-escaping in them has been clarified. The quoted-pair rule no longer allows escaping control characters other than HTAB. Non-US-ASCII content in header fields and the reason phrase has been obsoleted and made opaque (the TEXT rule was removed). (Section 3.2.6)

In essence, RFC 2616 defaulted to ISO-8859-1, and this was both insufficient and not interoperable anyway. Thus, RFC 7230 has deprecated non-ASCII octets in field values. The recommendation is to use an escaping mechanism on top of that (such as defined in RFC 8187, or plain URI-percent-encoding).
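
As an illustration of the two escaping options, here is a minimal Python sketch using only the standard library. The function names are hypothetical; the `UTF-8''...` form follows the `ext-value` syntax of RFC 8187, which is meant for header parameters, and whether a given header accepts either form depends on that header's own definition:

```python
from urllib.parse import quote, unquote


def encode_header_value(text: str) -> str:
    """Percent-encode text as UTF-8 so the result is pure ASCII."""
    return quote(text, safe="")


def decode_header_value(value: str) -> str:
    """Reverse encode_header_value on the receiving side."""
    return unquote(value)


def rfc8187_ext_value(text: str) -> str:
    """Build an RFC 8187 ext-value: charset, empty language tag, pct-encoded data."""
    # attr-char per RFC 8187: ALPHA / DIGIT plus the punctuation listed here.
    return "UTF-8''" + quote(text, safe="!#$&+-.^_`|~")


if __name__ == "__main__":
    print(encode_header_value("naïve café"))
    print(rfc8187_ext_value("£ rates"))
```

Both sides have to agree on the scheme out of band, since nothing in the header itself signals that a plain percent-encoded value is encoded.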

Community
Julian Reschke
  • 35
    Is RFC 2616 obsolete? Yes. Does that answer the OP's question of "which characters are allowed in HTTP header values and why"? No. – Geek Stocks Jan 08 '18 at 02:46
  • 4
    Non-ASCII characters are deprecated. You can send them, but there's no guarantee that the recipient will do what you expect it to. That's what the spec says, and that's the answer :-) – Julian Reschke Jan 08 '18 at 14:15
  • 2
    @JulianReschke I finally got around to reading RFC 7230. I don't see any "obsoletion" of non-US-ASCII content in [p3.2.6](https://www.greenbytes.de/tech/webdav/rfc7230.html#rfc.section.3.2.6) -- it seems to allow any `0x80-0xFF` char in `quoted-string`. The `0x00-0x7F` range got decimated though. I.e. according to this standard you can pass utf-8 data in a header value as long as you escape the "forbidden" part of the `0x00-0x7F` range. Am I wrong? – C.M. Jan 14 '18 at 00:18
  • `field-name` can contain `'` too... I guess this special case will have to remain in my code if I care about MS webservers. – C.M. Jan 14 '18 at 00:28
  • 1
    "As a convention, ABNF rule names prefixed with "obs-" denote "obsolete" grammar rules that appear for historical reasons." - https://www.greenbytes.de/tech/webdav/rfc7230.html#rfc.section.1.2.p.3 – Julian Reschke Jan 14 '18 at 07:05
  • @JulianReschke Ah, I see... So, using the `0x80-0xFF` range is OK now (but discouraged), and will be removed in future HTTP versions? I.e. back to my original question -- with HTTP/1.1 it should be OK to pass utf-8 encoded data as header values (enclosed in double quotes and backslash-escaped) as long as it doesn't contain forbidden values from the `0x00-0x7F` range? – C.M. Jan 15 '18 at 23:29
  • @JulianReschke ... i.e. what meaning did the standard's authors put into "obsolete"? – C.M. Jan 16 '18 at 05:36
  • Note that RFC 9110, which obsoletes RFC 7230, has an updated discussion on this which I found helpful: https://www.rfc-editor.org/rfc/rfc9110.html#section-5.5 – recvfrom Dec 28 '22 at 16:18
6

For all the people like me who came here for the title "what characters are allowed in HTTP header values?"

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~
!#$&'()*+,/:;=?@[]
%20,%21,%22,%23,%24,%25,%26,%27,%28,%29,%2A,%2B,%2C,%2F,%3A,%3B,%3D,%3F,%40,%5B,%5D

from @C.M.'s comment mentioning wikipedia's url encoding

  • RFC 3986 section 2.2 Reserved Characters (January 2005)

!#$&'()*+,/:;=?@[]

  • RFC 3986 section 2.3 Unreserved Characters (January 2005)

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~

  • Reserved characters after percent-encoding ( ␣ == " " )
␣ ! " # $ % & ' ( ) * + , / : ; = ? @ [ ]
%20 %21 %22 %23 %24 %25 %26 %27 %28 %29 %2A %2B %2C %2F %3A %3B %3D %3F %40 %5B %5D
  • and a helpful json for our python code:
[
    ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '-', '_', '.', '~'], 
    ['!', '#', '$', '&', "'", '(', ')', '*', '+', ',', '/', ':', ';', '=', '?', '@', '[', ']'], 
    {' ': '%20', '!': '%21', '"': '%22', '#': '%23', '$': '%24', '%': '%25', '&': '%26', "'": '%27', '(': '%28', ')': '%29', '*': '%2A', '+': '%2B', ',': '%2C', '/': '%2F', ':': '%3A', ';': '%3B', '=': '%3D', '?': '%3F', '@': '%40', '[': '%5B', ']': '%5D'}
]
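
If you would rather not hard-code these lists, the same three structures can be rebuilt with the Python standard library. This is a sketch under the assumption that `urllib.parse.quote` with `safe=""` is an acceptable encoder for your use; the variable names are my own:

```python
import string
from urllib.parse import quote

# RFC 3986 section 2.3 unreserved characters
unreserved = list(
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_.~"
)

# RFC 3986 section 2.2 reserved characters
reserved = list("!#$&'()*+,/:;=?@[]")

# Percent-encoded form of space, '"', '%' and the reserved characters
encoded = {ch: quote(ch, safe="") for ch in " !\"#$%&'()*+,/:;=?@[]"}

if __name__ == "__main__":
    print(encoded)
```

Passing `safe=""` matters here: by default `quote` leaves `/` unencoded.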

I believe Julian Reschke answered the OP's "why"

Daniel Olson
-3

It looks as if there is an error in the HTTP/1.1 specs. As you pointed out, §4.2 describes the field content as OCTET:

field-content = the OCTETs making up the field-value

And OCTET is defined in §2.2 as:

OCTET = any 8-bit sequence of data

These lines are the basis of your conclusion that octets > 127 should be allowed, and certainly I see how you have drawn that conclusion. The mention of OCTET in §4.2 is the misleading error; it should be CHAR.

If you read §4.2 (Message Headers) from the beginning, you will note the following guidance:

HTTP header fields...follow the same generic format as that given in Section 3.1 of RFC 822

If we do as instructed and go to RFC 822, specifically §3.1.2 (Structure of header fields), we learn the following:

The field-name must be composed of printable ASCII characters (i.e., characters that have values between 33. and 126., decimal, except colon). The field-body may be composed of any ASCII characters, except CR or LF.

So while HTTP/1.1 was written in 1999, they used a definition from 1982 to describe the field contents. In 1982, characters 0-127 were called "ASCII" and 128-255 were called "Extended ASCII". Now, in this answer I am not going to get involved in the food fight that gets evoked when using the term "Extended ASCII". I will simply point you to §3.3 of RFC 822 for the definition of what was then considered "any ASCII character":

CHAR = any ASCII character ( Octal: 0-177, Decimal: 0.-127.)

And so there you have it - the smoking gun. "ASCII" stopped at 127 in 1982. The written paragraph portion of RFC 2616 §4.2 points you in the right direction, and the unfortunate later misuse of the token OCTET in that same section led you down this rabbit hole.

Geek Stocks
  • 3
    That interpretation is wrong, see specifically . – Julian Reschke Jan 07 '18 at 15:41
  • I agree with Geek Stocks that there is a deficiency or misrepresentation in RFC 2616. But @JulianReschke seems to be correct too -- it seems there was a conscious attempt to include non-ASCII characters. I guess the standard was written by multiple people with different views on the subject matter? – C.M. Jan 08 '18 at 00:23
  • @JulianReschke It is **not** wrong. (1) In 1999 RFC 2616 defined the contents as decimal 0...127 and culling out the non-visible in that range — that is indisputable and I have shown that. (2) RFC 7230 **did not expand** the allowable characters to include non-visible ASCII > 127. Your link is just a copy of RFC 2616. – Geek Stocks Jan 08 '18 at 02:34
  • 4
    @GeekStocks - you are drawing an incorrect conclusion. RFC 2616 indeed allowed non-ASCII characters. RFC 7230 has deprecated them due to the reasons I mentioned (and I should know, I'm one of the authors). "follows the format" is an explanation where the format originated from; it's not a normative reference. – Julian Reschke Jan 08 '18 at 14:21
  • 2
    @JulianReschke - this is truly getting laughable. Let's see if I can put a fork in this. The OP states "...servers I tried refuse to take anything with code > 127". Your own link to RFC 2616 §2.2 shows **why** the OP can't send 128...255. It states _The US-ASCII coded character set is defined by ANSI X3.4-1986 [footnote 21]_ . Go to footnote 21. It is a citation to **7-bit American Standard Code**. Now, tell me how you get a number >127 with only 7 bits? _(drops the mic)_ ;-p – Geek Stocks Jan 08 '18 at 15:17
  • 5
    In RFC 2616, the ABNF for "TEXT" is "<any OCTET except CTLs, but including LWS>". OCTET is defined as "<any 8-bit sequence of data>". In addition to that, RFC 2616 very clearly says: "Words of \*TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14]." - so characters from ISO-8859-1 (which is a superset of US-ASCII) *can* be used in TEXT. I think that's pretty clear. The reference to US-ASCII applies to the ABNF rules that say "US-ASCII", not to OCTET. – Julian Reschke Jan 08 '18 at 15:22
  • 7
    I just want to know what the valid chars are in an HTTP header value. Untangling a moving target of self-referential ABNF docs is not productive. We need a working reference implementation with unit tests to clarify edge cases and fix this hell. Why are there so many different ways of handling metadata? Is this to create job security for web developers or to create new attack surfaces for security companies to fix? – Systemsplanet Feb 05 '20 at 06:40