94

RFC 2617 says to base64-encode the username and password, but doesn't say which character encoding to use when creating the octets that are fed into the base64 algorithm.

Should I assume US-ASCII or UTF-8? Or has someone settled this question somewhere already?

Dobes Vandermeer
  • related: [HTTP header should use what character encoding?](http://stackoverflow.com/questions/4400678/http-header-should-use-what-character-encoding) – Hawkeye Parker Nov 12 '14 at 07:40

4 Answers

89

Original spec - RFC 2617

RFC 2617 can be read as "ISO-8859-1" or "undefined". Your choice. It's known that many servers use ISO-8859-1 (like it or not) and will fail when you send something else. So probably the only safe choice is to stick to ASCII.
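
Why ASCII is the safe choice can be checked directly: for ASCII-only credentials, ISO-8859-1 and UTF-8 produce identical octets, so it does not matter which one the server assumes. The following is a minimal sketch; the helper name `encodings_agree` is made up for illustration.

```python
def encodings_agree(user_pass: str) -> bool:
    """Return True if ISO-8859-1 and UTF-8 yield the same octets for user_pass."""
    try:
        octets = {user_pass.encode(name) for name in ("iso-8859-1", "utf-8")}
    except UnicodeEncodeError:
        # The text cannot be represented in ISO-8859-1 at all.
        return False
    return len(octets) == 1


print(encodings_agree("Aladdin:open sesame"))  # ASCII only
print(encodings_agree("user:p\u00e4ssword"))   # contains U+00E4 (a with diaeresis)
```

Only the first call reports agreement; any non-ASCII character makes the octets depend on the chosen encoding.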

For more information and a proposal to fix the situation, see the draft "An Encoding Parameter for HTTP Basic Authentication" (which formed the basis for RFC 7617).

New - RFC 7617

Since 2015 there is RFC 7617, which obsoletes RFC 2617. In contrast to the old RFC, the new RFC explicitly addresses the character encoding to be used for username and password, although it stops short of fixing a default.

  • The default encoding is still undefined. It is only required to be compatible with US-ASCII (meaning it maps ASCII characters to the same bytes as US-ASCII, like UTF-8 does).
  • The server can optionally send an additional authentication parameter charset="UTF-8" in its challenge, like this:
    WWW-Authenticate: Basic realm="myChosenRealm", charset="UTF-8"
    This announces that the server will accept non-ASCII characters in username / password, and that it expects them to be encoded in UTF-8 (specifically Normalization Form C). Note that only UTF-8 is allowed.
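
On the client side, the rules above could be sketched as follows for the case where the server has sent charset="UTF-8". This is an illustration under stated assumptions, not a complete implementation: the function name `basic_authorization` is hypothetical, and only NFC normalization is applied, not the full preparation procedure the spec describes.

```python
import base64
import unicodedata


def basic_authorization(user_id: str, password: str) -> str:
    """Build an Authorization header value, assuming charset="UTF-8" was announced."""
    # Normalize both parts to Normalization Form C before encoding.
    user_id = unicodedata.normalize("NFC", user_id)
    password = unicodedata.normalize("NFC", password)
    if ":" in user_id:
        # The first colon separates user-id and password, so the user-id cannot contain one.
        raise ValueError("user-id must not contain a colon")
    user_pass = f"{user_id}:{password}"
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in user_pass):
        raise ValueError("control characters are not allowed")
    token = base64.b64encode(user_pass.encode("utf-8")).decode("ascii")
    return f"Basic {token}"


# Example from RFC 7617: user-id "test", password "123" followed by a pound sign (U+00A3).
print(basic_authorization("test", "123\u00a3"))  # Basic dGVzdDoxMjPCow==
```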

Complete version:

Read the spec. It contains additional details, such as the exact encoding procedure, and the list of Unicode codepoints that should be supported.

Browser support

As of 2018, modern browsers will usually default to UTF-8 if a user enters non-ASCII characters for username or password (even if the server does not use the charset parameter).

  • Chrome also appears to use UTF-8
  • Internet Explorer does not use UTF-8 (issue #11879588)
  • Firefox is experimenting with a change currently planned for v59 (bug 1419658)

Realm

The realm parameter still only supports ASCII characters even in RFC 7617.

Community
Julian Reschke
  • Thanks Julian. I had run into that proposal, but it seems to have expired and not gone anywhere further. Too bad :-(. – Dobes Vandermeer Sep 01 '11 at 04:40
  • Your answer must be the best. I can paraphrase it as ASCII for sure, maybe ISO-8859-1 if you are lucky. – Dobes Vandermeer Sep 02 '11 at 13:34
  • It looks like the [latest version 04 of the proposal](http://tools.ietf.org/html/draft-reschke-basicauth-enc-04) (which coincidentally seems to be published today) expires on August 1, 2012. – Michiel van Oosterhout Jan 29 '12 at 21:33
  • The answer was obsolete, as it did not mention RFC 7617. I edited to include this. Julian: Hope you don't mind. – sleske Jan 19 '18 at 14:01
  • Oops - I just realized you are actually the author of RFC 7617. Now I really hope I did not mis-edit something. – sleske Jan 19 '18 at 14:25
  • RFC 7617 says: "*The 'realm' parameter carries data that can be considered textual; however, [RFC7235] does not define a way to reliably transport non- US-ASCII characters. This is a known issue that would need to be addressed in a revision to that specification.*" But the `realm` is a `quoted-string`, and the definition of `quoted-string` in RFC 7230 used by 7235 allows for octets up to 0xFF, so one would think UTF-8 can be used. – Remy Lebeau Feb 27 '19 at 17:17
  • @RemyLebeau - nope, non-ASCII characters in quoted-string are discouraged and have no agreed upon character encoding – Julian Reschke Feb 28 '19 at 09:00
  • @JulianReschke yes I realize that – Remy Lebeau Feb 28 '19 at 16:24
  • I'm reading a contradiction in this answer w.r.t RFC 7617: it says "the new RFC explicitly defines the character encoding to be used for username and password" but then it says "The default encoding is still undefined" - which means _it isn't_ explicitly defined... – Dai Aug 15 '21 at 17:40
  • Updated by [RFC 9110](https://www.rfc-editor.org/rfc/rfc9110#name-establishing-a-protection-s). The values of `auth-param`, including `realm`, are still within the ASCII range. – Константин Ван Dec 02 '22 at 18:11
40

Short answer: iso-8859-1 unless encoded-words are used in accordance with RFC2047 (MIME).

Longer explanation:

RFC2617, section 2 (HTTP Authentication) defines basic-credentials:

basic-credentials = base64-user-pass
base64-user-pass  = <base64 encoding of user-pass, 
                     except not limited to 76 char/line>
user-pass         = userid ":" password
userid            = *<TEXT excluding ":">
password          = *TEXT

The spec should not be read without referring to RFC2616 (HTTP 1.1) for definitions in BNF (like the one above):

This specification is a companion to the HTTP/1.1 specification [2]. It uses the augmented BNF section 2.1 of that document, and relies on both the non-terminals defined in that document and other aspects of the HTTP/1.1 specification.

RFC2616, section 2.1 defines TEXT (emphasis mine):

The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 only when encoded according to the rules of RFC 2047.

TEXT           = <any OCTET except CTLs, but including LWS>

So it's definitely iso-8859-1 unless you detect some other encoding according to RFC2047 (MIME pt. 3) rules:

// Username: Mike
// Password: T€ST
Mike:=?iso-8859-15?q?T€ST?=

In this case the euro sign in the word would be encoded as 0xA4 according to iso-8859-15. It is my understanding that you should check for these encoded word delimiters, and then decode the words inside based on the specified encoding. If you don't, you will think the password is =?iso-8859-15?q?T¤ST?= (notice that 0xA4 would be decoded to ¤ when interpreted as iso-8859-1).
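
The difference between the two interpretations of the byte 0xA4 can be reproduced in a couple of lines. This sketch only decodes the raw password octets and does not parse encoded-word delimiters.

```python
# The password octets as they would appear inside the encoded word.
raw = b"T\xa4ST"

as_iso_8859_15 = raw.decode("iso-8859-15")  # 0xA4 is the euro sign here
as_iso_8859_1 = raw.decode("iso-8859-1")    # 0xA4 is the generic currency sign here

print(as_iso_8859_15)
print(as_iso_8859_1)
```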

This is my understanding; I can't find more explicit confirmation than these RFCs, and some of it seems contradictory. For example, one of the 4 stated goals of RFC2047 (MIME, pt. 3) is to redefine:

the format of messages to allow for ... textual header information in character sets other than US-ASCII.

But then RFC2616 (HTTP 1.1) defines a header using the TEXT rule which defaults to iso-8859-1. Does that mean that every word in this header should be an encoded-word (i.e. the =?...?= form)?

Also relevant, no current browser does this. They use utf-8 (Chrome, Opera), iso-8859-1 (Safari), the system code page (IE) or something else (like only the most significant bit from utf-8 in the case of Firefox).
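
Given that mix of client behaviours, a server might try strict UTF-8 first and fall back to ISO-8859-1. This is a heuristic that none of the RFCs above prescribe, and the function name `parse_basic` is made up for the sketch. It is also ambiguous in rare cases, because some ISO-8859-1 byte sequences happen to be valid UTF-8.

```python
import base64
from typing import Tuple


def parse_basic(header_value: str) -> Tuple[str, str]:
    """Decode an Authorization header value into (user-id, password)."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic credential")
    raw = base64.b64decode(token.strip(), validate=True)
    try:
        # Strict decoding: raises if the octets are not well-formed UTF-8.
        user_pass = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this cannot fail.
        user_pass = raw.decode("iso-8859-1")
    user_id, sep, password = user_pass.partition(":")
    if not sep:
        raise ValueError("missing ':' separator")
    return user_id, password


print(parse_basic("Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="))  # ('Aladdin', 'open sesame')
```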

Edit: I just realized this answer looks at the issue more from the server-side perspective.

Community
Michiel van Oosterhout
  • RFC 2047 encoding doesn't apply in this case. – Julian Reschke Jan 30 '12 at 10:14
  • @JulianReschke Well, the spec clearly states "only when encoded according to the rules of RFC 2047". I understand the rules in RFC2047 may not be applicable to HTTP headers, but the spec is pretty clear in referring to it. I have added the fact that no browser actually does this. – Michiel van Oosterhout Jan 30 '12 at 11:40
  • the HTTPbis specs will not mention RFC 2047 anymore. – Julian Reschke Jan 30 '12 at 16:49
  • Very detailed write-up, thanks @MichielvanOosterhout! – ToastyMallows Sep 01 '16 at 21:16
  • **RFC 7617 updated** the definitions of the `user-id` and `password`. It does not allow `LWS` (linear whitespace) in them anymore. **All control characters are forbidden in them.** “_The `user-id` and `password` MUST NOT contain any control characters (see ‘`CTL`’ in Appendix B.1 of [RFC5234])._” – Константин Ван Dec 02 '22 at 18:18
4

If you are interested in what browsers do when you enter non-ASCII characters at the login prompt, I just tried with Firefox.

It seems to lazily convert everything to ISO-8859-1 by taking the least significant byte of each Unicode value, e.g.:

User: 豚 (\u8c5a)
Password: 虎 (\u864e)

Are encoded the same as:

User: Z (\u005a)
Password: N (\u004e)

0x5a 0x3a 0x4e base64-> WjpO
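
The observed behaviour can be imitated by masking each code point down to its low byte. This is only a sketch that reproduces the example above, not Firefox's actual code, and the helper name `low_byte_truncate` is made up.

```python
import base64


def low_byte_truncate(text: str) -> bytes:
    """Keep only the least significant byte of each code point."""
    return bytes(ord(ch) & 0xFF for ch in text)


# U+8C5A as the user and U+864E as the password, joined by a colon.
octets = low_byte_truncate("\u8c5a:\u864e")
token = base64.b64encode(octets).decode("ascii")
print(octets, token)  # b'Z:N' WjpO
```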

4

RFCs aside, the BasicAuthenticationFilter class in the Spring framework defaults to UTF-8.

I believe the reason for this choice is that UTF-8 can encode all possible characters, while ISO-8859-1 (or ASCII) cannot. Using a username or password with characters that the system does not support can lead to broken behaviour or (maybe worse) degraded security.

holmis83
  • Well, using UTF-8 doesn't help if the other side doesn't know about it. So it would be good if the Spring framework implemented the charset parameter described in – Julian Reschke Feb 13 '17 at 11:39
  • @JulianReschke I informed how it is implemented in one of the most common frameworks and a likely reason for it. Don't shoot the messenger! – holmis83 Feb 14 '17 at 09:27