250

Does anyone know the full list of characters that can be used within a GET without being encoded? At the moment I am using A-Z a-z and 0-9... but I am looking to find out the full list.

I am also interested in whether a specification has been released for the upcoming addition of Chinese and Arabic URLs (as obviously that will have a big impact on my question).

Mark
  • 9
    The characters allowed in a URI are either reserved `!*'();:@&=+$,/?#[]` or unreserved `A-Za-z0-9_.~-` (or a percent character `%` as part of a percent-encoding) – Mikl May 30 '16 at 16:42
  • 2
    In MySQL I use this `REGEXP '[^]A-Za-z0-9_.~!*''();:@&=+$,/?#[%-]+'` to find URL strings with bad characters. Maybe it’s useful for someone else, too. – Mikl May 30 '16 at 16:47
  • 1
    @Mikl: That thing hardly looks like a regular expression. – Jens Mander Nov 19 '19 at 18:31

10 Answers

211

EDIT: As @Jukka K. Korpela correctly points out, RFC 1738 was updated by RFC 3986, which expanded and clarified the characters valid for host. Unfortunately the grammar is not easily copied and pasted, but I'll do my best.

In first-match-wins order:

host        = IP-literal / IPv4address / reg-name

IP-literal  = "[" ( IPv6address / IPvFuture  ) "]"

IPvFuture   = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

IPv6address =         6( h16 ":" ) ls32
                  /                       "::" 5( h16 ":" ) ls32
                  / [               h16 ] "::" 4( h16 ":" ) ls32
                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                  / [ *4( h16 ":" ) h16 ] "::"              ls32
                  / [ *5( h16 ":" ) h16 ] "::"              h16
                  / [ *6( h16 ":" ) h16 ] "::"

ls32        = ( h16 ":" h16 ) / IPv4address
                  ; least-significant 32 bits of address

h16         = 1*4HEXDIG 
               ; 16 bits of address represented in hexadecimal

IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet

dec-octet   = DIGIT                 ; 0-9
              / %x31-39 DIGIT         ; 10-99
              / "1" 2DIGIT            ; 100-199
              / "2" %x30-34 DIGIT     ; 200-249
              / "25" %x30-35          ; 250-255

reg-name    = *( unreserved / pct-encoded / sub-delims )

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"     <---This seems like a practical shortcut, most closely resembling original answer

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

pct-encoded = "%" HEXDIG HEXDIG
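
As a rough illustration of how the grammar above reads in practice, here is a minimal Python sketch that translates the `dec-octet`, `IPv4address` and `reg-name` rules into regular expressions and tries them in first-match-wins order. The function name `classify_host` is made up for this example, and the contents of an IP-literal are deliberately not validated.

```python
import re

# dec-octet = DIGIT / %x31-39 DIGIT / "1" 2DIGIT / "2" %x30-34 DIGIT / "25" %x30-35
# (alternatives listed longest first)
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4_ADDRESS = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

# reg-name = *( unreserved / pct-encoded / sub-delims )
REG_NAME = re.compile(r"(?:[A-Za-z0-9\-._~]|%[0-9A-Fa-f]{2}|[!$&'()*+,;=])*")


def classify_host(host: str) -> str:
    """Return the first host alternative that matches (hypothetical helper)."""
    if host.startswith("[") and host.endswith("]"):
        # Assumption: the bracketed IPv6address / IPvFuture part is not checked here.
        return "IP-literal"
    if IPV4_ADDRESS.fullmatch(host):
        return "IPv4address"
    if REG_NAME.fullmatch(host):
        return "reg-name"
    return "invalid"
```

With this sketch, `192.168.0.1` is classified as an IPv4address, while `256.1.1.1` fails the `dec-octet` rule and falls through to reg-name, because digits and dots are all unreserved characters.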

Original answer from RFC 1738 specification:

Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.

^ obsolete since 1998.

Alex R
Myles
  • 6
    @Tim slash is a reserved character, therefore, if it is being used for its reserved purpose (delineating paths, protocol delineation...), then it does not need escaping. Otherwise, it does. – Myles Jul 06 '12 at 22:26
  • 4
    Generic syntax rules of RFC 1738 were obsoleted in 1998. – Jukka K. Korpela Mar 08 '13 at 07:17
  • 1
    @JukkaK.Korpela Do you have the correct RFC to refer to then so this answer can be improved? – Myles Mar 08 '13 at 14:56
  • 3
    @Myles, STD 66 (= RFC 3986) is mentioned in other answers. Whether the content of answers is correct is a different issue; I don’t think any of the answers correctly describes the full list. – Jukka K. Korpela Mar 08 '13 at 15:05
  • This was useful to me when I was poking around in a similar area - the reference for character encodings http://www.w3schools.com/tags/ref_urlencode.asp – JonnyRaa Dec 16 '14 at 15:12
  • Only American English alphanumerics, I presume. – 2540625 Nov 22 '15 at 15:50
  • @janaspage that is true, but if non-ASCII is necessary, then you can URL-encode the binary values (as hex) and do your own decoding on the otherside. – Myles Nov 28 '15 at 18:31
  • I think, you should add % character in the beginning of this answer. Because `The characters allowed in a URI are either reserved or unreserved (or a percent character as part of a percent-encoding)` – Mikl May 30 '16 at 16:08
  • 7
    And you can add list of unreserved `A-Za-z0-9_.-~` and reserved characters in the beginning of this answer. `!*'();:@&=+$,/?#[]` It can save time for people – Mikl May 30 '16 at 16:20
  • This answer is confusing... Simple question: Can I use an `@` in the URL? – basZero Aug 31 '16 at 15:13
  • 3
    @basZero I'm sorry you found it confusing, but the full answer is not simple. The answer to your question is no, as it is a reserved character as stated by : `reserved = gen-delims / sub-delims gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"` – Myles Aug 31 '16 at 22:13
  • 2
    @basZero Actually, that's not correct. "Reserved" does not mean "cannot be used" -- slash and colon are also reserved, but they are present in almost all URLs! Rather, "reserved" means the character has special significance in some portions of the URL, so it cannot be used arbitrarily. AFAICT, `@` is allowed as a delimiter in the "authority" component (as in `http://fred@example.com/index.html`) and in path segments (a canonical example being `mailto:fred@example.com`, but it looks like `http://example.com/p@th/d@t@.html` should be just as valid). – jcl Dec 01 '16 at 00:48
  • Is `https://web.archive.org/web/20211016000003/http://www.google.com/` an example that violates the standard (note the second `://`), even though it still [works](https://web.archive.org/web/20211016000003/http://www.google.com/) in practice? I find this example interesting because the second part could be viewed as either "just a payload (that should not be parsed and should not use reserved characters)" or as "intentional and knowledgeable use, with a good understanding that this is actually valid and won't break anytime soon". – Igor Oct 23 '21 at 22:50
  • The gen-delims and sub-delims which are not involved in separating the path components from other components, namely, `[ ] ! $ ( ) * + , @` and cautiously ` : =` are all valid for *delimiting* outside the host and protocol. The spec was rewritten to clarify that authors of URL **dereferencing algorithms** need characters apart from resource names to create semantics with. Generally this question comes up because a web developer wishes to create semantic URLs for some application or library. This is not only the intended reserved use, but *why* the standard was clarified to begin with. – That Realty Programmer Guy Oct 09 '22 at 08:49
  • 1
    Side note, the `;` character delimits sub-components in the path. This is also generally for semantics like the above characters, but it specifically creates a sub path at the same level. Eg `example.com/food/bakery/toast` and `example.com/food/bakery;22/toast` should represent the same resource, though it may potentially respond to the presence of other data. Similarly `example.com/food/bakery;22/toast` and `example.com/food/bakery;22;topping:butter/toast` should really pass through the same resource & sub-resource `bakery;22`, though it is left for you to decide if that's the case. – That Realty Programmer Guy Oct 09 '22 at 08:56
46

The characters allowed in a URI are either reserved or unreserved (or a percent character as part of a percent-encoding)

http://en.wikipedia.org/wiki/Percent-encoding#Types_of_URI_characters

says these are RFC 3986 unreserved characters (sec. 2.3) as well as reserved characters (sec 2.2) if they need to retain their special meaning. And also a percent character as part of a percent-encoding.

Community
Amber
  • @j.a.estevan Citation from the linked document: `The characters allowed in a URI are either reserved or unreserved (or a percent character as part of a percent-encoding)` – Mikl May 30 '16 at 16:05
37

The full list of the 66 unreserved characters is in RFC3986, here: https://www.rfc-editor.org/rfc/rfc3986#section-2.3

This is any character in the following regex set:

[A-Za-z0-9_.\-~]
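
A small Python sketch of this character class, assuming the standard `re` module. It checks the count of 66 and shows why the hyphen has to be escaped or moved to the edge of the class. The names `UNRESERVED` and `LOOSE` are only for illustration.

```python
import re

# The unreserved set from RFC 3986 section 2.3, with the hyphen escaped.
UNRESERVED = re.compile(r"[A-Za-z0-9_.\-~]")

ascii_chars = [chr(code) for code in range(128)]
matched = [ch for ch in ascii_chars if UNRESERVED.fullmatch(ch)]

# 26 uppercase + 26 lowercase + 10 digits + 4 punctuation marks
print(len(matched))

# Pitfall: an unescaped hyphen between "." and "~" is read as the range
# U+002E..U+007E, so characters such as "/" and "<" slip through.
LOOSE = re.compile(r"[A-Za-z0-9_.-~]")
print(bool(LOOSE.fullmatch("<")), bool(UNRESERVED.fullmatch("<")))
```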
Community
slacy
  • 3
    You can use those reserved too. – Qwerty Mar 21 '13 at 11:53
  • 1
    The obsolete RFC1738 listed `{}^\~` and `backtick` as unsafe. And RFC3986 lists \ as unsafe because of the file system. This means `{}^` could be used as well. – mgutt Feb 16 '17 at 15:22
  • 1
    So if you're trying to, say, find the end of a **url within a string** (which I am), it would be best to go by the obsolete standards in the [accepted answer](https://stackoverflow.com/a/1856809/8112776)... If you're **validating url's** you should use the set of characters on *this* answer. – ashleedawg Jul 14 '18 at 10:17
  • 3
    Careful, you've written this as a regular expression character class. Make sure to escape the `-` or put it at the beginning or end of the character class, because `[.-~]` actually contains all ASCII characters from 46 to 126. – kwl Jan 24 '19 at 07:25
26

I tested it by requesting my website (Apache) with all available chars on my German keyboard as a URL parameter:

http://example.com/?^1234567890ß´qwertzuiopü+asdfghjklöä#<yxcvbnm,.-°!"§$%&/()=? `QWERTZUIOPÜ*ASDFGHJKLÖÄ\'>YXCVBNM;:_²³{[]}\|µ@€~

These were not encoded:

^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ,.-!/()=?`*;:_{}[]\|~

Not encoded after urlencode():

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-_

Not encoded after rawurlencode():

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-_~

Note: Before PHP 5.3.0, rawurlencode() encoded ~ because of RFC 1738. That RFC was replaced by RFC 3986, so ~ is now safe to use. What I do not understand is why, for example, {} are encoded by rawurlencode(), because they are not mentioned in RFC 3986.
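
For comparison outside PHP, here is a hedged Python sketch using `urllib.parse.quote` and `quote_plus`, which are only roughly analogous to rawurlencode() and urlencode(). It assumes Python 3.7 or later, where `~` is treated as unreserved. Unlike the urlencode() result listed above, Python leaves `~` unencoded in both functions. The sample string is arbitrary.

```python
from urllib.parse import quote, quote_plus

sample = "a b~{}+/"

# safe="" means "/" is encoded as well; a space becomes %20 (rawurlencode-like).
raw_style = quote(sample, safe="")

# Form-style encoding: a space becomes "+" (urlencode-like).
form_style = quote_plus(sample, safe="")

print(raw_style)
print(form_style)
```

In both results the braces are percent-encoded: they are not in the unreserved set, so a conservative encoder escapes them.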

An additional test I made was regarding auto-linking in mail texts. I tested Mozilla Thunderbird, aol.com, outlook.com, gmail.com, gmx.de and yahoo.de and they fully linked URLs containing these chars:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-_~+#,%&=*;:@

Of course the ? was linked, too, but only if it was used once.

Some people would now suggest using only the rawurlencode() chars, but have you ever heard of anyone having problems opening these websites?

Asterisk
http://wayback.archive.org/web/*/http://google.com

Colon
https://en.wikipedia.org/wiki/Wikipedia:About

Plus
https://plus.google.com/+google

At sign, Colon, Comma and Exclamation mark
https://www.google.com/maps/place/USA/@36.2218457,...

Because of that, these chars should be usable unencoded without problems. Of course you should not use &; because of encoding sequences like &amp;. The same applies to %, as it is used to encode chars in general, and to =, as it assigns a value to a parameter name.

Finally, I would say it's OK to use these unencoded:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-_~!+,*:@

But if you expect randomly generated URLs you should not use punctuation marks like .!, because some mail apps will not auto-link them:

http://example.com/?foo=bar! < last char not linked

mgutt
  • Practical approach - good job. Was looking for that last list of yours - the `+` sign especially :-D – Oliver Mar 22 '19 at 15:59
  • The "+" character is used as an encoded space character in the query. This is for historical reasons used with form encoded requests. While you can use the "+" character, you can also find using it will lead to unexpected results in some cases. – Rich Remer Jan 05 '23 at 22:33
13

From RFC 1738:

Thus, only alphanumerics, the special characters $-_.+!*'(), and reserved characters used for their reserved purposes may be used unencoded within a URL.

Nicholas Carey
AdaTheDev
13

RFC3986 defines two sets of characters you can use in a URI:

  • Reserved Characters: :/?#[]@!$&'()*+,;=

    reserved = gen-delims / sub-delims

    gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

    sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

    The purpose of reserved characters is to provide a set of delimiting characters that are distinguishable from other data within a URI. URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent.

  • Unreserved Characters: A-Za-z0-9-_.~

    unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

    Characters that are allowed in a URI but do not have a reserved purpose are called unreserved.
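
The two sets above can be turned into a small lookup. This is a minimal Python sketch, not a full URI parser: `classify` and `percent_encode` are hypothetical helper names, and the encoder makes the conservative assumption that everything outside the unreserved set should be escaped, which is stricter than the RFC requires for reserved characters used as delimiters.

```python
import string

GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(ch: str) -> str:
    """Name the RFC 3986 set a single character belongs to."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in GEN_DELIMS:
        return "gen-delim"
    if ch in SUB_DELIMS:
        return "sub-delim"
    # Anything else (including a literal "%") has to appear percent-encoded.
    return "other"


def percent_encode(text: str) -> str:
    """Percent-encode every UTF-8 byte that is not an unreserved character."""
    return "".join(
        chr(byte) if chr(byte) in UNRESERVED else f"%{byte:02X}"
        for byte in text.encode("utf-8")
    )
```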

Cyker
7

These are listed in RFC3986. See the Collected ABNF for URI to see what is allowed where and the regex for parsing/validation.

Community
McDowell
6

This answer discusses which characters may be included inside a URL fragment part without being escaped. I'm posting a separate answer since this part is slightly different from (and can be used in conjunction with) the other excellent answers here.

The fragment part is not sent to the server. It consists of the characters that follow the # in this example:

https://example.com/#STUFF-HERE

Specification

The relevant specifications in RFC 3986 are:

  fragment    = *( pchar / "/" / "?" )
  pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"
  unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
  sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

This also references rules in RFC 2234

  ALPHA       =  %x41-5A / %x61-7A   ; A-Z / a-z
  DIGIT       =  %x30-39             ; 0-9

Result

So the full list, excluding escapes (pct-encoded), is:

A-Z a-z 0-9 - . _ ~ ! $ & ' ( ) * + , ; = : @ / ?

For your convenience here is a PCRE expression that matches a valid, unescaped fragment:

/^[A-Za-z0-9\-._~!$&'()*+,;=:@\/?]*$/
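
The same check can be sketched in Python with the standard `re` module. The helper name `is_unescaped_fragment` is invented for this example.

```python
import re

# Same character class as the PCRE expression above; fullmatch() replaces
# the ^ and $ anchors.
FRAGMENT = re.compile(r"[A-Za-z0-9\-._~!$&'()*+,;=:@/?]*")


def is_unescaped_fragment(text: str) -> bool:
    """True if text contains only characters a fragment may hold unescaped."""
    return FRAGMENT.fullmatch(text) is not None


print(is_unescaped_fragment("STUFF-HERE"))
print(is_unescaped_fragment("two words"))
```

Using `fullmatch` rather than `$` avoids one subtlety: by default `$` also matches just before a trailing newline, so an anchored pattern could accept a string that ends in `\n`.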

Encoding

Counting this up, there are:

26 + 26 + 10 + 19 = 81 code points

You could use base 81 to efficiently encode data here.
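
One possible way to do that, as a minimal Python sketch: treat the bytes as one big integer and write it in base 81 using the characters listed above as digits. The alphabet order, the function names and the `0x01` sentinel byte are all choices made for this example, not part of any standard.

```python
import string

# The 81 characters allowed unescaped in a fragment, in an arbitrary order.
ALPHABET = (
    string.ascii_uppercase
    + string.ascii_lowercase
    + string.digits
    + "-._~!$&'()*+,;=:@/?"
)


def encode(data: bytes) -> str:
    """Encode bytes as a base-81 string that is safe inside a fragment."""
    # A leading 0x01 sentinel keeps leading zero bytes from being lost.
    number = int.from_bytes(b"\x01" + data, "big")
    digits = []
    while number:
        number, remainder = divmod(number, 81)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode(text: str) -> bytes:
    """Reverse encode(); raises ValueError on characters outside ALPHABET."""
    number = 0
    for ch in text:
        number = number * 81 + ALPHABET.index(ch)
    raw = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return raw[1:]  # drop the sentinel byte
```

Big-integer conversion is simple but slow for large payloads. A production encoder would more likely work in fixed-size blocks.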

Community
William Entriken
3

The upcoming change is for Chinese and Arabic domain names, not URIs. Internationalised URIs are called IRIs and are defined in RFC 3987. Having said that, I'd recommend not doing this yourself but relying on an existing, tested library, since there are many choices in URI encoding/decoding, and what is considered safe by the specification can differ from what is safe in actual use (browsers).
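
As an illustration of leaning on a library, here is a short Python sketch that uses only the standard library. The host name and path are made-up examples. Note that Python's built-in `idna` codec implements the older IDNA 2003 rules (RFC 3490), so treat this as a sketch rather than a complete IRI-to-URI conversion.

```python
from urllib.parse import quote

# Internationalised domain name -> ASCII-compatible ("punycode") form.
host = "bücher.example".encode("idna").decode("ascii")

# Non-ASCII path characters -> percent-encoded UTF-8 ("/" is kept by default).
path = quote("/wiki/中文")

print("http://" + host + path)
```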

dajobe
0

If you would like to give users a special kind of experience, you could use pushState to bring a wide range of characters into the browser's URL:

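// Build a string of 250 consecutive characters starting at code unit 250 * tt.
var tt = 168;
var u = "";
for (var i = 0; i < 250; i++) {
  var x = i + 250 * tt;
  console.log(x);
  u += String.fromCharCode(x);
}
// Push a relative URL: the numeric start offset followed by the characters.
history.pushState({}, "", 250 * tt + u);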
Grim