88

Are there any other characters except A-Za-z0-9 that can be used to shorten links without getting into trouble? :)

I was thinking about +,;- or something.

Is there a defined standard regarding what characters can be used in a URL that browser vendors respect?

d-_-b
Florian F

2 Answers

138

A path segment (the part of a path between / separators) in an absolute URI path consists of zero or more pchar, which RFC 3986 defines as follows:

  pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"
  pct-encoded = "%" HEXDIG HEXDIG
  unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
  sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

So it’s basically A-Z, a-z, 0-9, -, ., _, ~, !, $, &, ', (, ), *, +, ,, ;, =, :, @, as well as % that must be followed by two hexadecimal digits. Any other character/byte needs to be encoded using percent-encoding.
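As a minimal sketch, the pchar grammar above can be turned into a regular expression that checks a single path segment. The name `is_valid_segment` is made up for this illustration, and the check covers only one segment, so a slash is rejected.

```python
import re

# One pchar per RFC 3986: unreserved / pct-encoded / sub-delims / ":" / "@"
_PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"

# A segment is zero or more pchar (hypothetical helper, not from the RFC text).
_SEGMENT_RE = re.compile(rf"{_PCHAR}*")


def is_valid_segment(segment: str) -> bool:
    """Return True if the whole string consists only of pchar productions."""
    return _SEGMENT_RE.fullmatch(segment) is not None
```

For example, `is_valid_segment("a+b;c=d")` passes, while `"a b"`, `"a/b"`, and a dangling `"%7"` are rejected.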

Although that makes 79 characters that can appear literally in a path segment, some user agents percent-encode some of them anyway (e.g. %7E instead of ~). That’s why many use just the 62 alphanumeric characters (i.e. A-Z, a-z, 0-9) or the Base 64 Encoding with URL and Filename Safe Alphabet (i.e. A-Z, a-z, 0-9, -, _).
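To illustrate the conservative choice, here is a rough sketch of turning a numeric ID into a short code using only the 62 alphanumeric characters. The names `ALPHABET`, `encode_id`, and `decode_id` are assumptions for this example, as is the idea that the shortener stores links under integer IDs.

```python
import string

# The 62 alphanumeric characters; the ordering is an arbitrary choice for this sketch.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def encode_id(n: int, alphabet: str = ALPHABET) -> str:
    """Encode a non-negative integer as a string in base len(alphabet)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return alphabet[0]
    base = len(alphabet)
    digits = []
    while n:
        n, rem = divmod(n, base)
        digits.append(alphabet[rem])
    # Digits were produced least significant first, so reverse them.
    return "".join(reversed(digits))


def decode_id(code: str, alphabet: str = ALPHABET) -> int:
    """Inverse of encode_id; raises ValueError on characters outside the alphabet."""
    base = len(alphabet)
    n = 0
    for ch in code:
        n = n * base + alphabet.index(ch)
    return n
```

Passing `ALPHABET + "-_"` as the alphabet gives 64 symbols drawn from the URL and filename safe character set mentioned above, which shortens codes slightly at the cost of two non-alphanumeric characters.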

Steffo
Gumbo
  • @Joey: Not in a path segment as it’s the path segment delimiter. – Gumbo Jan 12 '11 at 14:19
  • Ok, I was kinda assuming the OP was talking about the whole path of a URI, not only a single segment. At least, URI shorteners usually work in the way of `http://domain.foo/` where it doesn't need to be restricted to a single segment. – Joey Jan 12 '11 at 14:24
  • So this means the path part of a URI can contain `&`, right? But that symbol is usually used as the parameter delimiter in the query part of a URI. – 23W Apr 21 '21 at 14:20
  • @23W The query part of a URL *must* be introduced with `?`. Therefore, there is no ambiguity between `&` in the path and `&` in the query string. – ErikE Jun 26 '23 at 23:51
46

According to RFC 3986 the valid characters for the path component are:

a-z A-Z 0-9 . - _ ~ ! $ & ' ( ) * + , ; = : @

as well as percent-encoded characters and, of course, the slash /.

Keep in mind, though, that many applications (not necessarily browsers) that try to detect URIs in text, for example to make them clickable, may support a much smaller set of characters. This is akin to parsing e-mail addresses, where most attempts also don't accept all addresses allowed by the standard.
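For characters outside the list above, percent-encoding is the way to get them into a path. A small sketch using Python's standard `urllib.parse.quote`: the constant `PCHAR_SAFE` and the helper `encode_segment` are names invented here, and the choice to leave every sub-delimiter literal is an assumption that may be too permissive for the stricter parsers mentioned above.

```python
from urllib.parse import quote

# Assumption: leave sub-delims, ":" and "@" unencoded. quote() already leaves
# letters, digits and "_.-~" alone (Python 3.7+), so they need not be listed.
PCHAR_SAFE = "!$&'()*+,;=:@"


def encode_segment(text: str) -> str:
    """Percent-encode everything that is not allowed literally in one path segment."""
    # "/" is deliberately not in the safe set, so it is encoded as %2F.
    return quote(text, safe=PCHAR_SAFE)
```

With this, `encode_segment("a b/ü")` yields `a%20b%2F%C3%BC`, while a string such as `x+y,z;k=v` passes through unchanged.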

Steffo
Joey
  • Sorry - where are you referencing this, specifically in the spec? https://tools.ietf.org/html/rfc3986#page-22 - I don't see any call-outs for character constraints on the path or segments. – Jmoney38 Jul 03 '19 at 21:06
  • @Jmoney38: See the definition of `pchar`. – Joey Jul 04 '19 at 19:36