
I'm trying to remove the non-URL parts of a big string. Most of the regexes I found look like [A-Za-z0-9-_.!~*'()], but a URL can contain more characters than that, as in http://127.0.0.1:8080/test?v=123#this for example.

So what is the current set of characters allowed in a valid URL?

Zoe
blez
  • 5
    Have you looked at the RFC? http://www.faqs.org/rfcs/rfc1738.html – ale Aug 18 '11 at 14:36
  • There's what's technically a valid URL and what's actually used as a URL today. Only 25% of the internet is even written in English. #2 and #4 languages are Chinese and Arabic. This answer to another question sums it up nicely: https://stackoverflow.com/a/36667242/1128668 – GlenPeterson Oct 13 '21 at 01:03

1 Answer


All the gory details can be found in the current RFC on the topic: RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax)

Based on this related answer, the list is: A-Z, a-z, 0-9, -, ., _, ~, :, /, ?, #, [, ], @, !, $, &, ', (, ), *, +, ,, ;, %, and =. Everything else must be URL-encoded. Some of these characters can also appear only in specific spots in a URI and must be URL-encoded everywhere else (e.g. % can only be used as part of URL encoding, as in %20). The RFC has all of these specifics.
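
As a rough illustration of how that character list could be applied to the original problem, here is a minimal Python sketch. The names `URI_CHARS`, `extract_url_candidates`, and `has_valid_percent_escapes` are made up for this example, and anchoring on an `http://` or `https://` prefix is an assumption on my part, not something the RFC requires. It only checks which characters appear, not whether each one sits in a spot where the RFC allows it.

```python
import re
from typing import List
from urllib.parse import quote

# Characters from the list above, written as the inside of a regex character class.
# (Hypothetical name; "-", "[" and "]" are escaped because they are special there.)
URI_CHARS = r"A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;%="

# Assumption: candidates start with http:// or https://, followed by allowed characters.
URL_RUN = re.compile(r"https?://[" + URI_CHARS + r"]+")

# A "%" that is NOT followed by two hex digits is not a valid percent-escape.
BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def extract_url_candidates(text: str) -> List[str]:
    """Return every run of allowed URI characters that follows an http(s):// prefix."""
    return URL_RUN.findall(text)


def has_valid_percent_escapes(candidate: str) -> bool:
    """True if every "%" in the candidate starts a %XX escape."""
    return BAD_PERCENT.search(candidate) is None


if __name__ == "__main__":
    sample = 'See http://127.0.0.1:8080/test?v=123#this "x" and <http://example.com/hello%20world> now'
    print(extract_url_candidates(sample))
    # ['http://127.0.0.1:8080/test?v=123#this', 'http://example.com/hello%20world']

    print(has_valid_percent_escapes("http://example.com/100%"))  # False

    # Anything outside the list has to be percent-encoded, including a literal "%".
    print(quote("hello world"))  # hello%20world
    print(quote("100%"))         # 100%25
```

Note that a character class like this also swallows trailing punctuation such as a sentence-ending `.` or a closing `)`, since both are valid URI characters, so real text usually needs some extra trimming after the match.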

ckittel
  • 10
    Note: this list doesn't include the percent sign – thomasrutter Aug 18 '15 at 04:52
  • 10
    That is correct @thomasrutter, a % is used for url-encoding. A % needs to be represented as %25 to be used in a URI. From the RFC: Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. – ckittel Aug 18 '15 at 20:38
  • 12
    Just to mention that some of those ('/','?','#','&','+') while valid, serve particular functionality in a URL with query component and are not treated as just regular chars – kofifus Jan 18 '16 at 05:26
  • 32
    `http://example.com/hello%20world` is a valid URL, therefore the character `%` is valid in a URL and should be in the list. – Martin Jambon Aug 01 '16 at 23:31
  • @MartinJambon FYI, I gave a response on that exact comment above already. – ckittel Aug 03 '16 at 19:41
  • 20
    @ckittel your response is at best ambiguous. Would you like to clarify what you think is correct? The question is what characters are valid in a URL. It's not asking which characters need to be escaped. Other characters than `%`, such as `/`, have a special meaning and need to be escaped for them to be part e.g. of path component data; but it's not the question. – Martin Jambon Aug 03 '16 at 21:00
  • 1
    A shorter summary (if I didn't make any mistakes): Valid values are the range '!' to 'z', except double quote ", backslash \, caret ^, less than <, greater than >, and backtick `. – Bryce Wagner Sep 09 '16 at 14:58
  • @ckittel, what does the symbol ';' mean in a URL? – Artanis Zeratul Aug 17 '20 at 01:16
  • @ArtanisZeratul - nothing different than any other character. For example, I might do https://example.org/search?facets=cat1:red;cat2:xl -- really it would be up to the application to decide what meaning to make from that character. – ckittel Aug 20 '20 at 19:57
  • ah ok. I thought it is something like the ? and & where you pass parameters together with the URL. – Artanis Zeratul Aug 21 '20 at 02:06
  • 1
    `/[A-Za-z0-9-._~:/?#\[\]@!$&'()*+,;%=]+/g` – yeah22 Apr 23 '22 at 05:30
  • @yeah22, you can skip the escape before the first "[". Updated version: `/[A-Za-z0-9-._~:/?#[\]@!$&'()*+,;%=]+/g` – Mohammad Kurjieh Aug 24 '23 at 22:32