22

How can I write a regular expression that matches every string that is a valid URI and rejects every string that is not?

To be specific about what I mean by URI, I have added a link below to the most current URI RFC standard. It defines the entity that I want to validate with a regular expression.

I don't need the expression to parse the URI; I only need it to validate one.

The .NET regular expression format is preferred (.NET 1.1).


My Current Solution:

^([a-zA-Z0-9+.-]+):(//([a-zA-Z0-9-._~!$&'()*+,;=:]*)@)?([a-zA-Z0-9-._~!$&'()*+,;=]+)(:(\d*))?(/?[a-zA-Z0-9-._~!$&'()*+,;=:/]+)?(\?[a-zA-Z0-9-._~!$&'()*+,;=:/?@]+)?(#[a-zA-Z0-9-._~!$&'()*+,;=:/?@]+)?$
JΛYDΞV
alumb

6 Answers

29

Does Uri.IsWellFormedUriString work for you?

bdukes
  • 2
    +1 This is the only correct answer. Regex is not the correct tool for the job. I always wonder how standards compliant and secure parsing an Uri with regex can be. Dealing with internationalized (unicode) domains? Encodings that obfuscate the true path? Fault-tolerant? Tested? Just use the .net framework! – oɔɯǝɹ Apr 27 '14 at 15:35
19

The URI specification says:

The following line is the regular expression for breaking-down a well-formed URI reference into its components.

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

(I guess that's the same regex as in the STD66 link given in another answer.)

But breaking-down is not validating. To correctly validate a URI, one would have to translate the BNF for URIs to a regex. While some BNFs cannot be expressed as regular expressions, I think with this one it could be done. But it shouldn't be done - it would be a huge mess. It's better to use a library function.
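
To illustrate why breaking down is not validating, here is a minimal sketch in Python (an assumption on my part; the question asks about .NET) that applies the regex quoted above. The helper name `split_uri` is made up for this example. The group numbers 2, 4, 5, 7 and 9 are the ones RFC 3986 Appendix B assigns to scheme, authority, path, query and fragment.

```python
import re

# Regex quoted above from RFC 3986, Appendix B (meant for parsing, not validation).
URI_SPLIT = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(text):
    """Return the five components the RFC maps to groups 2, 4, 5, 7 and 9."""
    m = URI_SPLIT.match(text)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


# Example URI taken from the RFC's own illustration of this regex.
print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))

# Arbitrary text is "parsed" as well: everything lands in the path component.
print(split_uri("not a uri at all"))
```

Because every group is optional, the pattern finds a match in any input, so a successful match says nothing about whether the string is a valid URI.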

12

This site looks promising: http://snipplr.com/view/6889/regular-expressions-for-uri-validationparsing/

They propose the following regex:

/^([a-z0-9+.-]+):(?://(?:((?:[a-z0-9-._~!$&'()*+,;=:]|%[0-9A-F]{2})*)@)?((?:[a-z0-9-._~!$&'()*+,;=]|%[0-9A-F]{2})*)(?::(\d*))?(/(?:[a-z0-9-._~!$&'()*+,;=:@/]|%[0-9A-F]{2})*)?|(/?(?:[a-z0-9-._~!$&'()*+,;=:@]|%[0-9A-F]{2})+(?:[a-z0-9-._~!$&'()*+,;=:@/]|%[0-9A-F]{2})*)?)(?:\?((?:[a-z0-9-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*))?(?:#((?:[a-z0-9-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*))?$/i
Daren Thomas
  • I used this trick in PHP. For PHP coders, see here: https://stackoverflow.com/questions/206059/php-validation-regex-for-url – Xavi Montero Dec 13 '16 at 23:38
  • Doesn't work for example "ldap://[2001:db8::7]/c=GB?objectClass?one" – ms_devel Feb 27 '20 at 14:27
  • 1
    @msavara right. it's probably best to use bdukes answer! – Daren Thomas Mar 02 '20 at 13:27
  • 1
    Scheme names, AFAIK, cannot start with `+`, `-`, or `.`, but only with an alphabetic character. See [RFC3986 (section 3.1)](https://datatracker.ietf.org/doc/html/rfc3986#section-3). – Gwyneth Llewelyn Feb 01 '22 at 09:49
  • That one is pretty good, but it misses urls with `email:password`. e.g `scheme://user@example.com:pass@exaple.com/`. See @Lostfields response if you need to catch those. – HuBeZa Feb 09 '22 at 10:08
  • 1
    @HuBeZa we probably shouldn't be using regexes for this anyway. This was a long time ago. I'd use the library function bdukes mentioned. – Daren Thomas Feb 10 '22 at 15:55
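
The reports in the comments can be checked by running the pattern against a few inputs. This is only a sketch in Python, not the original PHP/JavaScript form, so the `/.../i` delimiters are replaced by `re.IGNORECASE`. The names `SNIPPLR_URI` and `matches` are made up for the example, and the `http` and `+foo` inputs are my own test strings.

```python
import re

# Pattern from the answer above, split across lines only for readability.
SNIPPLR_URI = re.compile(
    r"^([a-z0-9+.-]+):(?://(?:((?:[a-z0-9-._~!$&'()*+,;=:]|%[0-9A-F]{2})*)@)?"
    r"((?:[a-z0-9-._~!$&'()*+,;=]|%[0-9A-F]{2})*)(?::(\d*))?"
    r"(/(?:[a-z0-9-._~!$&'()*+,;=:@/]|%[0-9A-F]{2})*)?"
    r"|(/?(?:[a-z0-9-._~!$&'()*+,;=:@]|%[0-9A-F]{2})+"
    r"(?:[a-z0-9-._~!$&'()*+,;=:@/]|%[0-9A-F]{2})*)?)"
    r"(?:\?((?:[a-z0-9-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*))?"
    r"(?:#((?:[a-z0-9-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*))?$",
    re.IGNORECASE,
)


def matches(text):
    """True if the pattern accepts the string."""
    return SNIPPLR_URI.match(text) is not None


# An ordinary URI with userinfo, port, path, query and fragment is accepted.
print(matches("http://user@example.com:8080/path?q=1#frag"))

# The IPv6 literal from the comments is rejected: "[" and "]" are not in the host class.
print(matches("ldap://[2001:db8::7]/c=GB?objectClass?one"))

# A scheme starting with "+" is accepted, although RFC 3986 requires a letter first.
print(matches("+foo:bar"))
```

Both problems raised in the comments show up here: the pattern has no branch for bracketed IP literals, and its scheme class does not force a leading letter.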
9

The best regex I came up with according to RFC 3986 (https://www.rfc-editor.org/rfc/rfc3986) was the following:

Flow diagram of regex using https://regexper.com

// named groups
/^(?<scheme>[a-z][a-z0-9+.-]+):(?<authority>\/\/(?<user>[^@]+@)?(?<host>[a-z0-9.\-_~]+)(?<port>:\d+)?)?(?<path>(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@])+(?:\/(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@])*)*|(?:\/(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@])+)*)?(?<query>\?(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@]|[/?])+)?(?<fragment>\#(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@]|[/?])+)?$/i

// unnamed groups
/^([a-z][a-z0-9+.-]+):(\/\/([^@]+@)?([a-z0-9.\-_~]+)(:\d+)?)?((?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@])+(?:\/(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@])*)*|(?:\/(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@])+)*)?(\?(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@]|[/?])+)?(\#(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@]|[/?])+)?$/i

capture groups

  1. scheme
  2. authority
  3. userinfo
  4. host
  5. port
  6. path
  7. query
  8. fragment
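
For anyone who wants to try the named-group version outside JavaScript, here is a rough Python sketch. The only change to the pattern is that Python spells named groups `(?P<name>...)` and not `(?<name>...)`. The names `RFC3986_LIKE` and `parse` and the sample inputs (apart from the `ldap` URI quoted elsewhere in this thread) are assumptions made for the example.

```python
import re

# Named-group pattern from above, with (?<name>...) rewritten as (?P<name>...).
RFC3986_LIKE = re.compile(
    r"^(?P<scheme>[a-z][a-z0-9+.-]+):"
    r"(?P<authority>\/\/(?P<user>[^@]+@)?(?P<host>[a-z0-9.\-_~]+)(?P<port>:\d+)?)?"
    r"(?P<path>(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@])+"
    r"(?:\/(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@])*)*"
    r"|(?:\/(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@])+)*)?"
    r"(?P<query>\?(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@]|[/?])+)?"
    r"(?P<fragment>\#(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@]|[/?])+)?$",
    re.IGNORECASE,
)


def parse(text):
    """Return the named groups as a dict, or None if the pattern rejects the text."""
    m = RFC3986_LIKE.match(text)
    return m.groupdict() if m else None


# The captured groups keep their delimiters ("//", "@", ":", "?", "#").
print(parse("https://user:pw@example.com:8080/a/b?x=1#top"))

# Inputs this pattern rejects as written: an IPv6 host and a one-letter scheme.
print(parse("ldap://[2001:db8::7]/c=GB?objectClass?one"))
print(parse("a:b"))
```

Note that the host class has no bracketed IP-literal branch and the scheme needs at least two characters, so this pattern is stricter than RFC 3986 in those two places.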
Lostfields
8

The best and most definitive guide to this I have found is here: http://jmrware.com/articles/2009/uri_regexp/URI_regex.html (In answer to your question, see the URI table entry)

All of these rules from RFC3986 are reproduced in Table 2 along with a regular expression implementation for each rule.

A JavaScript implementation of this is available here: https://github.com/jhermsmeier/uri.regex

For reference, the URI regex is repeated below:

# RFC-3986 URI component:  URI
[A-Za-z][A-Za-z0-9+\-.]* :                                      # scheme ":"
(?: //                                                          # hier-part
  (?: (?:[A-Za-z0-9\-._~!$&'()*+,;=:]|%[0-9A-Fa-f]{2})* @)?
  (?:
    \[
    (?:
      (?:
        (?:                                                    (?:[0-9A-Fa-f]{1,4}:)    {6}
        |                                                   :: (?:[0-9A-Fa-f]{1,4}:)    {5}
        | (?:                            [0-9A-Fa-f]{1,4})? :: (?:[0-9A-Fa-f]{1,4}:)    {4}
        | (?: (?:[0-9A-Fa-f]{1,4}:){0,1} [0-9A-Fa-f]{1,4})? :: (?:[0-9A-Fa-f]{1,4}:)    {3}
        | (?: (?:[0-9A-Fa-f]{1,4}:){0,2} [0-9A-Fa-f]{1,4})? :: (?:[0-9A-Fa-f]{1,4}:)    {2}
        | (?: (?:[0-9A-Fa-f]{1,4}:){0,3} [0-9A-Fa-f]{1,4})? ::    [0-9A-Fa-f]{1,4}:
        | (?: (?:[0-9A-Fa-f]{1,4}:){0,4} [0-9A-Fa-f]{1,4})? ::
        ) (?:
            [0-9A-Fa-f]{1,4} : [0-9A-Fa-f]{1,4}
          | (?: (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) \.){3}
                (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
          )
      |   (?: (?:[0-9A-Fa-f]{1,4}:){0,5} [0-9A-Fa-f]{1,4})? ::    [0-9A-Fa-f]{1,4}
      |   (?: (?:[0-9A-Fa-f]{1,4}:){0,6} [0-9A-Fa-f]{1,4})? ::
      )
    | [Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&'()*+,;=:]+
    )
    \]
  | (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
       (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
  | (?:[A-Za-z0-9\-._~!$&'()*+,;=]|%[0-9A-Fa-f]{2})*
  )
  (?: : [0-9]* )?
  (?:/ (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})* )*
| /
  (?:    (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})+
    (?:/ (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})* )*
  )?
|        (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})+
    (?:/ (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})* )*
|
)
(?:\? (?:[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})* )?   # [ "?" query ]
(?:\# (?:[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})* )?   # [ "#" fragment ]
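
The pattern is written in free-spacing style, so it needs an engine option that ignores layout whitespace and `#` comments. As a sketch, assuming Python in place of .NET, `re.VERBOSE` does this, and `fullmatch` supplies the anchoring that the pattern itself lacks. The names `URI_PATTERN` and `is_valid_uri` and the test strings are made up for the example.

```python
import re

# The pattern from above, unchanged. re.VERBOSE ignores the layout
# whitespace and treats everything after an unescaped "#" as a comment.
URI_PATTERN = re.compile(r"""
# RFC-3986 URI component:  URI
[A-Za-z][A-Za-z0-9+\-.]* :                                      # scheme ":"
(?: //                                                          # hier-part
  (?: (?:[A-Za-z0-9\-._~!$&'()*+,;=:]|%[0-9A-Fa-f]{2})* @)?
  (?:
    \[
    (?:
      (?:
        (?:                                                    (?:[0-9A-Fa-f]{1,4}:)    {6}
        |                                                   :: (?:[0-9A-Fa-f]{1,4}:)    {5}
        | (?:                            [0-9A-Fa-f]{1,4})? :: (?:[0-9A-Fa-f]{1,4}:)    {4}
        | (?: (?:[0-9A-Fa-f]{1,4}:){0,1} [0-9A-Fa-f]{1,4})? :: (?:[0-9A-Fa-f]{1,4}:)    {3}
        | (?: (?:[0-9A-Fa-f]{1,4}:){0,2} [0-9A-Fa-f]{1,4})? :: (?:[0-9A-Fa-f]{1,4}:)    {2}
        | (?: (?:[0-9A-Fa-f]{1,4}:){0,3} [0-9A-Fa-f]{1,4})? ::    [0-9A-Fa-f]{1,4}:
        | (?: (?:[0-9A-Fa-f]{1,4}:){0,4} [0-9A-Fa-f]{1,4})? ::
        ) (?:
            [0-9A-Fa-f]{1,4} : [0-9A-Fa-f]{1,4}
          | (?: (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) \.){3}
                (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
          )
      |   (?: (?:[0-9A-Fa-f]{1,4}:){0,5} [0-9A-Fa-f]{1,4})? ::    [0-9A-Fa-f]{1,4}
      |   (?: (?:[0-9A-Fa-f]{1,4}:){0,6} [0-9A-Fa-f]{1,4})? ::
      )
    | [Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&'()*+,;=:]+
    )
    \]
  | (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
       (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
  | (?:[A-Za-z0-9\-._~!$&'()*+,;=]|%[0-9A-Fa-f]{2})*
  )
  (?: : [0-9]* )?
  (?:/ (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})* )*
| /
  (?:    (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})+
    (?:/ (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})* )*
  )?
|        (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})+
    (?:/ (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})* )*
|
)
(?:\? (?:[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})* )?   # [ "?" query ]
(?:\# (?:[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})* )?   # [ "#" fragment ]
""", re.VERBOSE)


def is_valid_uri(text):
    """True if the whole string matches the pattern (no partial matches)."""
    return URI_PATTERN.fullmatch(text) is not None


for candidate in (
    "ldap://[2001:db8::7]/c=GB?objectClass?one",  # IPv6 literal host
    "mailto:John.Doe@example.com",                # no authority part
    "http://example.com/a b",                     # raw space is not allowed
    "//example.com/path",                         # relative reference, no scheme
):
    print(candidate, is_valid_uri(candidate))
```

In .NET the corresponding option would be `RegexOptions.IgnorePatternWhitespace`, with explicit `^` and `$` anchors added around the pattern.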
papercowboy
  • 14
    sweet mother of all regex, and my co-workers think I'm nuts for using and understanding regex... – Russ Nov 04 '14 at 18:17
  • That's absolutely insane! The best example ever to show that, although regexs _are_ useful in a lot of cases, they are _not_ to be used for _everything_. Hammers and nails, screwdrivers and screws. – Gwyneth Llewelyn Feb 01 '22 at 09:40
1

Are there some specific URIs you care about or are you trying to find a single regex that validates STD66?

I was going to point you to this regex for parsing a URI. You could then, in theory, check to see if all of the elements you care about are there.

But I think bdukes's answer is better.

Mark Biek