
I'm not sure whether a URL (https://www.rfc-editor.org/rfc/rfc3986) can be expressed as a regular expression at all, but what would be the most robust and formal regular expression for a URL?

There are many regexp dialects (Perl, Emacs Lisp, PHP, Python, etc.); any dialect is acceptable.

OTZ
  • There's a regex in Appendix B in the RFC. Is that what you want? – kennytm Oct 01 '13 at 14:53
  • @KennyTM That regular expression is only useful for dissecting a URL; it won’t help much for finding a URL in text. – Gumbo Oct 01 '13 at 15:06
  • What are you trying to do? Find URLs in a block of text? Or validate a URL that you've been given? – Andy Lester Oct 01 '13 at 16:46

1 Answer

  ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
   12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to

  http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:

  $1 = http:
  $2 = http
  $3 = //www.ics.uci.edu
  $4 = www.ics.uci.edu
  $5 = /pub/ietf/uri/
  $6 = <undefined>
  $7 = <undefined>
  $8 = #Related
  $9 = Related

where <undefined> indicates that the component is not present, as is the case for the query component in the above example. Therefore, we can determine the value of the five components as

  scheme    = $2
  authority = $4
  path      = $5
  query     = $7
  fragment  = $9

via https://www.rfc-editor.org/rfc/rfc3986#appendix-B
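
To make the group numbering concrete, here is a minimal Python sketch that applies the Appendix B expression to the RFC's own example. The pattern is copied from the RFC; the names `URI_PATTERN` and `split_uri` are my own and purely illustrative. Python reports a group that did not participate in the match as `None`, which plays the role of `<undefined>` above.

```python
import re

# Regular expression from RFC 3986, Appendix B (copied verbatim).
URI_PATTERN = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Split a URI reference into the five RFC 3986 components.

    Hypothetical helper for illustration; absent components are None.
    """
    # Every group in the pattern is optional, so match() succeeds on
    # any input string, including the empty string.
    match = URI_PATTERN.match(uri)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
for name, value in parts.items():
    print(f"{name:<9} = {value!r}")
```

Since every subexpression is optional, the pattern matches any string you hand it. It is a tool for dissecting a URI reference into components, not for validating one or for finding URLs in running text.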

19greg96
  • I understand that it is given in Appendix B, but it is overly broad and not robust enough. For example, the regexp also matches 'http://www.ics.uci.edu/pub/ietf/uri/# Related', 'http://www.ics.uci.edu/pub/ietf/uri /#Related', etc. – OTZ Oct 01 '13 at 15:29
  • Then how do you define `URL`? – 19greg96 Oct 01 '13 at 15:37