I have a regular expression that is meant to match website URLs:

.+\.\w\w.*(.*)

I would like to extract just the URL from a string, for example:

what is google.com?

When I run my code:

var x = /.+\.\w\w.*(.*)/
x.exec( "what is <http://google.com>?" )

it returns

["what is http://google.com?", ""]

instead of just the URL that I want it to match. Why is this happening?

deathknight256
  • Use a regexp testing site such as regex101 to test your expressions. –  Jun 18 '16 at 06:27

2 Answers

This is because your regex is not specific to URLs: .+ and .* match any characters at all, so the pattern matches far more than the URL.

For some inspiration on how to match URLs, you could have a look at the proposal from this StackOverflow answer:

https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,4}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)
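A minimal sketch of how the pattern above could be used with exec to pull out only the URL. The helper name extractUrl is hypothetical, and the sample input is the one from the question; the regex itself is copied unchanged from this answer.

```javascript
// Pattern copied from the answer above. The `g` flag is omitted on purpose
// so that exec() always starts searching at the beginning of the string.
const urlPattern = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,4}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)/;

// Hypothetical helper: returns the first URL found in `text`, or null.
function extractUrl(text) {
  const match = urlPattern.exec(text);
  // match[0] is the full match, i.e. the URL itself rather than the sentence.
  return match ? match[0] : null;
}

const found = extractUrl("what is <http://google.com>?");
const missing = extractUrl("what is google.com?"); // no http:// or https:// prefix

console.log(found);   // "http://google.com"
console.log(missing); // null
```

Two things to keep in mind with this pattern: it requires an http:// or https:// prefix, and its last character class allows ?, so for an input like "what is http://google.com?" the trailing question mark becomes part of the match.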
TimoStaudinger
  • Ah, so that's why. I tested your regex and it works perfectly with exec. One more question: I have an existing regex that uses my old URL pattern to match a sentence, /(is\s*(.+\.\w\w.*)\sdown?)/. How do I integrate your URL pattern into it? – deathknight256 Jun 18 '16 at 01:57
  • I tried /(is\s*([-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*))\sdown[?])/ – deathknight256 Jun 18 '16 at 02:15

Description

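In your expression, the . matches any character and the + and * quantifiers are greedy, so .+\.\w\w.* consumes the whole input, including the text before the URL. By the time the engine reaches the capture group (.*), nothing is left, so the group captures an empty string.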
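The behaviour can be reproduced in a few lines. This is only an illustrative sketch using the input from the question; the lazy variant is not part of the original post and is included to show that greediness is not the only issue, since . also matches the words before the URL.

```javascript
const input = "what is <http://google.com>?";

// The original pattern: the greedy .+ and .* swallow the whole input,
// leaving nothing for the capture group at the end.
const greedy = /.+\.\w\w.*(.*)/.exec(input);
console.log(greedy[0]); // the entire input string
console.log(greedy[1]); // "" (empty capture group)

// Illustrative variant: a lazy quantifier still begins matching at index 0,
// because "." accepts the leading words and spaces as well.
const lazy = /.+?\.\w\w/.exec(input);
console.log(lazy.index); // 0
console.log(lazy[0]);    // "what is <http://google.co"
```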

([-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6})\b([-a-zA-Z0-9@:%_\+.~#?&\/=]*)

This regular expression will do the following:

  • Finds strings that resemble URLs
  • Ignores any leading http or https
  • Splits the query substring from the URL

Example

Live Demo

https://regex101.com/r/kB1mS6/3

Sample text

what is <http://google.com>?
what is www.ibm.com?
are these the Droids.I.com?Lookingfor=Yes

Sample Matches

  • Capture group 0 gets the URL and the query string, if one exists
  • Capture group 1 gets the URL
  • Capture group 2 gets the query string, if one exists

MATCH 1
1.  [16-26] `google.com`
2.  [26-26] ``

MATCH 2
1.  [37-48] `www.ibm.com`
2.  [48-49] `?`

MATCH 3
1.  [64-76] `Droids.I.com`
2.  [76-91] `?Lookingfor=Yes`
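The matches listed above can be reproduced in JavaScript roughly as follows. This is a sketch that assumes an environment with String.prototype.matchAll (ES2020); the variable names are illustrative, and the g flag is added here so that every match in the sample text is returned.

```javascript
// Expression from this answer, with the global flag added for matchAll().
const pattern = /([-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6})\b([-a-zA-Z0-9@:%_\+.~#?&\/=]*)/g;

// The three sample lines, joined with newlines as in the live demo.
const sampleText = [
  "what is <http://google.com>?",
  "what is www.ibm.com?",
  "are these the Droids.I.com?Lookingfor=Yes",
].join("\n");

// Collect capture groups 1 and 2 plus the start offset of each match.
const matches = [...sampleText.matchAll(pattern)].map((m) => ({
  url: m[1],
  query: m[2],
  start: m.index,
}));

for (const { url, query, start } of matches) {
  console.log(`${start}: url="${url}" query="${query}"`);
}
```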

To further capture additional words in the sentence you can modify the expression:

([-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6})\b([-a-zA-Z0-9@:%_\+.~#?&\/=]*)(?:>?\s+(down))?

Examples

Live Demo

https://regex101.com/r/kB1mS6/4

Sample Text

what is <http://google.com> down?
what is www.ibm.com?
are these the Droids.I.com?Lookingfor=Yes
why is http://www.bing.com down?
why is www.bing.com down?

Sample Matches

MATCH 1
1.  `google.com`
2.  ``
3.  `down`

MATCH 2
1.  `www.ibm.com`
2.  `?`

MATCH 3
1.  `Droids.I.com`
2.  `?Lookingfor=Yes`

MATCH 4
1.  `www.bing.com`
2.  ``
3.  `down`

MATCH 5
1.  `www.bing.com`
2.  ``
3.  `down`
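As a rough sketch of how the extended expression might be used in code, the snippet below wraps it in a small function. The function name parseSiteQuestion and the shape of the returned object are assumptions made for illustration; only the regex comes from this answer.

```javascript
// Extended expression from this answer: group 3 is "down" when the word
// follows the URL (optionally after a closing ">").
const downPattern = /([-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6})\b([-a-zA-Z0-9@:%_\+.~#?&\/=]*)(?:>?\s+(down))?/;

// Hypothetical helper: extracts the URL and reports whether the sentence
// asks if the site is down. Returns null when no URL-like text is found.
function parseSiteQuestion(sentence) {
  const m = downPattern.exec(sentence);
  if (!m) return null;
  return {
    url: m[1],
    query: m[2],
    // Group 3 is undefined when the optional "down" part did not match.
    asksIfDown: m[3] === "down",
  };
}

console.log(parseSiteQuestion("what is <http://google.com> down?"));
console.log(parseSiteQuestion("what is www.ibm.com?"));
console.log(parseSiteQuestion("why is www.bing.com down?"));
```

Checking m[3] rather than searching the sentence for "down" separately keeps the test tied to the URL: the word only counts when it directly follows the matched address.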

This slightly modifies the expression from https://stackoverflow.com/a/3809435/3836229 to separately capture the URL.

Ro Yo Mi