0

I'm using a JavaScript regex to do the following:

I have the HTML content of a page saved in a string, and I want to match all URLs on the page.

For example, if the document contains:

<script src = "http://www.a.com">
<a href="http://www.b.com">
<a href= "http://www.c.com">
<a href ="http://www.d.com">

I want the matches to be:

http://www.a.com
http://www.b.com
http://www.c.com
http://www.d.com

Any help would be appreciated, thanks!

Tony Stark
  • Are your URLs really that simple, or will they contain parameters or longer paths? – Hemlock Jan 10 '11 at 01:53
  • /me facepalms http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Hello71 Jan 10 '11 at 02:40
  • @Hello71 I have done as you have asked, I have parsed the HTML with HTML5 Lib, I have fetched all the links, I have fixed all the encoding bugs, all the unknown unsupported unicode symbols and finally after weeks of work got those links from that html. Was it worth it? Maybe. Is the added complexity worth it? No it is not, parsing HTML is a lot harder than you think, HTML can contain other types of content and is extremely complicated, regex matching links might actually be the better answer here... that or a custom parser (which I also tried, great for really long texts). – Timo Huovinen Aug 05 '14 at 09:37

2 Answers

2

John Gruber has an excellent regex for URLs over at his site, Daring Fireball: http://daringfireball.net/2010/07/improved_regex_for_matching_urls

You can implement it like so:

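function regex(url) {
    // JavaScript has no inline (?i) modifier, so case-insensitivity is set with the trailing i flag.
    // Forward slashes inside a regex literal must be escaped as \/.
    var pattern = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i;
    return pattern.test(url);
}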
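
Note that test() only reports whether a string contains a URL. To collect every URL in the page source, as the question asks, the pattern needs the g flag and a call such as String.prototype.match. Here is a minimal sketch of that idea. The helper name findUrls and the short pattern are my own assumptions for illustration (it is not Gruber's regex), and it only handles URLs that start with http:// or https://:

```javascript
// Hypothetical helper: returns every http(s) URL found in a string.
function findUrls(text) {
    // Simplified pattern (an assumption, not Gruber's regex): scheme, then
    // everything up to whitespace, a quote, or an angle bracket.
    var pattern = /https?:\/\/[^\s"'<>]+/gi;
    // With the g flag, match() returns all matches, or null when there are none.
    return text.match(pattern) || [];
}

var html = [
    '<script src = "http://www.a.com">',
    '<a href="http://www.b.com">',
    '<a href= "http://www.c.com">',
    '<a href ="http://www.d.com">'
].join("\n");

console.log(findUrls(html));
// [ 'http://www.a.com', 'http://www.b.com', 'http://www.c.com', 'http://www.d.com' ]
```

A more thorough pattern can be swapped in the same way, as long as it carries the g flag.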
soren121
0
function isUrl(url) {
    var regexp = /(http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?/
    return regexp.test(url);
}

It is a bit more generic, but you may modify it for your needs.
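
One possible modification, if only the targets of href and src attributes are wanted, is to match the attribute itself and capture the quoted value. This is just a sketch: the function name extractLinkTargets is made up for the example, it assumes the values are always quoted, and it will also return relative paths, not only full URLs.

```javascript
// Hypothetical helper: returns the quoted values of all href and src attributes.
function extractLinkTargets(html) {
    // \s* on both sides of "=" tolerates spacing such as href ="..." or src = "...".
    // Group 1 remembers the opening quote so \1 can require the same closing quote.
    var pattern = /\b(?:href|src)\s*=\s*(["'])(.*?)\1/gi;
    var urls = [];
    var m;
    // With the g flag, exec() resumes after the previous match on each call.
    while ((m = pattern.exec(html)) !== null) {
        urls.push(m[2]);
    }
    return urls;
}

console.log(extractLinkTargets('<script src = "http://www.a.com"><a href ="http://www.d.com">'));
```

Because this keys on the attribute rather than on the shape of the URL, it does not pick up URLs that appear in plain text or inside scripts.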

angularrocks.com