Match link patterns in HTML code with a RegEx

Question

I'm using a linkify function that detects link-like patterns with a regex and wraps them in `<a>` tags to produce clickable links.

The regex looks like this:

    // http://, https://, ftp:// 
    var urlPattern = /\b(?![^<]*>|[^<>]*<\/)(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*[a-z0-9-+&@#\/%=~_|]/gim;
    /* Some explanations:
    (?!     # Negative lookahead start (will cause match to fail if contents match)
    [^<]*   # Any number of non-'<' characters
    >       # A > character
    |       # Or
    [^<>]*  # Any number of non-'<' and non-'>' characters
    </      # The characters < and /
     )      # End negative lookahead.
    */

and the link is replaced like this:

    return textInput.replace(urlPattern, '<a target="_blank" rel="noopener" href="$&">$&</a>');

The regex works perfectly for links in plain text. However, I am also using it on HTML code, such as

    <ul><li>Link: https://www.link.com</li></ul> //linkify not working
    <ul><li>Link: https://www.link.com <br/></li></ul> //linkify working

where only the second example works. I don't know why the behavior is different and would be very glad to get some help. What should my regex look like to linkify URLs in list elements without the break?

Suggestion: [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/a/1732454/2430549) — HoldOffHunger, Oct 19 '20 at 17:10

score 1 · Accepted Answer · edited Nov 22 '22 at 23:01

If I understood your issue correctly, this regex should detect the links in both scenarios:

    \b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)

Essentially, the first part, `\b(?![^<]*>)`, skips any position that lies inside a tag, i.e. one where a `>` comes before the next `<`.

Then we grab the parts of interest. The protocol is a non-capturing group, as in your original expression, so it can be stripped later if it is not needed. The last part is a capturing group that takes the rest of the URL.
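
To see why the original pattern treats the two inputs differently, it may help to test the two alternatives of its negative lookahead on their own, against the text that starts at the URL. This is only a sketch; the names `insideTag` and `beforeClosingTag` are made up for illustration and do not appear in the original code.

```javascript
// The two alternatives of the original negative lookahead, anchored with ^
// so they can be tested against the text that starts at the URL.
const insideTag = /^[^<]*>/;            // a '>' comes before the next '<'
const beforeClosingTag = /^[^<>]*<\//;  // only plain text before a closing tag

const inputs = [
  '<ul><li>Link: https://www.link.com</li></ul>',
  '<ul><li>Link: https://www.link.com <br/></li></ul>',
];

const results = inputs.map((html) => {
  const rest = html.slice(html.indexOf('https'));
  return {
    insideTag: insideTag.test(rest),
    beforeClosingTag: beforeClosingTag.test(rest),
  };
});

console.log(results);
```

In the first input the URL is followed directly by `</li>`, so the second alternative matches and the negative lookahead rejects the URL. In the second input the `<br/>` gets in the way, so neither alternative matches. Dropping the second alternative removes this restriction; the trade-off is that a URL that is already the text of an `<a>` element is no longer excluded.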

Because of the way we built the regex, we can now decide whether to take the entire URL (the full match) or just the second part (capture group 1).

[Screenshot of the match result]

To log the two parts, we can use this snippet:

    const str = '<ul><li>Link: https://www.link.com</li></ul>';
    const myRegexp = /\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)/gim;
    const match = myRegexp.exec(str);
    console.log(match[0]); // full URL: https://www.link.com
    console.log(match[1]); // without the protocol: www.link.com
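
Plugged into the replacement call from the question, the pattern can be checked against both inputs. This is a minimal sketch, assuming the function is called `linkify` and takes a single string; the question does not show its name or signature.

```javascript
// Regex from this answer: the lookahead only rejects matches inside a tag.
const urlPattern = /\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)/gim;

// Hypothetical wrapper around the replace call shown in the question.
function linkify(textInput) {
  return textInput.replace(
    urlPattern,
    '<a target="_blank" rel="noopener" href="$&">$&</a>'
  );
}

const first = linkify('<ul><li>Link: https://www.link.com</li></ul>');
const second = linkify('<ul><li>Link: https://www.link.com <br/></li></ul>');

console.log(first);
console.log(second);
```

Both inputs should come back with the URL wrapped in an `<a>` element, while the surrounding `<ul>`, `<li>` and `<br/>` tags stay untouched.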

Possible variations:

In a situation like the one presented above, you can simplify your regex further to:

    (?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)

and get the same output.

If the full URL is enough, you can remove the parentheses around the second group:

    (?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*
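
One caveat is worth checking before dropping the lookahead: without it, a URL inside an attribute such as `href` is matched as well. The sketch below compares the two patterns on a hypothetical input that already contains a link; the input string and the variable names are assumptions made for illustration.

```javascript
const withLookahead = /\b(?![^<]*>)(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*/gim;
const withoutLookahead = /(?:https?|ftp):\/\/[a-z0-9-+&@#\/%?=~_|!:,.;]*/gim;

// Hypothetical input: one URL in an attribute and one bare URL in the text.
const html = '<li><a href="https://a.example">A</a> and https://b.example</li>';

const guarded = html.match(withLookahead);
const unguarded = html.match(withoutLookahead);

console.log(guarded);   // only the bare URL
console.log(unguarded); // the attribute URL and the bare URL
```

So the simplified patterns fit inputs like the ones in the question, where no URL appears inside a tag; if the HTML can already contain links, keeping the lookahead looks like the safer choice.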

PS - I'm assuming that your examples were meant to be:

    <ul><li>Link: https://www.link.com</li></ul>
    <ul><li>Link: https://www.link.com <br/></li></ul>

i.e. with https, http or ftp, which makes the second case work with your original regex.

Thanks, the first shortened version works for me. It has just one drawback - if there is no space, but a sign such as "." or "," at the end of the link, it will be integrated in the link as well - i.e. in "The link is https://www.link.com." — Nixen85, Oct 20 '20 at 11:43
So the best result I get is with `var urlPattern = /\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*[a-z0-9-+&@#\/%=~_|])/gim;`. It takes care of irrelevant special characters at the end (i.e. periods and commas) and also detects markdown formatting such as `[url="https://link.to",name="linkname",title="description of the link"]`, which is relevant in my case. — Nixen85, Oct 20 '20 at 12:11
@Nixen85 Yep, happy you could tailor it exactly for your needs. I had just tested the examples provided. Thanks a lot and have a good day! — Antonino, Oct 20 '20 at 12:42
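
The trailing-punctuation behavior described in the comments can be checked directly. This is a brief sketch comparing the shortened pattern with the one from the comment, whose final character class leaves out `.`, `,`, `;`, `:`, `!` and `?`; the variable names are made up for illustration.

```javascript
const shortened = /(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*)/gim;
const trimmed = /\b(?![^<]*>)(?:https?|ftp):\/\/([a-z0-9-+&@#\/%?=~_|!:,.;]*[a-z0-9-+&@#\/%=~_|])/gim;

// Example sentence from the comment: the URL is followed by a period.
const sentence = 'The link is https://www.link.com.';
// Markup example from the comment: the URL is followed by a quote.
const markup = '[url="https://link.to",name="linkname",title="description of the link"]';

const greedy = sentence.match(shortened)[0];
const clean = sentence.match(trimmed)[0];
const fromMarkup = markup.match(trimmed)[0];

console.log(greedy);     // https://www.link.com.
console.log(clean);      // https://www.link.com
console.log(fromMarkup); // https://link.to
```

Because the last character must come from the narrower class, the regex backtracks off the final period and the match ends on the last character that can plausibly belong to the URL.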
