I have a paragraph of text which may contain some links in plain text, or some links which are actually links.

For example:

Posting a link: http://test.com, posting an image <img src="http://test.com/2.jpg" />. Posting an actual A tag: <a href="http://test.com/test.html">http://test.com/test.html</a>

I need to fish out the unformatted links from this piece of text, so I am looking for a regular expression that matches the first case but not the second or third, because those are already well-formatted links.

I've managed to fish out all the links with this regex: ((http:|https:)\/\/[a-zA-Z0-9&#=.\/\-?_]+). However, I am still having trouble distinguishing between the cases.

This needs to be in JavaScript, so I don't think negative lookbehind is allowed.

Any help would be appreciated.

EDIT: I'm trying to wrap the fished out unformatted links in an a tag.

ketan
l3utterfly

1 Answer

You can use this regex to get URLs outside of tags:

(?![^<]*>|[^<>]*<\/)((http:|https:)\/\/[a-zA-Z0-9&#=.\/\-?_]+)

See demo

With the case-insensitive i flag, it can be shortened a bit:

(?![^<]*>|[^<>]*<\/)((https?:)\/\/[a-z0-9&#=.\/\-?_]+)

See another demo

Sample code:

// Matches http(s) URLs that are not inside a tag and whose following text does not run into a closing tag.
var re = /(?![^<]*>|[^<>]*<\/)((https?:)\/\/[a-z0-9&#=.\/\-?_]+)/gi;
var str = 'Posting a link: http://test.com, posting an image <img src="http://test.com/2.jpg" />. Posting an actual A tag: <a href="http://test.com/test.html">http://test.com/test.html</a>';
// exec() returns the first match; capture group 1 holds the URL.
var val = re.exec(str);
document.getElementById("res").innerHTML = "<b>URL Found</b>: " + val[1];
// Wrap every bare URL in an A tag; $1 refers to the captured URL.
var subst = '<a href="$1">$1</a>';
var result = str.replace(re, subst);
document.getElementById("res").innerHTML += "<br><b>Replacement Result</b>: " + result;
<div id="res"></div>
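The sample above only reports the first match. If the goal is to collect every unformatted link, a loop over exec() is one way to do it. This is a minimal sketch: the helper name findBareUrls and the input string are made up for illustration and are not from the question.

```javascript
// Same pattern as in the sample code above.
var re = /(?![^<]*>|[^<>]*<\/)((https?:)\/\/[a-z0-9&#=.\/\-?_]+)/gi;

// Hypothetical helper: returns every URL the pattern treats as unformatted.
function findBareUrls(html) {
  var urls = [];
  var m;
  re.lastIndex = 0; // a global regex keeps its position between exec() calls
  while ((m = re.exec(html)) !== null) {
    urls.push(m[1]); // group 1 is the URL itself
  }
  return urls;
}

// Illustrative input (an assumption, not from the original post).
var found = findBareUrls('See http://test.com and https://example.org/page?x=1 or <a href="http://test.com/t.html">http://test.com/t.html</a>');
console.log(found);
```

Note that the lookahead rejects any URL that is followed by a `>` before the next `<`, so a stray `>` in plain text after a link would also suppress that match.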

Update:

To allow capturing inside specific tags, you can whitelist them like this:

var re = /(?![^<]*>|[^<>]*<\/(?!(?:p|pre)>))((https?:)\/\/[a-z0-9&#=.\/\-?_]+)/gi;
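
As a quick check of the whitelist, the sketch below runs that regex over a made-up string (an assumption for illustration) containing one URL inside a p element and one inside an A tag. Only the first should be wrapped.

```javascript
// Whitelist regex from the update: text that runs into </p> or </pre> is still allowed.
var re = /(?![^<]*>|[^<>]*<\/(?!(?:p|pre)>))((https?:)\/\/[a-z0-9&#=.\/\-?_]+)/gi;

// Hypothetical sample input, not from the original question.
var str = '<p>http://test.com/a</p> <a href="http://test.com/b">http://test.com/b</a>';

// Wrap each matched URL in an A tag; $1 is the captured URL.
var result = str.replace(re, '<a href="$1">$1</a>');
console.log(result);
```

The URL in the href attribute is skipped because a `>` follows it before any `<`, and the link text is skipped because it runs straight into `</a>`, which is not on the whitelist.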
Wiktor Stribiżew
  • I think this answer misses a case: what if the link is inside a normal tag, i.e. not an A tag? For example, `http://test.com` wrapped in an ordinary container element. – l3utterfly Apr 21 '15 at 09:02
  • Well, the best approach here would be to use an HTML parser together with a regex. The tag in your example is a container tag. If you do not want to use an HTML parser, you can still try a regex: `var re = /(?![^<]*>|[^<>]*<\/(?!(?:p|pre)>))((https?:)\/\/[a-z0-9=.\/\-?_]+)/gi;`. Please check https://regex101.com/r/mU5rR8/3. – Wiktor Stribiżew Apr 21 '15 at 09:17
  • I think I see how you are doing it. I really hoped to avoid an HTML parser. Actually, I just want to NOT match links within an A tag and NOT match links that are part of an HTML attribute. So links inside malformed tags such as [link] can be matched, and inside pre as well. – l3utterfly Apr 21 '15 at 10:15
  • (?![^<]*>|[^<>]*<\/(?!(?:[^a])))((https?:)\/\/[a-z0-9=.\/\-?_]+) : I wanted to parse everything not inside a tags – Bhumi Singhal Jan 27 '16 at 09:16
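
For the requirement raised in the comments (skip links inside A tags and inside attributes, but match everywhere else), one alternative to a single lookahead is to split the string into tags and text runs and track whether an A element is open. This is a different technique from the regex in the answer, offered only as a rough sketch: the function name linkifyOutsideAnchors is hypothetical, and it assumes attribute values contain no `>` and ignores comments and script blocks.

```javascript
// Hypothetical helper, not from the thread: linkify bare URLs everywhere
// except inside tags (attributes) and inside <a>...</a> elements.
function linkifyOutsideAnchors(html) {
  var urlRe = /https?:\/\/[a-z0-9&#=.\/\-?_]+/gi;
  var depth = 0; // number of currently open <a> elements

  // split() with a capture group keeps the tags: odd indices are tags, even indices are text.
  return html.split(/(<[^>]*>)/).map(function (part, i) {
    if (i % 2 === 1) {
      if (/^<a[\s>]/i.test(part)) depth++;                      // opening <a ...>
      else if (/^<\/a\s*>/i.test(part) && depth > 0) depth--;   // closing </a>
      return part; // tags are never modified, so attributes stay untouched
    }
    // Text run: leave it alone inside an A element, otherwise wrap each URL ($& is the whole match).
    return depth > 0 ? part : part.replace(urlRe, '<a href="$&">$&</a>');
  }).join('');
}

// Illustrative input (an assumption, not from the original post).
var out = linkifyOutsideAnchors('<p>http://test.com/p</p> <a href="http://test.com/t">http://test.com/t</a> <img src="http://test.com/2.jpg" />');
console.log(out);
```

Because the decision depends on the open A element rather than on the next closing tag, this variant needs no whitelist of container tags.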