Regex that only captures http/https in plain text

Question

I currently have str.match(/(http[^\s]+)/i), which captures not only links in the content but also the URLs inside img tags (src="http...") and anchor tags (href="http...").

How do I modify my regex so that it matches only "http/s" that has no "src=" or "href=" before it?

http://stackoverflow.com/questions/4643142/regex-to-test-if-string-begins-with-http-or-https — caramba, Apr 22 '15 at 20:15
May be easiest to just get all text nodes first and search only those but it depends on what you're doing. — Explosion Pills, Apr 22 '15 at 20:15
Parsing HTML with regular expressions may not be a good idea. You should get the proper elements first, then the text from those elements, and only then apply a regex. — adeneo, Apr 22 '15 at 20:18

score 3 · Answer 1 · edited Apr 23 '15 at 01:35


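You can require a whitespace character before the URL. Inside href or src the URL is preceded by a quote or an =, not by whitespace, whereas in normal text it usually follows a space. The (?:^|\s) group also accepts a URL at the very start of the string. The full match includes the leading whitespace; the URL alone is in capture group 1.

str.match(/(?:^|\s)(http[^\s]+)/i)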
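To collect every such URL rather than only the first, the same idea can be used with the g flag and matchAll. This is a minimal sketch, not the answerer's code: the helper name findPlainTextUrls and the sample string are made up for illustration, and stopping the URL at < (so a following tag is not swallowed) is my own assumption.

```javascript
// Hypothetical helper: collect every URL that follows whitespace
// or sits at the very start of the string.
function findPlainTextUrls(str) {
  // [^\s<]+ stops at whitespace or "<" (assumption: URLs never contain "<").
  const pattern = /(?:^|\s)(https?:\/\/[^\s<]+)/gi;
  // matchAll requires the g flag; capture group 1 is the URL
  // without the leading whitespace.
  return Array.from(str.matchAll(pattern), (m) => m[1]);
}

const sample =
  'Visit http://example.com/page today. <img src="http://example.com/a.png"> ' +
  '<a href="https://example.com/b">link</a>';

console.log(findPlainTextUrls(sample));
```

Note that this still picks up a URL used as link text if it follows whitespace inside the anchor, so it only filters attribute values.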

Also see DEMO

edited Apr 23 '15 at 01:35

MaxZoom


answered Apr 22 '15 at 20:15

ByteHamster


score 1 · Answer 2 · answered Apr 22 '15 at 20:18


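You can match only links that are not directly preceded by an = or a quote. The (?:^|...) alternative lets a URL at the very start of the string match as well, and the URL itself is in capture group 1:

str.match(/(?:^|[^="'])(http[^\s]+)/i)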
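In engines that support ES2018 lookbehind, the same check can be written without consuming the preceding character. The following is only a sketch of that variation: the names notInAttribute and findUnquotedUrls and the sample markup are hypothetical, and excluding quotes and angle brackets from the URL body is an assumption of mine.

```javascript
// Sketch: reject URLs directly preceded by "=", a double quote, or a single quote.
// (?<!...) is a negative lookbehind; it consumes no characters, so a URL at
// the very start of the string still matches and no capture group is needed.
const notInAttribute = /(?<![="'])https?:\/\/[^\s"'<>]+/gi;

function findUnquotedUrls(str) {
  // With the g flag, match() returns all full matches, or null when none are found.
  return str.match(notInAttribute) || [];
}

const html =
  'http://example.com/start <img src="http://example.com/a.png"> ' +
  "<a href='https://example.com/b'>b</a> see https://example.org/docs";

console.log(findUnquotedUrls(html));
```

Like the character-class version, this only looks at one preceding character, so a URL used as link text between <a> and </a> is still matched.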

answered Apr 22 '15 at 20:18

Dmitry Sadakov


score 0 · Answer 3 · answered Apr 22 '15 at 20:54

A simple http[^\s]+ (equivalent to http\S+) overmatches.

I'd suggest using a regex that matches text outside of tags and whitelists the tags in which the text is allowed to appear. Here is the regex:

/(?![^<]*>|[^<>]*<\/(?!p\b|td|pre))https?:\/\/[a-z0-9&#=.\/\-?_]+/gi

The (?!p\b|td|pre) part is where we add whitelisted tags. Because the character class does not include a comma, the regex won't capture the trailing comma in http://example.com,
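To see how the two lookahead branches behave, here is a small check against made-up markup. The variable names and the sample page are hypothetical; the regex is the one from this answer, unchanged.

```javascript
// The regex from the answer: the first lookahead branch ([^<]*>) rejects
// positions inside a tag; the second rejects text whose next tag is a
// closing tag other than the whitelisted p, td or pre.
const outsideTags =
  /(?![^<]*>|[^<>]*<\/(?!p\b|td|pre))https?:\/\/[a-z0-9&#=.\/\-?_]+/gi;

const page =
  '<p>Docs at http://example.com/docs, mirrored elsewhere.</p>' +
  '<img src="http://example.com/a.png">' +
  '<a href="http://example.com/b">http://example.com/linktext</a>';

// With the g flag, match() returns every full match (or null).
const urls = page.match(outsideTags) || [];
console.log(urls);
```

Only the URL inside the <p> element should be reported here: the src and href values are inside tags, and the link text is followed by </a>, which is not whitelisted.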

See demo
