Extract specific URLs from text

Question

I want to extract URLs from this text:

<body>
<a href="http://domaine.com/t/text/text"> <img src="http://domaine.com/i/text/text"></a> <br>
<a href="http://domaine.com/text"></a> <br>
<a href="http://domaine.com"></a> <br>
<a href="http://domaine.com/text/text"></a> <br>
<a href="http://[GoTo]"></a> <br>
<a href="http://[NextURL]"></a> <br>
</body>

But I want to exclude URLs that match certain patterns from the results. Those patterns are:

http://***/i/***/***
http://***/t/***/***
http://[GoTo]
http://[NextURL]

which means I should get only these URLs as a result:

http://domaine.com/text
http://domaine.com
http://domaine.com/text/text

What I have so far is this regex:

$regex = '/https?\:\/\/[^\" ]+/i';
preg_match_all($regex, $string, $matches);
print_r($matches[0]);

But as you can see, it extracts all the URLs, and I don't know how to exclude the ones that match my specific patterns.

https://stackoverflow.com/a/1732454/3498950 – spencer.sm, Jun 26 '17 at 17:05

score 2 · Accepted Answer · answered Jun 26 '17 at 17:02

What you are looking for is a negative lookahead:

$regex = '/https?:\/\/(?!\[GoTo\]|\[NextURL\]|[^\" ]*\/i\/[^\" ]+|[^\" ]*\/t\/[^\" ]*)[^\" ]+/i';

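The (?! ... ) group is a negative lookahead. Placed right after the scheme, it makes the match fail whenever the text at that position matches one of the enclosed alternatives, so those URLs are skipped. Note that the /i/ and /t/ alternatives reject a URL with such a segment anywhere after the scheme, which is broader than the patterns in the question. This might need tweaking for specific corner cases, but with the problem as stated, this should get you what you need.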
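To check the lookahead against the sample from the question, here is a minimal sketch. It runs the regex above as-is, then a stricter variant that is not part of the original answer: $strictRegex is a hypothetical name, it uses ~ as the delimiter to avoid escaping slashes, and it assumes each *** in the question's patterns stands for one non-empty path segment (or the host), so /i/ or /t/ is only rejected directly after the host and when at least two more segments follow.

```php
<?php
// Sample input copied from the question.
$string = <<<'HTML'
<body>
<a href="http://domaine.com/t/text/text"> <img src="http://domaine.com/i/text/text"></a> <br>
<a href="http://domaine.com/text"></a> <br>
<a href="http://domaine.com"></a> <br>
<a href="http://domaine.com/text/text"></a> <br>
<a href="http://[GoTo]"></a> <br>
<a href="http://[NextURL]"></a> <br>
</body>
HTML;

// Regex from the answer: rejects the placeholders and any URL
// that contains /i/ or /t/ somewhere after the scheme.
$regex = '/https?:\/\/(?!\[GoTo\]|\[NextURL\]|[^\" ]*\/i\/[^\" ]+|[^\" ]*\/t\/[^\" ]*)[^\" ]+/i';
preg_match_all($regex, $string, $matches);

// Stricter variant (an assumption, not from the answer): rejects only
// host + /i/ or /t/ + at least two further non-empty segments.
$strictRegex = '~https?://(?!\[GoTo\]|\[NextURL\]|[^" /]+/[it]/[^" /]+/[^" /]+)[^" ]+~i';
preg_match_all($strictRegex, $string, $strictMatches);

print_r($matches[0]);
print_r($strictMatches[0]);
// Both print the same three URLs for this input:
// http://domaine.com/text
// http://domaine.com
// http://domaine.com/text/text
```

The two patterns only differ on inputs outside the sample. For example, a URL such as http://domaine.com/text/t/x would be dropped by the first regex but kept by the stricter one.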
