PHP regexp (preg_match_all) - find all standalone links

Question

I have text in the form of:

Txx8xxTT<br><br><br>https://wwww.xxx.com<br><br />
<br />cxyc[link=http://www.example.com]link[/odkaz]
xxx<a href="http://www.example2.com">link2</a>

I want to parse this using preg_match_all so that, in the result array, every standalone link sits at its own index. For the example above, I want something like this:

[0] => Txx8xxTT<br><br><br>
[1] => https://wwww.xxx.com
[2] => <br><br />
    <br />cxyc[link=http://www.example.com]link[/odkaz]
    xxx<a href="http://www.example2.com">link2</a>

(The array can be formatted differently. I don't care about the indices, but I want each standalone link at its own index.)

I have tried preg_match_all with (.[^ \<\[]*). It almost works, but the result at index [3] is <br>https://wwww.xxx.com, and I don't want the <br> prefix.

[0] => Txx8xxTT
[1] => <br>
[2] => <br>
[3] => <br>https://wwww.xxx.com
[4] => <br>
[5] => <br
[6] =>  /> 
[7] => <br
[8] =>  />cxyc
[9] => [link="http://www.example.com"]link
[10] => [/odkaz]xxx
[11] => <a
[12] =>  href="http://www.example2.com">link2
[13] => </a>

Don't use regex. It really isn't suited to this task. https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — delboy1978uk, May 09 '19 at 11:56

score 3 · Answer 1 · answered May 09 '19 at 12:06

Probably best to:

1. Parse your input via an HTML / DOM parser
2. Use DOM / XPath to find your text nodes
3. Extract the URL using regex

An example of 1 and 2 can be found here: https://stackoverflow.com/a/6399988/406712
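
The three steps might be sketched as follows. This is a minimal illustration, not the linked answer's code: the `$html` sample (adapted from the question), the `<div>` wrapper, the simplified URL pattern, and the XPath expression that skips text inside `<a>` elements are all assumptions. DOM parsing only takes care of the `href` case; the custom `[link=...]` markup is plain text to the parser, so the regex step still has to exclude it.

```php
<?php
// Sample input adapted from the question (assumption: wrapped in a <div>
// so the fragment has a single root element).
$html = '<div>Txx8xxTT<br><br><br>https://wwww.xxx.com<br><br />'
      . '<br />cxyc[link=http://www.example.com]link[/odkaz]'
      . 'xxx<a href="http://www.example2.com">link2</a></div>';

// Step 1: parse the input. Collect libxml warnings internally instead of
// printing them, since real-world markup is rarely perfect.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Step 2: select every text node that is not inside an <a> element.
$xpath = new DOMXPath($doc);
$textNodes = $xpath->query('//text()[not(ancestor::a)]');

// Step 3: pull URLs out of each text node. The pattern is a simplified
// stand-in (assumption); the lookbehind skips URLs preceded by "[link=".
$urlPattern = '~(?<!\[link=)\bhttps?://[^\s<>\[\]"]+~i';

$standalone = [];
foreach ($textNodes as $textNode) {
    if (preg_match_all($urlPattern, $textNode->nodeValue, $m)) {
        $standalone = array_merge($standalone, $m[0]);
    }
}

print_r($standalone);
```

Attribute values such as `href="..."` are never text nodes, so they drop out at step 2 without any extra regex work.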

Then for your regex consider a "negative lookbehind" to exclude the link that starts with "[link=":

Use

// $subject holds the text to search (for example, the value of one text node)
preg_match_all('/(?<!\[link=)\bhttps?:\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[A-Z0-9+&@#\/%=~_|]/i', $subject, $result, PREG_PATTERN_ORDER);
foreach ($result[0] as $url) {
    // $url is the matched text
}

Regular Expression

(?<!\[link=)\bhttps?://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]
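
To see what the lookbehind does and does not exclude, here is a small check against the question's sample string. The `$subject` value is copied from the question; running the pattern on the raw markup rather than on extracted text nodes is only for illustration. The `[link=` URL is skipped, but the `href` URL still matches, which is why the DOM step should come first.

```php
<?php
// Raw sample text from the question, tags included.
$subject = 'Txx8xxTT<br><br><br>https://wwww.xxx.com<br><br />'
         . '<br />cxyc[link=http://www.example.com]link[/odkaz]'
         . 'xxx<a href="http://www.example2.com">link2</a>';

// (?<!\[link=) is a negative lookbehind: the URL must not be
// immediately preceded by the literal text "[link=".
$pattern = '/(?<!\[link=)\bhttps?:\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[A-Z0-9+&@#\/%=~_|]/i';

preg_match_all($pattern, $subject, $result, PREG_PATTERN_ORDER);

// $result[0] contains the full matches:
// https://wwww.xxx.com and http://www.example2.com
print_r($result[0]);
```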

PS: If you are going to modify the HTML input, use DOM methods to do that.

delboy1978uk · Answer 2 · 2019-05-09T13:05:50.510

2

See my comment above explaining the horror of parsing HTML with regex. It really isn't the best approach. DOMDocument may be a better idea.

If you just want an array of links, you could try this. I guarantee nothing, however.

#https?:\/\/[a-z1-9\.]+#

This returns:

Match 1
Full match  20-40   https://wwww.xxx.com
Match 2
Full match  67-89   http://www.example.com
Match 3
Full match  115-138 http://www.example2.com

https://regex101.com/r/Sh5CTa/1

UPDATE: Since you don't want the href= or link= URLs, you could try this:

#>(?<link>https?:\/\/[a-z1-9\.]+)<#

It uses a named capture group, so the URLs would be in $matches['link'].
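
A short sketch of how the named group could be read in PHP. The `$text` sample comes from the question and the variable names are my own. Because the pattern requires a `>` immediately before the URL and a `<` immediately after it, it only finds links that sit directly between two tags. The character class `[a-z1-9\.]` as written also omits `0`, uppercase letters, and path characters, so it would need widening for real-world URLs.

```php
<?php
// Raw sample text from the question.
$text = 'Txx8xxTT<br><br><br>https://wwww.xxx.com<br><br />'
      . '<br />cxyc[link=http://www.example.com]link[/odkaz]'
      . 'xxx<a href="http://www.example2.com">link2</a>';

// Pattern copied from the answer; (?<link>...) names the capture group.
preg_match_all('#>(?<link>https?:\/\/[a-z1-9\.]+)<#', $text, $matches);

// $matches[0] holds the full matches including the surrounding > and <,
// while $matches['link'] holds only the captured URLs.
print_r($matches['link']);
```

For this sample, only https://wwww.xxx.com is captured: the `href` URL is preceded by a quote and the `[link=` URL by an equals sign, so neither satisfies the leading `>`.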

https://regex101.com/r/Sh5CTa/2

edited May 09 '19 at 13:05

answered May 09 '19 at 12:01

delboy1978uk


I don't want all links, only the standalone ones, because I need to wrap them with href and other "decoration". My regexp is working as it should; the only problem is that I would like to remove `<br>` from [3]. I can do this with PHP, but it seems that I am missing something in my regex, which would be more elegant than using PHP for the matches. – Martin Perry May 09 '19 at 12:03

what do you mean by standalone? – delboy1978uk May 09 '19 at 12:06
@delboy1978uk standalone link = not inside href or my own [link=] – Martin Perry May 09 '19 at 12:20
