
I have text in the following form:

Txx8xxTT<br><br><br>https://wwww.xxx.com<br><br />
<br />cxyc[link=http://www.example.com]link[/odkaz]
xxx<a href="http://www.example2.com">link2</a>

I want to parse this using preg_match_all so that every standalone link ends up at its own index in the result array. For the example above I want something like this:

[0] => Txx8xxTT<br><br><br>
[1] => https://wwww.xxx.com
[2] => <br><br />
    <br />cxyc[link=http://www.example.com]link[/odkaz]
    xxx<a href="http://www.example2.com">link2</a>

(The array can be formatted differently; I don't care about the indices, but I want each standalone link at its own index.)

I have tried preg_match_all with (.[^ \<\[]*). It almost works, but the result at index [3] is <br>https://wwww.xxx.com, and I don't want the <br> prefix.

[0] => Txx8xxTT
[1] => <br>
[2] => <br>
[3] => <br>https://wwww.xxx.com
[4] => <br>
[5] => <br
[6] =>  /> 
[7] => <br
[8] =>  />cxyc
[9] => [link="http://www.example.com"]link
[10] => [/odkaz]xxx
[11] => <a
[12] =>  href="http://www.example2.com">link2
[13] => </a>
Martin Perry
  • Don't use regex. It really isn't suited to this task. https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – delboy1978uk May 09 '19 at 11:56

2 Answers

Probably best to:

  1. Parse your input via an HTML / DOM parser
  2. Use DOM / XPath to find your text nodes
  3. Extract the URL using regex

An example of 1 and 2 can be found here: https://stackoverflow.com/a/6399988/406712
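The three steps above can be sketched roughly as follows. This is only an illustration, assuming PHP's bundled DOM extension is available; the wrapping `<div>`, the XPath expression, and the variable names are my own choices rather than anything taken from the linked answer. The regex is the one suggested further down in this answer.

```php
<?php
// Sample input from the question.
$html = 'Txx8xxTT<br><br><br>https://wwww.xxx.com<br><br />' . "\n"
      . '<br />cxyc[link=http://www.example.com]link[/odkaz]' . "\n"
      . 'xxx<a href="http://www.example2.com">link2</a>';

// Step 1: parse the fragment. The wrapping <div> is an assumption made here
// so that the fragment has a single root element.
libxml_use_internal_errors(true); // keep libxml warnings about sloppy markup quiet
$doc = new DOMDocument();
$doc->loadHTML('<div>' . $html . '</div>');
libxml_clear_errors();

// Step 2: select every text node that is not inside an <a> element.
$xpath = new DOMXPath($doc);
$textNodes = $xpath->query('//text()[not(ancestor::a)]');

// Step 3: pull URLs out of those text nodes, skipping the [link=...] form.
$pattern = '/(?<!\[link=)\bhttps?:\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[A-Z0-9+&@#\/%=~_|]/i';

$standalone = [];
foreach ($textNodes as $textNode) {
    if (preg_match_all($pattern, $textNode->nodeValue, $m)) {
        $standalone = array_merge($standalone, $m[0]);
    }
}

print_r($standalone);
```

Because href values are attributes rather than text nodes, they are never seen by the regex, so for the sample input only https://wwww.xxx.com should remain.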

Then for your regex consider a "negative lookbehind" to exclude the link that starts with "[link=":

Use

preg_match_all('/(?<!\[link=)\bhttps?:\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[A-Z0-9+&@#\/%=~_|]/i', $subject, $result, PREG_PATTERN_ORDER);
// $result[0] holds the full matches, one entry per URL found.
foreach ($result[0] as $url) {
    // Matched text = $url;
}

Regular Expression

(?<!\[link=)\bhttps?://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]
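If the goal is the exact array layout from the question (surrounding text and standalone links at separate indices), one possible sketch is to split on the pattern instead of matching it. Note two assumptions that go beyond the regex above: the use of preg_split with PREG_SPLIT_DELIM_CAPTURE, and an extra `(?<!href=")` lookbehind, which is needed here because the raw string (not DOM text nodes) is being scanned.

```php
<?php
// Sample input from the question.
$subject = 'Txx8xxTT<br><br><br>https://wwww.xxx.com<br><br />' . "\n"
         . '<br />cxyc[link=http://www.example.com]link[/odkaz]' . "\n"
         . 'xxx<a href="http://www.example2.com">link2</a>';

// Same URL pattern as above, wrapped in a capture group so preg_split keeps
// the delimiter. The second lookbehind, (?<!href="), is an addition for this
// sketch: it skips URLs that are the value of an href attribute.
$pattern = '/((?<!\[link=)(?<!href=")\bhttps?:\/\/'
         . '[-A-Z0-9+&@#\/%?=~_|!:,.;]*[A-Z0-9+&@#\/%=~_|])/i';

// Split the text around each standalone URL and keep the URLs themselves.
$parts = preg_split(
    $pattern,
    $subject,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

print_r($parts);
```

For the sample input this should give three entries: the text before the link, the link itself, and everything after it.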


PS: If you are going to modify the HTML input, use DOM methods to do that.

Dean Taylor

See my comment above explaining the horror of parsing HTML with regex. It really isn't the best approach. DOMDocument may be a better idea.

If you just want an array of links, you could try this. I guarantee nothing, however.

#https?:\/\/[a-z1-9\.]+#

This returns:

Match 1
Full match  20-40   https://wwww.xxx.com
Match 2
Full match  67-89   http://www.example.com
Match 3
Full match  115-138 http://www.example2.com

https://regex101.com/r/Sh5CTa/1

UPDATE: since you don't want href= or link=, you could try this:

#>(?<link>https?:\/\/[a-z1-9\.]+)<#

It uses a named capture group, so the links would be in $matches['link'].

https://regex101.com/r/Sh5CTa/2
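A minimal usage sketch of that pattern in PHP, assuming the sample text from the question is held in a variable called `$text` (a name chosen here for illustration):

```php
<?php
// Sample input from the question.
$text = 'Txx8xxTT<br><br><br>https://wwww.xxx.com<br><br />' . "\n"
      . '<br />cxyc[link=http://www.example.com]link[/odkaz]' . "\n"
      . 'xxx<a href="http://www.example2.com">link2</a>';

// The pattern only matches a URL that sits directly between ">" and "<",
// so the href="..." and [link=...] forms are skipped.
preg_match_all('#>(?<link>https?:\/\/[a-z1-9\.]+)<#', $text, $matches);

// The named group is available under its name (and also as index 1).
$links = $matches['link'];

print_r($links);
```

Be aware that the character class `[a-z1-9\.]` as written excludes the digit 0, uppercase letters, hyphens, and anything after the host name (paths, query strings), so real-world URLs may be cut short or missed, and a link at the very start or end of the text (with no surrounding tag) will not match.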

delboy1978uk
  • I don't want all links, only the standalone ones, because I need to wrap them with href and other "decoration". My regexp is working as it should; the only problem is that I would like to remove `<br>` from [3]. I can do this with PHP, but it seems that I am missing something in my regex which would be more elegant than using PHP on the matches.
    – Martin Perry May 09 '19 at 12:03
  • What do you mean by standalone? – delboy1978uk May 09 '19 at 12:06
  • @delboy1978uk standalone link = not inside href or my own [link=] – Martin Perry May 09 '19 at 12:20