How to get substring between text when the text repeats and includes a new line?

Question

I need to extract the second URL from this string:

$string = '<td class="table_td">   submitted by   <a href="https://www.example.com/account/user" target="_blank" rel="nofollow"> account </a> <br>
 <a href="https://www.URL-I-NEED.com/BKHHZu_A4lu" target="_blank" rel="nofollow">[site]</a>   <a href="https://www.example.com/settings/user/" target="_blank" rel="nofollow">[settings]</a></td>';

I tried this solution, and tried these settings:

$startTag = ' <a href="';
$endTag = '" target';

But it returned the first URL and not the one I need since those tags also appear before the substring I need.

I tried adding the <br> before the newline to $startTag, but it returned no string.

Basically, I need $startTag to be {newline} <a href=", but I can't figure out how to include that newline.

Or maybe I'm thinking about this the wrong way, and there is a simpler approach: extract all the URLs from that string, then select the 2nd one.

Either way, how can I extract the 2nd URL in the string above?
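
Both ideas raised above can be sketched in plain PHP. This is a minimal illustration, not the accepted approach: it assumes the markup always looks like the sample, and the variable names `$afterBr` and `$second` are made up for the example. Note that `\n` is only interpreted as a newline inside double-quoted PHP strings, so the sketch sidesteps the issue with `\s*`, which matches any run of whitespace, including `\n` or `\r\n`.

```php
<?php
// The string from the question, with a literal newline after <br>.
$string = '<td class="table_td">   submitted by   <a href="https://www.example.com/account/user" target="_blank" rel="nofollow"> account </a> <br>
 <a href="https://www.URL-I-NEED.com/BKHHZu_A4lu" target="_blank" rel="nofollow">[site]</a>   <a href="https://www.example.com/settings/user/" target="_blank" rel="nofollow">[settings]</a></td>';

// Idea 1: anchor on the <br> and skip whatever whitespace follows it.
// \s* covers the newline and the leading space before <a href=".
$afterBr = null;
if (preg_match('/<br>\s*<a href="([^"]+)"/', $string, $m)) {
    $afterBr = $m[1];
}

// Idea 2: collect every href value, then pick the second one.
// $all[1] holds the captures of group 1 in document order.
preg_match_all('/<a\s+href="([^"]+)"/', $string, $all);
$second = $all[1][1] ?? null;

echo $afterBr, "\n";
echo $second, "\n";
```

Pattern matching like this is brittle against real-world HTML (attribute order, single quotes, extra attributes before `href`), which is why a DOM parser tends to be the safer choice.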

@anubhava Is there any reasoning for this? I'd love to read about it :) — GrumpyCrouton, Jul 21 '17 at 16:44
[Read this FAQ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — anubhava, Jul 21 '17 at 16:54
@anubhava Thanks, and how would I solve this with the `DOM` parser you are suggesting? — ProgrammerGirl, Jul 21 '17 at 16:58
The second url is always in group 1 - https://regex101.com/r/upLwVm/1 `(?:(?:https?|ftp):\/\/)[\S\s]+?((?:(?:https?|ftp):\/\/)(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{a1}-\x{ffff}0-9]+-?)*[a-z\x{a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{a1}-\x{ffff}0-9]+-?)*[a-z\x{a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{a1}-\x{ffff}]{2,})))|localhost)(?::\d{2,5})?(?:\/[^\s]*)?)` — , Jul 21 '17 at 17:05

score 2 · Accepted Answer · answered Jul 21 '17 at 17:07

You can use a DOM parser, as in this code:

$string = '<td class="table_td">   submitted by
<a href="https://www.example.com/account/user" target="_blank" rel="nofollow"> account </a> <br>
<a href="https://www.URL-I-NEED.com/BKHHZu_A4lu" target="_blank" rel="nofollow">[site]</a>
<a href="https://www.example.com/settings/user/" target="_blank" rel="nofollow">[settings]</a>
</td>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($string); // loads your html
$xpath = new DOMXPath($doc);

// query all <a...> elements
$nodelist = $xpath->query("//a");

// get 2nd element from the list
$node = $nodelist->item(1);

// extract href attribute
$link = $node->getAttribute('href');

echo $link . "\n";
//=> https://www.URL-I-NEED.com/BKHHZu_A4lu


You can also use `DOMXPath::evaluate` to get the string you want: `$link = $xpath->evaluate("string(//td/a[2]/@href)");` — Casimir et Hippolyte, Jul 21 '17 at 18:16
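
A minimal sketch of the `DOMXPath::evaluate` variant follows. It is an illustration under assumptions rather than the commenter's exact code: it uses the expression `(//a)[2]/@href`, which selects the second `<a>` in document order regardless of its parent, so it does not depend on how the parser treats a `<td>` that appears outside a table.

```php
<?php
$string = '<td class="table_td">   submitted by
<a href="https://www.example.com/account/user" target="_blank" rel="nofollow"> account </a> <br>
<a href="https://www.URL-I-NEED.com/BKHHZu_A4lu" target="_blank" rel="nofollow">[site]</a>
<a href="https://www.example.com/settings/user/" target="_blank" rel="nofollow">[settings]</a>
</td>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($string);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// string(...) makes evaluate() return a plain PHP string;
// the result is an empty string if there is no second <a>.
$link = $xpath->evaluate('string((//a)[2]/@href)');

echo $link . "\n";
```

Because `evaluate` returns a string here, there is no node object to null-check; testing `$link === ''` is enough to detect a missing second link.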
