I need to extract the second URL from this string:

$string = '<td class="table_td">   submitted by   <a href="https://www.example.com/account/user" target="_blank" rel="nofollow"> account </a> <br>
 <a href="https://www.URL-I-NEED.com/BKHHZu_A4lu" target="_blank" rel="nofollow">[site]</a>   <a href="https://www.example.com/settings/user/" target="_blank" rel="nofollow">[settings]</a></td>';

I tried this solution, and tried these settings:

$startTag = ' <a href="';
$endTag = '" target';

But it returned the first URL and not the one I need since those tags also appear before the substring I need.

I tried adding the <br> before the newline to $startTag, but it returned no string.

Basically, I need $startTag to be {newline} <a href=", but I can't figure out how to include that newline.

Or maybe I'm approaching this the wrong way, and there is a simpler way: extract all the URLs from that string and then select the 2nd one.

Either way, how can I extract the 2nd URL in the string above?

ProgrammerGirl
  • Avoid regex for HTML parsing. Use `DOM` parser. – anubhava Jul 21 '17 at 16:33
  • The newline character in regex is `\n`. – RToyo Jul 21 '17 at 16:36
  • @anubhava Is there any reasoning for this? I'd love to read about it :) – GrumpyCrouton Jul 21 '17 at 16:44
  • [Read this FAQ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – anubhava Jul 21 '17 at 16:54
  • @anubhava Thanks, and how would I solve this with the `DOM` parser you are suggesting? – ProgrammerGirl Jul 21 '17 at 16:58
  • The second url is always in group 1 - https://regex101.com/r/upLwVm/1 `(?:(?:https?|ftp):\/\/)[\S\s]+?((?:(?:https?|ftp):\/\/)(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{a1}-\x{ffff}0-9]+-?)*[a-z\x{a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{a1}-\x{ffff}0-9]+-?)*[a-z\x{a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{a1}-\x{ffff}]{2,})))|localhost)(?::\d{2,5})?(?:\/[^\s]*)?)` –  Jul 21 '17 at 17:05

1 Answer

You can use a DOM parser, as in this code:

$string = '<td class="table_td">   submitted by
<a href="https://www.example.com/account/user" target="_blank" rel="nofollow"> account </a> <br>
<a href="https://www.URL-I-NEED.com/BKHHZu_A4lu" target="_blank" rel="nofollow">[site]</a>
<a href="https://www.example.com/settings/user/" target="_blank" rel="nofollow">[settings]</a>
</td>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect libxml warnings about the HTML fragment instead of printing them
$doc->loadHTML($string); // loads your html
$xpath = new DOMXPath($doc);

// query all <a...> elements
$nodelist = $xpath->query("//a");

// get 2nd element from the list (item() is zero-based and returns null if there is no such element)
$node = $nodelist->item(1);

// extract href attribute
$link = $node->getAttribute('href');

echo $link . "\n";
//=> https://www.URL-I-NEED.com/BKHHZu_A4lu
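
If the number of links in the markup can vary, the same idea can be wrapped in a small function that returns null instead of failing when the requested link does not exist. This is only a sketch: the function name nthLinkHref and its 1-based $n parameter are made up for illustration and are not part of the code above, and the sample string is a shortened version of the one in the question.

```php
<?php
// Hypothetical helper (not from the original answer): returns the href of the
// n-th <a> element (1-based) in an HTML fragment, or null if there is none.
function nthLinkHref(string $html, int $n): ?string
{
    $doc = new DOMDocument();
    // Collect libxml warnings about the fragment instead of printing them,
    // then restore the previous error-handling setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Only consider anchors that actually carry an href attribute.
    $nodes = $xpath->query('//a[@href]');

    // item() returns null when the index is out of range.
    $node = $nodes->item($n - 1);
    return $node instanceof DOMElement ? $node->getAttribute('href') : null;
}

$string = '<td class="table_td"> submitted by <a href="https://www.example.com/account/user"> account </a> <br>
 <a href="https://www.URL-I-NEED.com/BKHHZu_A4lu">[site]</a> <a href="https://www.example.com/settings/user/">[settings]</a></td>';

$second  = nthLinkHref($string, 2);   // the link asked about in the question
$missing = nthLinkHref($string, 10);  // there is no 10th link, so this is null

echo $second, "\n";
```

Checking the result against null before using it avoids calling getAttribute() on a missing node when the page layout changes.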

anubhava