I want to grab the URL with the highest pg value:

$html ='
    <a href="http://example.com/?pg=1"></a>
    <a href="http://example.com/?pg=2"></a>
    <a href="http://example.com/?pg=3"></a>
';

I use this regex to locate the appropriate links:

preg_match_all('/<a.*href="\.\/\?pg=(\d+)".*>(?:.*)<\/a>/U', $html, $preg_matches);

Sometimes, the links include another parameter:

http://example.com/?pg=3&test=1

My question is: how do I adjust my regex so that links with additional parameters are matched as well?

Cœur
Henrik Petterson
  • You have already asked it [here](https://stackoverflow.com/questions/51945868/grab-parameter-in-html-with-highest-value), isn't that the same question? – Wiktor Stribiżew Aug 27 '18 at 14:11
  • @WiktorStribiżew No, this question is targeting URLs with multiple parameters. – Henrik Petterson Aug 27 '18 at 14:12
  • `\.` matches a dot. You must match any chars other than `"` with `[^"]`. – Wiktor Stribiżew Aug 27 '18 at 14:12
  • @WiktorStribiżew It does not include URLs with multiple parameters. Try adding `a` to the `$html` variable and you will see. – Henrik Petterson Aug 27 '18 at 14:13
  • @WiktorStribiżew Can you please post an answer to demonstrate this? Thanks. – Henrik Petterson Aug 27 '18 at 14:15
  • A regex solution is not actually recommended. If the current one is that difficult for you to fix, why use regex by all means? You might use something [like this](https://regex101.com/r/3OihCO/1), but it will still fail to work in some cases although it should work in a lot more cases than your current one. Using a DOM parser is the best solution for such scenarios. XPath can also be coupled with regex if needed, but that does not seem necessary in your case. – Wiktor Stribiżew Aug 27 '18 at 14:19
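
To illustrate the `[^"]` suggestion from the comments, here is a minimal sketch of an adjusted pattern. The helper name `highestPgByRegex` is hypothetical (it does not appear in the thread), and the pattern assumes double-quoted `href` attributes; as noted above, a regex like this will still miss some valid markup.

```php
<?php
// Hypothetical helper (not from the original post): collects every pg value
// found in the href attributes of <a> tags and returns the largest one.
function highestPgByRegex(string $html): ?int
{
    // [^"]* matches any run of characters other than a double quote, so extra
    // parameters before or after pg (e.g. &test=1 or &amp;test=1) are allowed.
    $pattern = '/<a\b[^>]*\bhref="[^"]*(?:[?&]|&amp;)pg=(\d+)[^"]*"/i';
    if (!preg_match_all($pattern, $html, $matches)) {
        return null; // no link carries a pg parameter
    }
    // $matches[1] holds the captured digits of each pg value as strings.
    return max(array_map('intval', $matches[1]));
}
```

Called on the `$html` from the question with a `?pg=3&test=1` link added, this should pick up that link as well, since the pattern no longer requires the closing quote to follow the digits directly.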

2 Answers

  1. Use a DOM parser to find the anchors.
  2. Use parse_url to extract the query string from each URL.
  3. Use parse_str to split the query string into its parameters.

Example:

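$html = '
    <a href="http://example.com/?pg=1"></a>
    <a href="http://example.com/?pg=2"></a>
    <a href="http://example.com/?pg=3"></a>
';

// $html must be defined before it is loaded into the document.
$dom = new DOMDocument;
$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');

foreach ($anchors as $anchor) {
    $url = $anchor->getAttribute('href');
    // parse_url() returns null when the URL has no query string.
    $query = parse_url($url, PHP_URL_QUERY);
    parse_str((string) $query, $output);
    if (!isset($output['pg'])) {
        continue; // skip links without a pg parameter
    }
    $pg = (int) $output['pg'];
    // do something
}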

Here's a helpful tutorial on parsing HTML in PHP: http://htmlparsing.com/php.html

Also see why you should not use regex to parse HTML: https://stackoverflow.com/a/1732454/81785
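
To get from the loop above to the URL with the highest pg, one possible sketch follows. The function name `urlWithHighestPg` is hypothetical, and it assumes pg is always a non-negative integer; links without a usable pg are skipped.

```php
<?php
// Hypothetical helper: returns the href whose pg query parameter is largest,
// or null when no anchor carries a numeric pg parameter.
function urlWithHighestPg(string $html): ?string
{
    $dom = new DOMDocument;
    // Real-world markup is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bestUrl = null;
    $bestPg = null;
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $url = $anchor->getAttribute('href');
        // parse_url() yields null or false when there is no query; cast to '' in that case.
        parse_str((string) parse_url($url, PHP_URL_QUERY), $params);
        if (!isset($params['pg']) || !is_string($params['pg']) || !ctype_digit($params['pg'])) {
            continue; // no pg, or pg is not a plain run of digits
        }
        $pg = (int) $params['pg'];
        if ($bestPg === null || $pg > $bestPg) {
            $bestPg = $pg;
            $bestUrl = $url;
        }
    }
    return $bestUrl;
}
```

Because the query string is parsed rather than pattern-matched, extra parameters such as test=1 need no special handling.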

Moak

$html = '
    <a href="http://example.com/?pg=1"></a>
    <a href="http://example.com/?pg=2"></a>
    <a href="http://example.com/?pg=4&test=1"></a>
';
// Capture the href of every anchor; the query string is parsed separately below.
preg_match_all('/<a[^>]+href="(.*?)"[^>]*>(.*?)<\/a>/', $html, $out);

$result = [];
foreach ($out[1] as $link) {
    parse_str((string) parse_url($link, PHP_URL_QUERY), $atr);
    if (isset($atr['pg'])) {
        $result[$link] = $atr['pg'];
    }
}

print_r($result);

// "http://example.com/?pg=1" => "1"
// "http://example.com/?pg=2" => "2"
// "http://example.com/?pg=4&test=1" => "4"
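
The snippet above collects the pg value of every link but stops short of picking the highest one. A minimal sketch of that last step, assuming an array shaped like `$result` (URL as key, pg as value); the function name `highestPgUrl` is hypothetical:

```php
<?php
// Hypothetical helper: given a map of URL => pg value (the shape of $result
// above), returns the URL with the numerically largest pg, or null if empty.
function highestPgUrl(array $pgByUrl): ?string
{
    if ($pgByUrl === []) {
        return null;
    }
    // parse_str() yields strings; convert so the values are compared as integers.
    $asInts = array_map('intval', $pgByUrl);
    arsort($asInts); // sort descending by value, keeping the URL keys
    return (string) array_key_first($asInts);
}
```

With the three links from this answer, `highestPgUrl($result)` would return the `?pg=4&test=1` URL. `array_key_first` requires PHP 7.3 or later.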
TsV