Targeting URLs with parameters

Question

I want to grab the URL with the highest pg value:

$html ='
    <a href="http://example.com/?pg=1"></a>
    <a href="http://example.com/?pg=2"></a>
    <a href="http://example.com/?pg=3"></a>
';

I use this regex to locate the appropriate links:

preg_match_all('/<a.*href="\.\/\?pg=(\d+)".*>(?:.*)<\/a>/U', $html, $preg_matches);

Sometimes, the links include another parameter:

http://example.com/?pg=3&test=1

How do I adjust my regex so that links with additional parameters are matched as well?

You have already asked it [here](https://stackoverflow.com/questions/51945868/grab-parameter-in-html-with-highest-value), isn't that the same question? — Wiktor Stribiżew, Aug 27 '18 at 14:11
@WiktorStribiżew No, this question is targeting URLs with multiple parameters. — Henrik Petterson, Aug 27 '18 at 14:12
`\.` matches a dot. You must match any chars other than `"` with `[^"]`. — Wiktor Stribiżew, Aug 27 '18 at 14:12
@WiktorStribiżew It does not include URLs with multiple parameters. Try adding `a` to the `$html` variable and you will see. — Henrik Petterson, Aug 27 '18 at 14:13
@WiktorStribiżew Can you please post an answer to demonstrate this? Thanks. — Henrik Petterson, Aug 27 '18 at 14:15
A regex solution is not really recommended here. If the current one is that difficult for you to fix, why insist on regex? You might use something [like this](https://regex101.com/r/3OihCO/1), but it will still fail in some cases, although it should cover many more cases than your current one. Using a DOM parser is the best solution for such scenarios. XPath can also be combined with regex if needed, but that does not seem necessary in your case. — Wiktor Stribiżew, Aug 27 '18 at 14:19
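
For reference, the `[^"]` suggestion from the comments can be sketched roughly as follows. This is only an illustration under a few assumptions: the href attribute is double-quoted, `pg` is a plain query parameter, and the names `$pattern`, `$pgValues` and `$highestPg` are made up here rather than taken from the question. As the last comment points out, a pattern like this will still miss some valid markup.

```php
<?php
// Sample input; the third link carries the extra "test" parameter from the question.
$html = '
    <a href="http://example.com/?pg=1"></a>
    <a href="http://example.com/?pg=2"></a>
    <a href="http://example.com/?pg=3&test=1"></a>
';

// [^"]* consumes any characters inside the quoted attribute value, so the
// part before "pg=" and any parameters after the digits are both allowed.
// [?&] requires "pg" to start a parameter name (so "xpg=" does not match).
$pattern = '/<a\b[^>]*\bhref="[^"]*[?&]pg=(\d+)[^"]*"[^>]*>/i';

preg_match_all($pattern, $html, $pregMatches);

// $pregMatches[1] holds the captured pg values as strings.
$pgValues = array_map('intval', $pregMatches[1]);
$highestPg = $pgValues ? max($pgValues) : null;

echo $highestPg, PHP_EOL;
```

Matching only the opening tag keeps the pattern short, and excluding `"` and `>` from the wildcards avoids needing the `/U` modifier.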

Moak · Accepted Answer · 2018-08-27T14:20:39.323

Use a DOM parser to find the anchors.
Use parse_url to extract the query string from each URL.
Use parse_str to split that query string into its individual values.

Example:

$html ='
    <a href="http://example.com/?pg=1"></a>
    <a href="http://example.com/?pg=2"></a>
    <a href="http://example.com/?pg=3"></a>
';

// Load the markup only after $html has been defined
$dom = new DOMDocument;
$dom->loadHTML($html);

$anchors = $dom->getElementsByTagName('a');

foreach ($anchors as $anchor) {
    $url = $anchor->getAttribute('href');
    // Query string part of the URL, e.g. "pg=3&test=1"
    $query = parse_url($url, PHP_URL_QUERY);
    parse_str((string) $query, $output);
    // Not every link necessarily carries a pg parameter
    $pg = $output['pg'] ?? null;
    //do something
}
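
The loop above leaves the "do something" step open. One possible way to finish it and return the link with the highest pg is sketched below. The function name `findHighestPgUrl` and its variables are hypothetical, not part of the original answer, and the sketch assumes pg is always a non-negative integer.

```php
<?php
// Hypothetical helper: returns the href with the largest numeric "pg"
// query parameter, or null when no link carries one.
function findHighestPgUrl(string $html): ?string
{
    if (trim($html) === '') {
        return null; // DOMDocument::loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them,
    // then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bestUrl = null;
    $bestPg = null;

    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $url = $anchor->getAttribute('href');
        $query = parse_url($url, PHP_URL_QUERY);
        if (!is_string($query)) {
            continue; // no query string, or an unparsable URL
        }
        parse_str($query, $params);
        $pg = $params['pg'] ?? null;
        if (!is_string($pg) || !ctype_digit($pg)) {
            continue; // skip links without a numeric pg
        }
        if ($bestPg === null || (int) $pg > $bestPg) {
            $bestPg = (int) $pg;
            $bestUrl = $url;
        }
    }

    return $bestUrl;
}

$html = '
    <a href="http://example.com/?pg=1"></a>
    <a href="http://example.com/?pg=3&test=1"></a>
    <a href="http://example.com/?pg=2"></a>
';

echo findHighestPgUrl($html), PHP_EOL;
```

Because the query string is parsed rather than pattern-matched, extra parameters such as `test=1` need no special handling, whatever their order.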

A helpful tutorial on parsing HTML in PHP: http://htmlparsing.com/php.html

See also why you should not use regex to parse HTML: https://stackoverflow.com/a/1732454/81785

score 0 · Answer 2 · answered Aug 27 '18 at 14:24

$html ='
    <a href="http://example.com/?pg=1"></a>
    <a href="http://example.com/?pg=2"></a>
    <a href="http://example.com/?pg=4&test=1"></a>
';
// Capture the whole href value, whatever parameters it contains
preg_match_all('/<a[^>]+href="(.*?)"[^>]*>(.*?)<\/a>/', $html, $out);

$result = [];
foreach ($out[1] as $link) {
    // Split the query string into an array of parameters
    parse_str((string) parse_url($link, PHP_URL_QUERY), $atr);
    if (isset($atr['pg'])) {
        $result[$link] = $atr['pg'];
    }
}

print_r($result);

// "http://example.com/?pg=1" => "1"
// "http://example.com/?pg=2" => "2"
// "http://example.com/?pg=4&test=1" => "4"
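
The snippet above stops at a map of URL to pg value. A minimal sketch of picking the URL with the highest pg from such a map could look like this. The names `$pgByUrl` and `$highestUrl` are illustrative only, and `$result` is hard-coded to mirror the output listed above.

```php
<?php
// Map of URL => pg value, shaped like the $result array printed above.
$result = [
    'http://example.com/?pg=1' => '1',
    'http://example.com/?pg=2' => '2',
    'http://example.com/?pg=4&test=1' => '4',
];

// Cast to int so the values are compared as numbers, not as strings.
$pgByUrl = array_map('intval', $result);

// Sort descending by value while keeping the URL keys, then take the first key.
arsort($pgByUrl);
$highestUrl = array_key_first($pgByUrl); // array_key_first() needs PHP 7.3+

echo $highestUrl, PHP_EOL;
```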
