How do I scrape multiple lines of the page source using cURL and preg_match_all?

Question

Hey, I'm trying to scrape a specific value from a website. The relevant markup looks like this:

<td><a href="javascript:void(0)" class="rankRow"
       data-rankkey="25">
        Averages
    </a>
</td>
<td class="page_speed_602217763">
    82.84    </td>
</tr>

I'm trying to get the number 82.84. The number in the page_speed_* class varies, and the only constant that distinguishes this row from the rest of the page source is the text "Averages".

I have tried using preg_match_all, but I can't seem to match across more than one line and whatever is in between.

The code I have used is the following:

<form method="post">
<input type="text" name="Player1Link" placeholder="Player 1"> <br>
</form>

<?php
$Player1Link = $_POST["Player1Link"];

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $Player1Link);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$curlresult = curl_exec($curl);

$pattern = '!data-rankkey="25">[\s]*Averages[\s]*<\/a>[\s]*<\/td>[\s]*<td[^\s]*?class="page_speed_([\d]*)">[\s]*([\d]*.[\d]*)[\s]*</td>[\s]*<\/tr>!';
preg_match_all($pattern, $curlresult, $matches);
print_r($matches);

$P1AvgHigh = $matches[0][3];
echo "<br>";
echo $P1AvgHigh;
curl_close($curl);
?>

The website I'm using is https://app.dartsorakel.com/player/stats/8 and its page source is at view-source:https://app.dartsorakel.com/player/stats/8

Thanks in advance

score 1 · Answer 1 · answered Aug 24 '23 at 08:45

Firstly, your class declaration is incomplete and you've missed the contents of the second td ... maybe this is an incomplete copy from your code? You also need to take into account the whitespace between and within every element.

This is my regex, which seems to work (but might need tweaking depending on your precise requirements and the possible values in the content):

data-rankkey="25">[\s]*Averages[\s]*<\/a>[\s]*<\/td>[\s]*<td class="page_speed_([\d]*)">[\s]*([\d]*\.[\d]*)[\s]*<\/td>[\s]*<\/tr>

I've escaped the forward slashes, which may not be necessary for you.

For future reference, https://www.regexpal.com/ is a good tool for playing around with regular expressions.
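As a minimal sketch of how a whitespace-tolerant pattern like this could be applied in PHP, assuming the page contains markup shaped like the fragment in the question. The pattern below is a slightly tightened variant rather than the exact regex above: it uses `~` as the delimiter so the forward slashes need no escaping, and it captures only the value. The variable names are illustrative.

```php
<?php
// Sample markup copied from the question; a real page would come from curl_exec().
$html = <<<'HTML'
<tr>
  <td><a href="javascript:void(0)" class="rankRow"
         data-rankkey="25">
          Averages
      </a>
  </td>
  <td class="page_speed_602217763">
      82.84      </td>
</tr>
HTML;

// \s* matches newlines as well as spaces, so the pattern can span several lines.
$pattern = '~data-rankkey="25">\s*Averages\s*</a>\s*</td>\s*'
         . '<td\s+class="page_speed_\d+">\s*(\d+\.\d+)\s*</td>~';

if (preg_match($pattern, $html, $m)) {
    // $m[0] is the whole match, $m[1] is the first capture group (the value).
    $average = $m[1];
    echo $average, "\n";
} else {
    $average = null;
    echo "Not found.\n";
}
```

Because only one row is expected, `preg_match` is enough here; `preg_match_all` would return the same value wrapped in one more array level.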

timchessish

Thanks, although I can't quite seem to get it working. I'm going to update my question with my current code. – Ko1ind Aug 24 '23 at 10:09
in yours ` – timchessish Aug 24 '23 at 10:27
So like this: data-rankkey="25">[\s]*Averages[\s]*<\/a>[\s]*<\/td>[\s]*[\s]*([\d]*.[\d]*)[\s]*[\s]*<\/tr> Because for me it unfortunately still does not work. – Ko1ind Aug 24 '23 at 10:48
All the "\s"s need a * after them and, to be safe and as a good habit, I would escape all of the forward slashes. – timchessish Aug 24 '23 at 11:34
Okay, thank you so much :-) – Ko1ind Aug 24 '23 at 12:35

score 1 · Answer 2 · answered Aug 24 '23 at 08:56

You can simplify your regex. A big regex is always harder to maintain, especially if you later scrape another website:

$pattern = '/class="page_speed_\d+">\s*(\d+\.\d+)\s*/';
if (preg_match_all($pattern, $curlresult, $matches)) {
    $numbers = $matches[1];
    
    foreach ($numbers as $number) {
        echo $number . "\n";
    }
} else {
    echo "Not found.";
}
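A small sketch of what this shorter pattern returns when several rows carry a `page_speed_*` cell, and of how `preg_match_all` lays out its result. The two-row fragment is hypothetical (only the "Averages" row comes from the question), so the extra label and value are assumptions made for illustration.

```php
<?php
// Hypothetical two-row fragment; the first row is invented for the example.
$html = <<<'HTML'
<tr><td><a class="rankRow" data-rankkey="24"> Legs won </a></td>
    <td class="page_speed_111"> 61.50 </td></tr>
<tr><td><a class="rankRow" data-rankkey="25"> Averages </a></td>
    <td class="page_speed_602217763"> 82.84 </td></tr>
HTML;

$pattern = '/class="page_speed_\d+">\s*(\d+\.\d+)\s*/';
$count = preg_match_all($pattern, $html, $matches);

// Default order (PREG_PATTERN_ORDER): $matches[0] lists the full matches,
// $matches[1] lists what capture group 1 caught in each of those matches.
print_r($matches[1]);

// The pattern knows nothing about "Averages", so choosing a value here
// depends on row order, which is fragile if the page layout changes.
$lastValue = $matches[1][$count - 1];
echo $lastValue, "\n";
```

The first index of `$matches` is the capture group and the second is the match number, so the value from the first match is `$matches[1][0]`.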

hanshenrik · Accepted Answer · 2023-08-24T13:44:17.460

As a wise man once said:

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.

Parsing HTML with Regex is nearly always a bad idea.

Use a proper HTML parser instead, for example DOMDocument queried with XPath: //td[contains(@class, 'page_speed_')]

sample:

$html='  <td><a href="javascript:void(0)" class="rankRow"
                                                                           data-rankkey="25">
                                                                                    Averages
                                                                            </a>
                                                                    </td>
                                                                    <td class="page_speed_602217763">
                                                                            82.84                                                                        </td>
                                                            </tr
>';
$domd = new DOMDocument();
@$domd->loadHTML($html);
$xp = new DOMXPath($domd);
$page_speed = $xp->query("//td[contains(@class, 'page_speed_')]")->item(0)->textContent;
$page_speed = trim($page_speed);
var_dump($page_speed);

dumps:

string(5) "82.84"

3v4l link: https://3v4l.org/3JAKR

Thank you! How would I go about copying the HTML code automatically from the link? – Ko1ind, Aug 26 '23 at 13:27
Thank you! Works like a charm. Just one more question: I want to scrape one more thing, which is not at the same item position for the different players. I'm going to add a comment under the original post with more information. Hope you will take a look. – Ko1ind, Aug 28 '23 at 08:58
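The two follow-up comments ask how to get the HTML from the link automatically and how to reach a value that is not at the same position for every player. One possible approach, sketched below, is to download the page with cURL and select the cell by its row label instead of by index. The function names `fetchHtml` and `extractStat` are hypothetical, and the sketch assumes the value is present in the HTML the server returns (not filled in later by JavaScript) and that the label sits in an `<a>` inside the preceding `<td>`, as in the fragment from the question.

```php
<?php
// Download a page and return its HTML, or throw on a transport error.
function fetchHtml(string $url): string
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $html = curl_exec($curl);
    if ($html === false) {
        throw new RuntimeException('cURL error: ' . curl_error($curl));
    }
    return $html;
}

// Return the text of the page_speed_* cell that follows the given row label,
// or null when no such row exists. $label must not contain a single quote.
function extractStat(string $html, string $label): ?string
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // "@" silences warnings about sloppy real-world markup
    $xpath = new DOMXPath($dom);
    $query = sprintf(
        "//td[a[normalize-space()='%s']]"
        . "/following-sibling::td[contains(@class, 'page_speed_')][1]",
        $label
    );
    $cell = $xpath->query($query)->item(0);
    return $cell === null ? null : trim($cell->textContent);
}

// Usage (performs a network request):
// echo extractStat(fetchHtml('https://app.dartsorakel.com/player/stats/8'), 'Averages');
```

Selecting by label means the query keeps working when the row moves, as long as the label text itself stays the same.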
