
Hey, I'm trying to scrape one specific value from a website. The relevant HTML looks like this:

  <td><a href="javascript:void(0)" class="rankRow"
                                                                           data-rankkey="25">
                                                                                    Averages
                                                                            </a>
                                                                    </td>
                                                                    <td class="page_speed_602217763">
                                                                            82.84                                                                        </td>
                                                            </tr>

I'm trying to get the number 82.84. The number in the page_speed_* class varies, and the only constant that distinguishes this row from the rest of the page source is the text "Averages".

I have tried using preg_match_all, but I can't seem to match across more than one line and whatever is in between.

The code I have used is the following:

<form method="post">
<input type="text" name="Player1Link" placeholder="Player 1"> <br>
</form>

<?php
$Player1Link = $_POST["Player1Link"];

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $Player1Link);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$curlresult = curl_exec($curl);
$pattern = '!data-rankkey="25">[\s]*Averages[\s]*<\/a>[\s]*<\/td>[\s]*<td[^\s]*?class="page_speed_([\d]*)">[\s]*([\d]*.[\d]*)[\s]*</td>[\s]*<\/tr>!';
preg_match_all($pattern, $curlresult, $matches);
print_r($matches);

$P1AvgHigh = $matches[0][3];
echo "<br>";
echo $P1AvgHigh;
curl_close($curl);
?>

My results are in the attached screenshot. The website I'm using is

https://app.dartsorakel.com/player/stats/8 and its page source is at view-source:https://app.dartsorakel.com/player/stats/8

Thanks in advance

Ko1ind

3 Answers


Firstly, your class declaration is incomplete and you've missed the contents of the second td. Maybe this is an incomplete copy of your code? You also need to take into account the whitespace between and within every element.

This is my regex, which seems to work (but might need tweaking depending on your precise requirements and the possible values in the content):

data-rankkey="25">[\s]*Averages[\s]*<\/a>[\s]*<\/td>[\s]*<td class="page_speed_([\d]*)">[\s]*([\d]*\.[\d]*)[\s]*<\/td>[\s]*<\/tr>

I've escaped the forward slashes, which is only necessary if you use / as the pattern delimiter; with a delimiter such as ! they can be left unescaped.

For future reference, https://www.regexpal.com/ is a good tool for experimenting with regular expressions.
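
Here is a minimal sketch of how a pattern like this could be applied in PHP. It assumes the page source is already in a string; the $html sample and the variable names are illustrative only. It uses ! as the delimiter so the forward slashes need no escaping, and it reads the number from capture group 2, since index 0 holds the whole match.

```php
<?php
// Sample input standing in for the fetched page source
// (assumption: same layout as the snippet in the question).
$html = <<<'HTML'
<tr>
  <td><a href="javascript:void(0)" class="rankRow"
         data-rankkey="25">
          Averages
      </a>
  </td>
  <td class="page_speed_602217763">
      82.84      </td>
</tr>
HTML;

// \s already matches newlines, so no extra flag is needed to span lines.
$pattern = '!data-rankkey="25">\s*Averages\s*</a>\s*</td>\s*'
         . '<td class="page_speed_(\d+)">\s*(\d+\.\d+)\s*</td>\s*</tr>!';

$average = null;
if (preg_match($pattern, $html, $m)) {
    // $m[0] is the full match, $m[1] the class suffix, $m[2] the number.
    $average = $m[2];
}

echo $average ?? 'Not found', "\n";
```

preg_match is enough here because only one row is expected; preg_match_all would put the same value at $matches[2][0].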

timchessish

You can simplify your regex. A big regex is always harder to maintain, especially if you later scrape another website:

$pattern = '/class="page_speed_\d+">\s*(\d+\.\d+)\s*/';
if (preg_match_all($pattern, $curlresult, $matches)) {
    $numbers = $matches[1];
    
    foreach ($numbers as $number) {
        echo $number . "\n";
    }
} else {
    echo "Not found.";
}
Vincent Decaux

As a wise man once said:

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.

Parsing HTML with Regex is nearly always a bad idea.

Use a proper HTML parser instead, for example DOMDocument with an XPath query: //td[contains(@class, 'page_speed_')]

sample:

$html='  <td><a href="javascript:void(0)" class="rankRow"
                                                                           data-rankkey="25">
                                                                                    Averages
                                                                            </a>
                                                                    </td>
                                                                    <td class="page_speed_602217763">
                                                                            82.84                                                                        </td>
                                                            </tr>';
$domd = new DOMDocument();
@$domd->loadHTML($html);
$xp = new DOMXPath($domd);
$page_speed = $xp->query("//td[contains(@class, 'page_speed_')]")->item(0)->textContent;
$page_speed = trim($page_speed);
var_dump($page_speed);

dumps:

string(5) "82.84"
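
If the page has several page_speed_ cells, the query could be anchored on the "Averages" label instead of taking the first hit. This is only a sketch: the "Other stat" row is made up for illustration, and it assumes the label sits in an a element inside the td directly before the value cell, as in the question's snippet.

```php
<?php
// Sample markup; the first row is hypothetical, the second mirrors the question.
$html = <<<'HTML'
<table>
<tr>
  <td><a href="javascript:void(0)" class="rankRow" data-rankkey="24">Other stat</a></td>
  <td class="page_speed_111111111">12.34</td>
</tr>
<tr>
  <td><a href="javascript:void(0)" class="rankRow" data-rankkey="25">
        Averages
      </a></td>
  <td class="page_speed_602217763">
        82.84      </td>
</tr>
</table>
HTML;

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Find the td whose link text is "Averages", then take the next td in that row.
$query = "//td[a[normalize-space(.)='Averages']]/following-sibling::td[1]";
$nodes = $xpath->query($query);

$average = ($nodes !== false && $nodes->length > 0)
    ? trim($nodes->item(0)->textContent)
    : null;

var_dump($average);
```

With the sample above this dumps string(5) "82.84", skipping the 12.34 cell in the other row.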
hanshenrik
  • Thank you! How would I go about copying the HTML code automatically from the link? – Ko1ind Aug 26 '23 at 13:27
  • @Ko1ind Oh yes, use curl to get the HTML. – hanshenrik Aug 27 '23 at 16:58
  • Thank you! Works like a charm. Just one more question: I want to scrape one more thing, which is not in the same item position for the different players. I'm going to add a comment under the original post with more information. Hope you will take a look – Ko1ind Aug 28 '23 at 08:58
  • I posted it as an answer so I could be more specific – Ko1ind Aug 28 '23 at 09:08
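
The curl suggestion from the comments could be combined with the XPath query roughly like this. It is a sketch under assumptions: fetchHtml and firstPageSpeed are hypothetical helper names, the timeout is an arbitrary choice, and unlike the question's code it leaves SSL verification at its default.

```php
<?php
// Fetch a page with curl and return the body, or throw on failure.
function fetchHtml(string $url): string
{
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 15,   // seconds; arbitrary choice
    ]);
    $body = curl_exec($curl);
    if ($body === false) {
        throw new RuntimeException('curl failed: ' . curl_error($curl));
    }
    return $body;
}

// Return the trimmed text of the first td whose class contains
// "page_speed_", or null if there is none.
function firstPageSpeed(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML rejects an empty string
    }
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $node = (new DOMXPath($dom))
        ->query("//td[contains(@class, 'page_speed_')]")
        ->item(0);

    return $node === null ? null : trim($node->textContent);
}

// Usage (performs a network request):
// echo firstPageSpeed(fetchHtml('https://app.dartsorakel.com/player/stats/8'));
```

Keeping the fetch and the parsing in separate functions means the parsing part can be tried on a saved copy of the page without hitting the site each time.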