I would like to extract the number that sits between the span HTML tags. The number may change.

<span class="topic-count">
  ::before
  "
             24
          "
  ::after
</span>

I've tried the following code:

preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]); 

But it doesn't work.

Entire code:

$result=array();
$page = 201;
while ($page>=1) {
    $source = file_get_contents ("http://www.jeuxvideo.com/forums/0-27047-0-1-0-".$page."-0-counter-strike-global-offensive.htm");
    preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);
    $result = array_merge($result, $nombre[$i][1]);
    print("Page : ".$page ."\n");
    $page-=25;
}
print_r ($nombre); 
Jason Aller
Diamonds
  • Do not use regex for HTML parsing! First get your span value, then use a regex on it. – Random Feb 09 '17 at 09:29
  • Add the s modifier, so the dot also matches line breaks. Edit: +1 what Random said. ;) – Constantin Groß Feb 09 '17 at 09:29
  • Also, if you only want a number, match for \d+ – Gordon Feb 09 '17 at 09:32
  • There are a lot of `topic-count` elements. Do you want to get all of them? – Teymur Mardaliyer Lennon Feb 09 '17 at 09:34
  • The s modifier works; I will try \d+ to get only the number and not the spaces. And Teymur, it depends on how many pages I parse. There are 25 "topic-count" elements per page! – Diamonds Feb 09 '17 at 09:37
  • I've tried (\d+), but it doesn't work; it returns an empty array. – Diamonds Feb 09 '17 at 09:43
  • @Diamonds I don't quite understand. You want to get `topic-count`, don't you? There are a lot of elements with `topic-count` on every page. When you run your code, you get the first-row `topic-count` as `NB`. – Teymur Mardaliyer Lennon Feb 09 '17 at 09:44
  • @TeymurMardaliyerLennon Yes, that's it. For example, on the first page I get this: [link](http://image.noelshack.com/fichiers/2017/06/1486633870-6626-noelpush.png) [link](http://image.noelshack.com/fichiers/2017/06/1486633912-0973-noelpush.png) – Diamonds Feb 09 '17 at 09:51

1 Answer

You can do this with

preg_match_all(
    '#<span class="topic-count">[^\d]*(\d+)[^\d]*?</span>#s', 
    $html, 
    $matches
);

which captures the digits inside the span and skips the non-digit characters (whitespace, line breaks) around them.
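As a quick check, here is a minimal sketch that runs the pattern against the snippet from the question. The `$html` string is a hand-written stand-in for the downloaded page. The `::before`/`::after` lines and the quotes are left out because they are how the browser inspector displays pseudo-elements and text nodes; they are not part of the source that `file_get_contents` returns.

```php
<?php
// Hand-written stand-in for the page source (an assumption, not real site output).
$html = <<<HTML
<span class="topic-count">
             24
          </span>
<span class="topic-count">
             7
          </span>
HTML;

preg_match_all(
    '#<span class="topic-count">[^\d]*(\d+)[^\d]*?</span>#s',
    $html,
    $matches
);

// $matches[1] holds the captured digit strings, one per matched span.
$counts = array_map('intval', $matches[1]);
print_r($counts);
```

The whitespace around the number is probably also why a bare `(\d+)` in place of `(.*?)` returns nothing: the digits do not sit directly next to the tags.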

However, note that this regex only works for exactly this piece of HTML. If the markup varies even slightly, for instance with another class or another attribute on the span, the pattern will no longer match. Writing reliable regexes for HTML is hard.
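To illustrate that fragility, the sketch below runs the same pattern against two hand-written variants of the markup (both are made up for this example). Only the variant whose opening tag is character-for-character identical matches.

```php
<?php
$pattern = '#<span class="topic-count">[^\d]*(\d+)[^\d]*?</span>#s';

// Hypothetical markup variants, written by hand for illustration.
$exact      = '<span class="topic-count"> 24 </span>';
$extraClass = '<span class="topic-count sticky"> 24 </span>';

var_dump(preg_match($pattern, $exact));      // int(1): the opening tag matches exactly
var_dump(preg_match($pattern, $extraClass)); // int(0): the extra class breaks the match
```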

Hence the recommendation to use a DOM parser instead, e.g.

// Collect libxml warnings about malformed real-world HTML instead of printing them
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.jeuxvideo.com/forums/0-27047-0-1-0-1-0-counter-strike-global-offensive.htm');
libxml_use_internal_errors(false);

// Select every span whose class attribute contains "topic-count"
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate('//span[contains(@class, "topic-count")]') as $node) {
    // Print the first run of digits found in the span's text
    if (preg_match('#\d+#', $node->nodeValue, $topic)) {
        echo $topic[0], PHP_EOL;
    }
}

DOM will parse the entire page into a tree of nodes, which you can then query conveniently via XPath. Note the expression

//span[contains(@class, "topic-count")]

which will give you all the span elements with a class attribute containing the string topic-count. The loop then echoes the first number found in each of these nodes.
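One caveat: `contains(@class, "topic-count")` is a substring test, so it would also select a class such as `topic-count-label`. Below is a minimal sketch of a stricter, token-based match using the common `concat`/`normalize-space` XPath idiom. The inline `$html` and the class names `sticky` and `topic-count-label` are made up for this example, and the counts are collected into an array instead of being echoed.

```php
<?php
// Hand-written sample markup (an assumption, not real site output).
$html = '<div>
  <span class="topic-count"> 24 </span>
  <span class="sticky topic-count"> 7 </span>
  <span class="topic-count-label"> 99 </span>
</div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);

// Pad the class list with spaces so " topic-count " only matches a whole class name.
$query = '//span[contains(concat(" ", normalize-space(@class), " "), " topic-count ")]';

$counts = [];
foreach ($xpath->query($query) as $node) {
    // Keep the first run of digits in the span's text as an integer
    if (preg_match('#\d+#', $node->nodeValue, $m)) {
        $counts[] = (int) $m[0];
    }
}

print_r($counts);
```

Whether the stricter match is needed depends on the class names the page actually uses; for the markup shown in the question, the simpler `contains` test is enough.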

Community
Gordon
  • Thanks, it works perfectly. I will try the DOM parser too! And @Gordon, could you tell me what [^\d]* means/does? – Diamonds Feb 09 '17 at 09:54
  • @Diamonds [] indicates a character class, which matches any one character listed inside it. A ^ at the beginning negates the class, so it matches any character that is not listed. \d is a digit and * means "zero or more", so [^\d]* matches zero or more non-digit characters. See `https://regexper.com/#%5B%5E%5Cd%5D*(%5Cd%2B)%5B%5E%5Cd%5D*%3F`. Also consider https://regexone.com – Gordon Feb 09 '17 at 10:07
  • Thanks, very helpful tools ! – Diamonds Feb 09 '17 at 10:17
  • Is there a way to improve my script's execution speed? Currently it takes ~0.5 sec per page. Can I optimize it? – Diamonds Feb 09 '17 at 12:09
  • @Diamonds Most of that is likely HTTP overhead. You can't do much about it. Make sure to use the latest PHP version and have OPcache enabled. – Gordon Feb 09 '17 at 12:31