For learning purposes, I'm trying to fetch data from the Steam Store: if a page contains the image game_header_image_full, I've reached a game. I have two alternatives that both sort of work, but each has a catch. One is really slow, and the other seems to miss some data and therefore doesn't write all the URLs to a text file.

For some reason, Simple HTML DOM managed to catch 9 URLs, whilst the second one (cURL) only caught 8 URLs with preg_match.

Question 1.

Is there markup that $html->find('img.game_header_image_full') would catch, but that my $reg pattern misses with preg_match? Or is the problem something else?

Question 2.

Am I doing things correctly here? I'm planning to go with the cURL alternative, but can I make it faster somehow?

Simple HTML DOM Parser (Time to search 100 ids: 1 min, 39s. Returned: 9 URLs.)

<?php
    include('simple_html_dom.php');

    $i = 0;
    $times_to_run = 100;
    set_time_limit(0);

    while ($i++ < $times_to_run) {
        // Find target image
        $url = "http://store.steampowered.com/app/".$i;
        $html = file_get_html($url);
        $element = $html->find('img.game_header_image_full');

        if($i == $times_to_run) {
            echo "Success!";
        }

        foreach ($element as $key => $value) {
            // Check if image was found. Strict comparison, because strpos()
            // returns 0 (not false) for a match at the start of the string.
            if (strpos($value, 'img') === false) {
                // Do nothing, repeat loop with $i++;

            } else {
                // Add (don't overwrite) to file steam.txt
                file_put_contents('steam.txt', $url.PHP_EOL , FILE_APPEND);
            }
        }
    }
?>

vs. the cURL alternative (Time to search 100 ids: 34s. Returned: 8 URLs.)

<?php

    $i = 0;
    $times_to_run = 100;
    set_time_limit(0);

    while ($i++ < $times_to_run) {

        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_URL, 'http://store.steampowered.com/app/'.$i);
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true);
        $content = curl_exec($ch);

        $url = "http://store.steampowered.com/app/".$i;

        $reg = "/<\\s*img\\s+[^>]*class=['\"][^'\"]*game_header_image_full[^'\"]*['\"]/i";

        if(preg_match($reg, $content)) {
            file_put_contents('steam.txt', $url.PHP_EOL , FILE_APPEND);
        }

    }

?>
Algernop K.
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags I'll just leave this here. – Alex Dec 22 '15 at 01:13
  • @AlexanderMP I'm a scrub, this sort of works. I do apologize. – Algernop K. Dec 22 '15 at 01:14
  • No, I understand. I've used regex like this more than I like to admit. However, don't be surprised when it fails sometimes for some stupid reason. You then have to go through 100 HTML pages manually and figure out which ones should match and which of those don't. I mean sure, regex is fast, and you save up to 60% of the time with it, but it doesn't always work. That's why you use HTML parsers, which are slow but reliable. – Alex Dec 22 '15 at 01:17
  • Also, a good tip for future regex use with PHP: never use double quotes with regex. Why would you do that? Use single quotes and skip all those double and quadruple backslashes. There's almost no need to escape inside the string. Just escape as you would escape a regex in JS or something. – Alex Dec 22 '15 at 01:17
  • So.. The problem lies with RegEx, and there's little I can do about it? :/ – Algernop K. Dec 22 '15 at 01:18
  • You can use another HTML parser. PHP has a few built-in. Feed the response to those HTML parsers and profit! You're already using that Simple HTML DOM Parser. Just load the string using `str_get_html` – Alex Dec 22 '15 at 01:22
  • Will it be as slow as the Simple HTML DOM parser though, and can I simply replace `preg_match` with it? – Algernop K. Dec 22 '15 at 01:26
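To illustrate the single-quote tip from the comments: the sketch below writes the question's pattern both ways and checks that PHP produces the same string. The variable names `$double` and `$single` and the sample tag are made up for this example. In a single-quoted string only `\'` and `\\` are escape sequences, so `\s` can be written as-is, though the literal single quotes inside the character classes still need a backslash.

```php
<?php
// The pattern exactly as written in the question (double-quoted).
$double = "/<\\s*img\\s+[^>]*class=['\"][^'\"]*game_header_image_full[^'\"]*['\"]/i";

// The same pattern single-quoted: \s needs no doubling, only ' is escaped.
$single = '/<\s*img\s+[^>]*class=[\'"][^\'"]*game_header_image_full[^\'"]*[\'"]/i';

// Both literals should yield the same pattern string.
var_dump($double === $single);

// Made-up sample tag, not taken from a real Steam page.
$sampleTag = '<img class="game_header_image_full" src="header.jpg">';
var_dump(preg_match($single, $sampleTag) === 1);
```

This only changes how the pattern is written, not what it matches, so it would not by itself explain the missing URL.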

1 Answer

Well, you shouldn't use regex with HTML. It mostly works, but when it doesn't, you have to go through hundreds of pages, figure out which one is failing and why, correct the regex, and then hope and pray that nothing like that will ever happen again. Spoiler alert: it will.
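As a rough illustration of how such a failure can look, here is a minimal sketch that runs the question's pattern against three markup variants. The variants are invented for this example; nothing in the question shows that Steam actually serves markup like this, so treat it as a demonstration of the failure mode rather than a diagnosis of the missing URL.

```php
<?php
// The question's pattern, written in single quotes.
$reg = '/<\s*img\s+[^>]*class=[\'"][^\'"]*game_header_image_full[^\'"]*[\'"]/i';

// Made-up markup variants for illustration; not copied from real Steam pages.
$samples = [
    'quoted'   => '<img class="game_header_image_full" src="header.jpg">',
    'unquoted' => '<img class=game_header_image_full src="header.jpg">',
    'spaced'   => '<img class = "game_header_image_full" src="header.jpg">',
];

$matches = [];
foreach ($samples as $name => $html) {
    // preg_match() returns 1 on a match, 0 on no match.
    $matches[$name] = preg_match($reg, $html) === 1;
    printf("%-8s %s\n", $name, $matches[$name] ? 'matched' : 'missed');
}
```

Browsers and HTML parsers generally accept all three forms, but the pattern insists on a quote character directly after `class=`, so only the first variant satisfies it.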

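Long story short, read this funny answer: RegEx match open tags except XHTML self-contained tags (http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)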

Don't use regex to parse HTML. Use HTML parsers, which are complicated algorithms that don't use regex and are reliable (as long as the HTML is valid). You are using one already, in the first example. Yes, it's slow, because it does more than just search for a string within a document. But it's reliable. You can also play with other implementations, especially the native ones, like http://php.net/manual/en/domdocument.loadhtml.php
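A minimal sketch of that native route, assuming the page body has already been fetched as a string (for example with cURL, as in the question). The function names `hasImageWithClass` and `fetchPage` are hypothetical helpers invented for this example, and the sample markup is made up. Whether this is faster than Simple HTML DOM on real Steam pages is not something this sketch measures.

```php
<?php
// Hypothetical helper: true if $html contains an <img> carrying $class.
// $class is assumed to be a plain class name without quote characters.
function hasImageWithClass(string $html, string $class): bool
{
    if ($html === '') {
        return false; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world pages are rarely valid HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word inside the class attribute.
    $query = sprintf(
        '//img[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    return $xpath->query($query)->length > 0;
}

// Hypothetical helper: fetch a page body using an existing cURL handle.
// Not called here, so the demonstration below runs offline.
function fetchPage($ch, string $url): string
{
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);

    return is_string($body) ? $body : '';
}

// Offline demonstration on made-up markup.
$sample = '<div><img class="game_header_image_full" src="header.jpg"></div>';
var_dump(hasImageWithClass($sample, 'game_header_image_full'));
```

In the question's loop, a call such as `hasImageWithClass(fetchPage($ch, $url), 'game_header_image_full')` could stand in for the `preg_match` check, with `curl_init()` called once before the loop so the same handle is reused for every request.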

Alex