
I have this link: Alchemilla vulgaris. It is a Google Images link for pictures of a certain herb. I want to search the source of that page for <div> tags that have a data-id attribute and extract the data-id values using preg_match_all.

I have the code below, but it does not return any results. I think the problem is in the regular expression. Can you please help me get it right?

<!DOCTYPE HTML>
<html lang="sk">
    <head>
        <meta charset="UTF-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Image searcher</title>
    </head>
    <body>
        <?php
            $search_query = "Alchemilla vulgaris";
            $search_query = urlencode( $search_query );
            $url = "https://www.google.com/search?q=$search_query&tbm=isch&ved=2ahUKEwi_0dbpjJrxAhUU_BoKHVkmDOwQ2-cCegQIABAA&oq=$search_query"; 
            echo $url;
            echo "\n"; 
            $html = file_get_contents( $url );
            preg_match_all('#<div\s.*?(?:data-id=[\'"](.*?)[\'"]).*?>#is',$html, $matches );
            var_dump($matches);
        ?>
    </body>
</html>

Thank you

Wiktor Stribiżew
    Why not use the onboard functions of `DOMDocument`? Nobody is using regex for that. See https://stackoverflow.com/questions/26240471/simple-dom-php-parse-get-custom-data-attribute-value – Daniel W. Jul 12 '21 at 11:29

1 Answer

First and foremost, a necessary initiation if you haven't seen it yet: https://stackoverflow.com/a/1732454/4907162

So yes, as pointed out in the comments, a true DOM/XML parser would be much more appropriate. Regex has its time and place, and parsing HTML with it really isn't the best approach, but it is doable for narrow tasks like this one.
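For comparison, here is a minimal sketch of the parser route using PHP's built-in DOMDocument and DOMXPath. The helper name extractDataIds and the sample markup are made up for illustration; the real page from Google will look different, so treat this as a starting point rather than a finished scraper.

```php
<?php
// Hypothetical helper: return the data-id value of every <div> that has one.
function extractDataIds(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid, so collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // XPath: every <div> element that carries a data-id attribute.
    $xpath = new DOMXPath($doc);
    $ids = [];
    foreach ($xpath->query('//div[@data-id]') as $div) {
        $ids[] = $div->getAttribute('data-id');
    }
    return $ids;
}

// Made-up sample markup standing in for the fetched page.
$sample = '<div class="a" data-id="abc123"><p>x</p></div>'
        . '<div>no id here</div>'
        . "<div data-id='xyz'></div>";

print_r(extractDataIds($sample));
```

The XPath expression does the job the regex is trying to do, but it is not thrown off by attribute order, quoting style, or line breaks inside a tag.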

A few points to note:

Google doesn't like bots scraping it, and you might even be asked to solve a CAPTCHA or reCAPTCHA if you look like one. At the time of writing (this may change), if your User-Agent doesn't match a "friendly" known UA, you get filtered and receive different HTML. You may have done an echo $html; just to check that you were getting content, but if you search that output manually you will see it does not include the data-id string you're trying to find.

So in your situation, using the PHP function file_get_contents, you'll want to do something like:

$opts = array('http' =>
  array(
    'header' => 'User-Agent: Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.67'
  )
);

$context  = stream_context_create($opts);

$html = file_get_contents( $url, false, $context );

For the regex, it's a small change to make:

preg_match_all('#<div\s[^>]*?(?:data-id=[\'"](.*?)[\'"]).*?>#is',$html, $matches );

While I was trying to get the script to work at all, I ended up with this regex, if you'd like to see another way:

preg_match_all('#<div\s+[^>]*data-id=[\'"]([^\'"]+)[\'"][^>]*>#is', $html, $matches );
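To see what a pattern of this shape captures without hitting Google at all, it can be run against a small hand-written string. The sketch below uses made-up markup and a hypothetical helper name, matchDataIds. It uses [^>]* before data-id so that a tag whose first attribute is data-id also matches; with [^>]+ in that position, at least one extra character is required between the whitespace and data-id, so such a tag would be skipped.

```php
<?php
// Hypothetical helper: return the captured data-id values from <div> tags.
function matchDataIds(string $html): array
{
    // <div, whitespace, any other attributes, data-id="..." or data-id='...',
    // then the rest of the tag up to the closing >.
    $pattern = '#<div\s+[^>]*data-id=[\'"]([^\'"]+)[\'"][^>]*>#is';
    preg_match_all($pattern, $html, $matches);

    // $matches[0] holds the full tags, $matches[1] the captured values.
    return $matches[1];
}

// Made-up sample markup: data-id in the middle, absent, and as first attribute.
$sample = '<div class="x" data-id="abc123" role="listitem">'
        . '<div>plain</div>'
        . "<div data-id='xyz'>";

var_dump(matchDataIds($sample));
```

Because [^>] cannot cross a closing angle bracket, each match stays inside a single tag, which is the main thing the original .*? version got wrong.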

Tim Toady Bicarbonate


To answer your comment as best I was able to find out (maybe someone else can elaborate more):

In PHP, the context passed to file_get_contents lets you attach additional information, such as request headers, to the HTTP request made for a URL.

If you test file_get_contents on a URL for a server you own, you might notice in the logs that the User-Agent is empty. At least on the server I'm using, the User-Agent is an empty string. The context lets you specify the User-Agent that is sent to the server you're pulling data from.

The server you're pulling data from processes the rest of the request. In the case of Google, they do check the User-Agent, so you'll want to use a "known friendly" (as I call it) User-Agent.

In short, a stream context lets you provide the information the server expects to see. At least, that's how I can describe it for PHP in the context of reading file/URL resources.
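As a rough illustration of what a context holds, the sketch below builds one and prints its options without making any network request. The helper name buildHttpContext and the ExampleAgent/1.0 string are placeholders, and the Accept-Language header and timeout are only there to show that other HTTP options sit alongside the User-Agent; they are not something Google is known to require.

```php
<?php
// Hypothetical helper: build an HTTP stream context with a chosen User-Agent.
function buildHttpContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,                // sent as the User-Agent header
            'header'     => "Accept-Language: en\r\n", // any further request headers
            'timeout'    => 10,                        // read timeout in seconds
        ],
    ]);
}

// Placeholder UA for illustration, not a real browser string.
$context = buildHttpContext('ExampleAgent/1.0');

// Inspect what the context carries; nothing is fetched here.
print_r(stream_context_get_options($context));

// The context is then passed as the third argument when fetching:
// $html = file_get_contents($url, false, $context);
```

The context is just a bag of per-protocol options. The http wrapper reads the 'http' entry when file_get_contents opens the URL and turns those options into the outgoing request.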

I hope this helps. I'll admit I'm not sure how to respond with more useful information.

ŽaMan
    Thank you very much. Eventually I decided to use DOMDocument, but I found your response very useful. Thank you – Oliver Kurnava Jul 12 '21 at 18:12
    I have just one last question. The first 5 lines of code you sent look like a declaration. I see it helps, but I do not understand it. Can you please tell me how it works, or what it is called, so I can google more about it myself? Thank you. – Oliver Kurnava Jul 23 '21 at 11:52
    Of course, it's a good question. You'll want to start with the information at https://www.php.net/manual/en/function.stream-context-create.php - the rest can be found in the PHP sources for what it's actually doing. I'll update the answer. – ŽaMan Jul 25 '21 at 21:12