First and foremost, a necessary initiation if you haven't seen this: https://stackoverflow.com/a/1732454/4907162
So yes, as pointed out in the comments, a true DOM/XML parser would be much more appropriate. Regex has its time and place, and HTML parsing really isn't it, but it is of course doable for some things.
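For comparison, here is a minimal sketch of the DOM-parser route using PHP's bundled DOMDocument and DOMXPath classes. The sample markup is made up for illustration; in your script $html would be the fetched page instead.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<div class="a" data-id="101">one</div>'
      . '<div data-id=\'202\'>two</div>'
      . '<div class="no-id">three</div>';

// Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Collect the data-id value of every <div> that has one.
$ids = [];
foreach ($xpath->query('//div[@data-id]') as $div) {
    $ids[] = $div->getAttribute('data-id');
}

print_r($ids); // contains '101' and '202'
```

The parser takes care of quoting styles and attribute order, which is exactly the part that gets fiddly with regex.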
A few points to note:
Google doesn't like bots scraping it; you might even get asked to solve a (re-)?captcha if you look like a bot. So, at this time (this may change in the future), if your User-Agent doesn't match a "friendly" known UA, you get filtered out and receive a different HTML result. I'm sure you have done an echo $html; just to see that you were getting content, but if you search that output manually you will see it does not include the data-id string you're trying to find.
So for your situation, using the PHP function file_get_contents, you'll want to do something like:
    $opts = array(
        'http' => array(
            // Send a browser-like User-Agent so Google returns the same HTML a browser would get.
            'header' => 'User-Agent: Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.67'
        )
    );
    $context = stream_context_create($opts);
    $html = file_get_contents($url, false, $context);
For the regex, it's a small change to make:
preg_match_all('#<div\s[^>]*?(?:data-id=[\'"](.*?)[\'"]).*?>#is',$html, $matches );
While I was trying to simply get the script to work at all, I ended up creating this regex, if you'd like to see another way:
preg_match_all('#<div\s+[^>]+data-id=[\'"]([^\'"]+)[\'"][^>]*>#is', $html, $matches );
Tim Toady Bicarbonate
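If you want to try both patterns without hitting Google, here is a small sketch that runs them against a made-up HTML string (the sample markup and the variable names $patternA, $patternB, $matchesA and $matchesB are only for illustration). The captured ids end up in index 1 of the matches array.

```php
<?php
// Hypothetical sample markup; the second div has data-id as its first attribute.
$html = '<div class="result" data-id="abc123">x</div>' . "\n"
      . "<div data-id='def456'>y</div>\n"
      . '<div class="no-id">z</div>';

// The two patterns from above, copied as-is.
$patternA = '#<div\s[^>]*?(?:data-id=[\'"](.*?)[\'"]).*?>#is';
$patternB = '#<div\s+[^>]+data-id=[\'"]([^\'"]+)[\'"][^>]*>#is';

preg_match_all($patternA, $html, $matchesA);
preg_match_all($patternB, $html, $matchesB);

// Index 0 holds the full tag matches, index 1 holds the captured data-id values.
print_r($matchesA[1]); // abc123, def456
print_r($matchesB[1]); // abc123 only
```

One difference worth knowing: the second pattern needs at least one character between the whitespace after <div and data-id, so it skips a tag where data-id is the first attribute after a single space. The first pattern picks that case up.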
To answer your comment as best I was able to find (maybe someone else can elaborate more):
In PHP, the context passed to file_get_contents lets you attach additional information, such as HTTP headers, to the request made for a URL.
If you were to test file_get_contents on a URL for a server you own, you might notice in the logs that the User-Agent is empty. At least on the server I'm using, the User-Agent is an empty string. The context allows you to specify the User-Agent passed to the server you're trying to pull data from.
The server you're pulling data from processes the rest of the information. In Google's case, they do check the User-Agent. You'll want to use a "known friendly" (as I call it) User-Agent.
A stream context lets you provide the information that the server expects to see. Or at least that's how I can describe it for PHP in the context of file/URL resource reading.
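To make that a bit more concrete, here is a sketch of a context carrying a few HTTP options, inspected with stream_context_get_options so that no request is actually sent. The specific values (the UA string, the Accept-Language header, the 10 second timeout) are my own assumptions for illustration, not anything Google requires.

```php
<?php
// Assumed browser-like UA string; any mainstream browser UA should behave similarly.
$userAgent = 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36';

$context = stream_context_create([
    'http' => [
        'method'     => 'GET',
        'user_agent' => $userAgent, // dedicated option instead of a raw "User-Agent:" header line
        'header'     => "Accept-Language: en-US,en;q=0.9\r\n",
        'timeout'    => 10,         // seconds to wait before giving up
    ],
]);

// Inspect what the context carries, without making a network request.
$options = stream_context_get_options($context);
echo $options['http']['user_agent'], PHP_EOL;
```

The same $context can then be passed as the third argument to file_get_contents, just like in the snippet further up.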
I hope this helps. I'll admit I'm not sure how to respond with more useful information.