<?php
$page = file_get_contents("https://www.google.com");

preg_match('#<div id="searchform" class="jhp big">(.*?)</div>#Uis', $page, $matches);

print_r($matches);
?>

The code above is supposed to grab a specific part of another web page (in this case Google). Unfortunately it is not working, and I'm not sure why, since I expected the regular expression to capture everything inside the div.

Help would be appreciated!

Brian Tompsett - 汤莱恩
  • You should use an HTML/XML parser when working with HTML. Regular expressions are a *general* solution, parsers are purpose-built solutions. Always use purpose-built solutions if they exist. – Sverri M. Olsen Aug 20 '15 at 08:54
  • Does it work on your own site? – B001ᛦ Aug 20 '15 at 08:54
  • `U` reverses greedy/lazy quantifiers, and this is the problem here, I guess. Remove the `U` modifier, and `.*?` will match as few characters as possible. However, why use a regex to fetch the HTML tag contents? Use a DOM parser. – Wiktor Stribiżew Aug 20 '15 at 09:01
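
The effect of the `U` modifier described in the last comment can be checked on a small string. This is only a sketch: the sample markup is made up for illustration and has nothing to do with Google's actual page source.

```php
<?php
// Hypothetical sample markup, not taken from any real page
$html = '<div id="a">one</div><div id="b">two</div>';

// With U, the lazy-looking .*? becomes greedy and runs to the last </div>
preg_match('#<div id="a">(.*?)</div>#Uis', $html, $withU);

// Without U, .*? is lazy and stops at the first </div>
preg_match('#<div id="a">(.*?)</div>#is', $html, $withoutU);

echo $withU[1], "\n";    // one</div><div id="b">two
echo $withoutU[1], "\n"; // one
```

Dropping `U` only fixes the greediness. The pattern still depends on the exact attribute order and class names in the markup.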

2 Answers


According to the source of the page you are fetching, no element with that exact structure exists. This is one of the reasons why parsing HTML with regular expressions is not recommended.

Using DOMDocument's getElementById() seems to do what you are after:

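<?php
$page = file_get_contents("https://www.google.com");

// Collect libxml warnings about real-world HTML instead of printing them
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($page);
$result = $doc->getElementById('searchform');

// print_r() does not output the element's markup, so serialize the node instead
if ($result !== null) {
    echo $doc->saveHTML($result);
}
?>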
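
To see why the ID lookup is less brittle than the regular expression, here is a small offline sketch. The markup is hypothetical (Google's real page may look nothing like it). It only shows that a different attribute order or class list defeats the pattern from the question, while `getElementById()` still finds the element.

```php
<?php
// Hypothetical markup: attribute order and class differ from what the regex expects
$html = '<html><body><div class="jhp" id="searchform"><p>Search</p></div></body></html>';

// The pattern from the question (without U) finds nothing here
$regexHit = preg_match('#<div id="searchform" class="jhp big">(.*?)</div>#is', $html, $m);

// Collect parser warnings quietly, then look the element up by its id
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$node = $doc->getElementById('searchform');

var_dump($regexHit);              // int(0)
echo $doc->saveHTML($node), "\n"; // markup of the matched div
```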

EDIT:

You could use the code below:

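<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://google.com');
// Follow 301/302 redirects instead of stopping at the first response
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
// Return the body as a string; without this, curl_exec() prints it and returns true
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

$page = curl_exec($curl);
curl_close($curl);

// Collect libxml warnings about malformed HTML instead of printing them
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($page);
$result = $doc->getElementById('searchform');

// Serialize the node; print_r() would not output its markup
if ($result !== null) {
    echo $doc->saveHTML($result);
}
?>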

You might need to refer to this question though since you might need to change some settings.

Community
npinti
  • I'm receiving the error 404 file not found / 302 file moved on every single website I try. – Testuser075 Aug 20 '15 at 09:09
  • @Testuser075: And you were not getting this error before? Because that is the same method you were using originally. – npinti Aug 20 '15 at 09:15
  • Not when I directly echoed out the output retrieved from the file_get_contents – Testuser075 Aug 20 '15 at 09:23
  • @Testuser075: I've edited my answer some time ago, I do not know if it helps you with your issue. – npinti Aug 20 '15 at 12:03
  • I didn't notice you had edited the post. Unfortunately though, it did not get me any further. I'd like to write this using regex (if possible). The other solution you've posted (DOMDocument) was great as well, but I can't find a way to get it working :( – Testuser075 Aug 20 '15 at 13:24
  • @Testuser075: The problem is that regular expressions aren't sophisticated enough to parse HTML, so although you could come up with something which works, it would be rather brittle, and that would be me giving bad advice. Is there any particular reason why you would like to push using regular expressions? – npinti Aug 20 '15 at 13:29
  • There is no specific reason, I just can't seem to understand cURL. Luckily though I understand the first option you've posted. The only problem is that I'm not able to get it working. I keep receiving the 302 "content moved" response (on every single website). – Testuser075 Aug 20 '15 at 13:31
  • @Testuser075: From what I have found (not really a PHP user myself), `file_get_contents` sometimes gets stuck when the file is moved. I researched a bit and it seems that `curl` is more flexible when it comes to redirections. Essentially both of them try to give you the HTML content of the page. – npinti Aug 20 '15 at 13:46
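
If cURL feels unfamiliar, `file_get_contents` also accepts a stream context as its third argument, which is where redirect behaviour and request headers can be set. The sketch below only builds and prints such a context. The values, and in particular the user agent string, are assumptions for illustration, and nothing here is confirmed to resolve the 302 responses discussed in the comments above.

```php
<?php
// Assumed settings for illustration; the user agent string is made up
$context = stream_context_create([
    'http' => [
        'follow_location' => 1,    // follow Location headers (already the default)
        'max_redirects'   => 5,    // give up after five hops
        'user_agent'      => 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)',
        'ignore_errors'   => true, // still return the body on 4xx/5xx responses
    ],
]);

// Pass the context as the third argument when fetching:
// $page = file_get_contents('https://www.google.com', false, $context);

// Show the options that would be sent with the request
print_r(stream_context_get_options($context));
```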

DOMXPath would be a better choice for you. Here is an example:

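<?php

$content = file_get_contents('https://www.google.com');

// Escape bare ampersands, which DOMDocument complains about
$content = preg_replace("/&(?!(?:apos|quot|[gl]t|amp);|#)/", '&amp;', $content);

// Collect libxml warnings about malformed HTML instead of printing them
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList; print the markup of each matching div
$items = $xpath->query('//div[@id="searchform"]');
foreach ($items as $item) {
    echo $doc->saveHTML($item);
}
?>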
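
The same steps can be tried offline on a small string to see what the ampersand clean-up and the XPath query each do. The markup below is hypothetical. Note that the pattern only whitelists the five listed entities and numeric references, so other named entities such as `&nbsp;` would also get escaped.

```php
<?php
// Hypothetical markup with a bare ampersand in the text
$html = '<html><body><div id="searchform">Tom & Jerry &lt;3 &amp; more</div></body></html>';

// Same pattern as above: escape & unless it starts a listed entity or a numeric reference
$clean = preg_replace('/&(?!(?:apos|quot|[gl]t|amp);|#)/', '&amp;', $html);

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($clean);
$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList of every div whose id attribute is "searchform"
$nodes = $xpath->query('//div[@id="searchform"]');

echo $nodes->length, "\n";               // 1
echo $nodes->item(0)->textContent, "\n"; // Tom & Jerry <3 & more
```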
Exploit