
Goal: I want to scrape the word "Paris" inside an iframe using cURL.

Say you have a simple page containing an iframe:

<html>
<head>
<title>Curl into this page</title>
</head>
<body>

<iframe src="france.html" title="test" name="test"></iframe>

</body>
</html>

The iframe page:

<html>
<head>
<title>France</title>
</head>
<body>

<p>The Capital of France is: Paris</p>

</body>
</html>

My cURL script:

<?php

// 1. initialize

$ch = curl_init();

// 2. The URL containing the iframe

$url = "http://localhost/test/index.html";

// 3. set the options, including the url

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 2);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

// 4. execute and fetch the resulting HTML output by putting into $output

$output = curl_exec($ch);

// 5. free up the curl handle

curl_close($ch);

// 6. Scrape for a single string/word ("Paris") 

preg_match("'The Capital of France is:(.*?). </p>'si", $output, $match);
if($match) 

// 7. Display the scraped string 

echo "The Capital of France is: ".$match[1];

?>

Result = nothing!

Can someone help me find out the capital of France?! ;)

I need an example of:

  1. parsing/grabbing the iframe url
  2. curling the url (as I've done with the index.html page)
  3. parsing for the string "Paris"

Thanks!

tony gil
ven
  • This is not a cURL script, it's a PHP script. Don't confuse it with the library. And don't parse HTML with regex! – sidyll Dec 07 '11 at 00:01
  • I don't see the part where you're loading the iframe. You first have to scrape the index page for any iframes, then load and scrape each of those. (PS: as per [this question](http://stackoverflow.com/questions/292926/robust-mature-html-parser-for-php) you should use [DOMDocument->loadHTML()](http://docs.php.net/manual/en/domdocument.loadhtml.php) for HTML parsing with PHP and not regular expressions) – CanSpice Dec 07 '11 at 00:02
  • Can you like, accept any answers? – FailedDev Dec 07 '11 at 00:10
  • I just accepted all answers to my previous questions - thanks for pointing that out! – ven Dec 07 '11 at 00:24

3 Answers


Note that occasionally, for a variety of reasons, the iframe URL can't be read outside the context of its own server, and requesting it directly with curl returns some kind of 'can't be read directly or externally' error message.

In these cases you can set curl_setopt($ch, CURLOPT_REFERER, $fullpageurl); (if you're in PHP and reading the text with curl_exec). The request then looks as if it came from the original page, and you can read the source.

So if for whatever reason france.html couldn't be read outside the context of the larger page that includes it as an iframe, you can still get the source with the methods above by using CURLOPT_REFERER and setting the main page (test/index.html in the original question) as the referrer.
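
A minimal sketch of that idea, assuming the URLs from the question. The helper name fetchWithReferer is hypothetical, and the 5 second timeout is an arbitrary choice:

```php
<?php
// Hypothetical helper: fetch $url while presenting $referer as the referring page.
function fetchWithReferer(string $url, string $referer): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);             // arbitrary timeout in seconds
    curl_setopt($ch, CURLOPT_REFERER, $referer);      // sent as the Referer request header

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_exec() returns false on transport errors such as a refused connection.
        throw new RuntimeException("Request for $url failed: " . curl_error($ch));
    }
    return $body;
}

// Usage with the pages from the question (left commented out because it needs the local server):
// $html = fetchWithReferer(
//     'http://localhost/test/france.html',
//     'http://localhost/test/index.html'
// );
```

Whether this helps depends on the server: it only works when the server decides based on the Referer header and not on cookies or a session.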

Barry

--Edit-- You could load the page contents into a string, parse the string for iframe, then load the iframe source into another string.

// Directory that index.html lives in; the iframe src is relative to it.
$baseUrl = 'http://localhost/test/';

$wrapperPage = file_get_contents($baseUrl . 'index.html');

// Capture the value of the iframe's src attribute.
$pattern = '/<iframe[^>]*\ssrc="([^"]+)"/i';

if (!preg_match($pattern, $wrapperPage, $matches)) {
    throw new Exception('No match found!');
}

// "france.html" on its own is not a URL, so prefix the wrapper page's directory.
$src = $baseUrl . trim($matches[1]);

$iframeContents = file_get_contents($src);

var_dump($iframeContents);
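
As an alternative to the regex above, and following the DOMDocument suggestion in the comments on the question, here is a rough sketch of pulling the iframe URL out with a real HTML parser. The function name iframeUrls is hypothetical, and the URL resolution is deliberately simplified: any src without a scheme is assumed to be relative to the directory of the wrapper page, which is true for france.html but not for root-relative paths such as /foo.html.

```php
<?php
// Hypothetical helper: return the src of every iframe in $html,
// resolved against the directory of $pageUrl when the src has no scheme.
function iframeUrls(string $html, string $pageUrl): array
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Everything up to and including the last slash of the page URL.
    $baseDir = substr($pageUrl, 0, strrpos($pageUrl, '/') + 1);

    $urls = [];
    foreach ($doc->getElementsByTagName('iframe') as $iframe) {
        $src = trim($iframe->getAttribute('src'));
        if ($src === '') {
            continue;
        }
        // Simplification: a src that starts with "scheme://" is kept as is,
        // anything else is treated as relative to $baseDir.
        $isAbsolute = preg_match('#^[a-z][a-z0-9+.-]*://#i', $src) === 1;
        $urls[] = $isAbsolute ? $src : $baseDir . $src;
    }
    return $urls;
}

$wrapper = '<html><body><iframe src="france.html" title="test" name="test"></iframe></body></html>';
print_r(iframeUrls($wrapper, 'http://localhost/test/index.html'));
```

With the wrapper page from the question this yields a single entry, http://localhost/test/france.html, which can then be passed to file_get_contents() or curl.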

--Original--

Work on your acceptance rate (accept answers to previously answered questions).

The URL you are giving the curl handle is the file wrapping the iframe. Try setting it to the URL of the iframe instead:

$url = "http://localhost/test/france.html";
Mike Purcell
  • I guess the main problem is I don't know how to scrape the link of the iframe then fetch that then scrape that! Any examples would be appreciated. – ven Dec 07 '11 at 00:14
  • When I curl the iframe page (france.html) everything works fine. I need a way to point it to the index.html first - so I need to do a "curl within a curl" – ven Dec 07 '11 at 00:35
  • giving it a try now but running into: Warning: preg_match() [function.preg-match]: Compilation failed: nothing to repeat at offset 10 in /Applications/XAMPP/xamppfiles/htdocs/curl/1197846/w3.php on line 7 Fatal error: Uncaught exception 'Exception' with message 'No match found!' in /Applications/XAMPP/xamppfiles/htdocs/curl/1197846/w3.php:10 Stack trace: #0 {main} thrown in /Applications/XAMPP/xamppfiles/htdocs/curl/1197846/w3.php on line 10 – ven Dec 07 '11 at 00:46
  • @Dri: Try my code, file_get_contents in place of your curl calls. Curl may not be necessary in this case. According to PHP docs, file_get_contents can read in contents of remote files: http://us2.php.net/file_get_contents – Mike Purcell Dec 07 '11 at 00:51
  • @Dri: Try `var_dump($wrapperPage)` after it gets initialized, see if there is at least content. – Mike Purcell Dec 07 '11 at 01:20
  • @Dri: Updated code to reflect a change with $pattern, give it a shot. – Mike Purcell Dec 07 '11 at 01:35
  • looks like this is not finding the iframe link: $pattern = '/\.*src=\"[a-z]+\.html"\.*/' – ven Dec 07 '11 at 01:44
  • index code:

    What is the capitol of France?

    – ven Dec 07 '11 at 01:45
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/5640/discussion-between-dri-and-digital-precision) – ven Dec 07 '11 at 01:46

To answer your question, your pattern does not match the input text:

          <p>The Capital of France is: Paris</p>

You have an extra space before the closing paragraph tag, which can never match:

preg_match("'The Capital of France is:(.*?). </p>'si", $output, $match);

You should have the space before the capture group and remove the redundant . thereafter:

preg_match("'The Capital of France is: (.*?)</p>'si", $output, $match);

To allow optional whitespace at either of the two positions, use \s* instead:

preg_match("'The Capital of France is:\s*(.*?)\s*</p>'si", $output, $match);

You could also make the capture group only match letters with (\w+) to be more specific.
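
The patterns above can be compared side by side on the paragraph from the question. This is only an illustrative sketch: the labels and the $results array are made up for the comparison, and the patterns are written in single-quoted strings with ~ as the delimiter, which is a stylistic assumption and not a change in meaning.

```php
<?php
// The paragraph from france.html in the question.
$output = '<p>The Capital of France is: Paris</p>';

$patterns = [
    'original'       => '~The Capital of France is:(.*?). </p>~si',     // requires a space before </p>
    'optional space' => '~The Capital of France is:\s*(.*?)\s*</p>~si', // whitespace optional on both sides
    'letters only'   => '~The Capital of France is:\s*(\w+)\s*</p>~si', // capture limited to word characters
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // preg_match() returns 1 on a match, 0 on no match.
    $results[$label] = preg_match($pattern, $output, $match) === 1 ? $match[1] : null;
}

var_dump($results);
```

The original pattern yields null (no match), while the other two capture Paris.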

mario