
I am trying to parse an HTML page:

<div class="gs_r"><h3 class="gs_rt"><span class="gs_ctc">[BOOK]</span> <a href="http://exampleA.com" onmousedown="return scife_clk(this.href,'','res','1')">titleA</a></h3><div class="gs_ggs gs_fl"><a href="http://exampleApdf.pdf" onmousedown="return scife_clk(this.href,'gga','gga','1')">
<div class="gs_r"><h3 class="gs_rt"><span class="gs_ctc">[BOOK]</span> <a href="http://exampleB.com" onmousedown="return scife_clk(this.href,'','res','1')">titleB</a></h3><div class="gs_ggs gs_fl"><a href="http://exampleB.doc" onmousedown="return scife_clk(this.href,'gga','gga','1')">

From that HTML page, we can get the following information: page links (http://exampleA.com, http://exampleB.com), titles (titleA, titleB), and document links (http://exampleApdf.pdf, http://exampleB.doc).

However, I only want the information for documents that have a PDF link. In this example, that means http://exampleA.com, titleA, and http://exampleApdf.pdf.

I have tried, but I get a blank result. How can I get them? Thank you! :)

Here's the code:

<?php

include 'simple_html_dom.php';
$url = 'http://scholar.google.com/scholar?hl=en&q=data+mining&btnG=&as_sdt=1%2C5&as_sdtp=';
$html = file_get_html($url);
foreach ($html->find('div[class=gs_ggs gs_fl]') as $pdfLink) {
    if (preg_match('/\.pdf$/i', $pdfLink)) {
        $html2->find('span[class=gs_ctc]');
        echo $html2.$pdfLink;
    }
}

?>
bruine

1 Answer


You cannot determine from the URL what kind of resource will be returned.

Not everyone serves PDF files with a .pdf extension, and not all web services reveal the names of files on disk. Only the Content-Type HTTP response header should be used to determine the type of the resource.

You can get this efficiently by doing a HEAD request for each URL you find.
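As a rough sketch of that HEAD-request check, assuming the cURL extension is available: the function names isPdfContentType and fetchContentType are hypothetical, and the 5-second timeout and the choice to follow redirects are illustrative defaults, not requirements. Some servers reject HEAD requests or report an inaccurate Content-Type, so treat the result as a best guess.

```php
<?php
// Hypothetical helper: decide whether a Content-Type header value denotes a PDF.
function isPdfContentType(?string $contentType): bool
{
    if ($contentType === null || $contentType === '') {
        return false;
    }
    // Drop parameters such as "; charset=binary" and compare case-insensitively.
    $mediaType = strtolower(trim(explode(';', $contentType)[0]));
    return $mediaType === 'application/pdf';
}

// Hypothetical helper: send a HEAD request and return the reported Content-Type,
// or null if the request fails or the server sends no Content-Type.
function fetchContentType(string $url, int $timeoutSeconds = 5): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // HEAD request: headers only, no body
        CURLOPT_RETURNTRANSFER => true,  // do not echo anything to output
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects to the final resource
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $ok = curl_exec($ch);
    $type = ($ok !== false) ? curl_getinfo($ch, CURLINFO_CONTENT_TYPE) : null;
    return is_string($type) ? $type : null;
}
```

For each document URL found on the page, isPdfContentType(fetchContentType($url)) would then tell you whether the server claims to return a PDF, whatever the URL looks like.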

Brad
  • Oh yes, thank you, I'll learn about it. But is it OK to combine cURL and simple_html_dom at the same time? I need to get the links and titles too. – bruine Jul 18 '12 at 01:27
  • @igos, Yes, absolutely. Remember to set a timeout on your cURL requests. – Brad Jul 18 '12 at 02:09
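
One possible way to combine the two steps is sketched below. It uses PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom so that it has no external dependency; the same structure should carry over to simple_html_dom. The function names extractResults and filterPdfResults are hypothetical, the class names (gs_r, gs_rt, gs_ggs) are taken from the sample markup in the question, and $isPdf is a caller-supplied check, for example one based on the Content-Type returned by a cURL HEAD request with a timeout.

```php
<?php
// Hypothetical helper: pull the page link, title and document link out of each result block.
function extractResults(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // scraped markup is rarely valid; collect parser warnings quietly
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $results = [];
    // Match <div> elements whose class list contains the token "gs_r".
    $blocks = $xpath->query('//div[contains(concat(" ", normalize-space(@class), " "), " gs_r ")]');
    foreach ($blocks as $block) {
        $titleLink = $xpath->query('.//h3[contains(@class, "gs_rt")]/a', $block)->item(0);
        $docLink   = $xpath->query('.//div[contains(@class, "gs_ggs")]/a', $block)->item(0);
        if ($titleLink === null || $docLink === null) {
            continue; // skip results without a title link or an attached document
        }
        $results[] = [
            'page'     => $titleLink->getAttribute('href'),
            'title'    => trim($titleLink->textContent),
            'document' => $docLink->getAttribute('href'),
        ];
    }
    return $results;
}

// Hypothetical helper: keep only results whose document link passes the caller-supplied $isPdf check.
function filterPdfResults(array $results, callable $isPdf): array
{
    return array_values(array_filter($results, function (array $r) use ($isPdf) {
        return $isPdf($r['document']);
    }));
}
```

Keeping the check as a callable means the parsing can be tested without network access, and the slow part (one request per document link) stays in one place where a timeout can be enforced.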