
I have made this:

<html>
    <head>
        <script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
        <script>
            $(document).ready(
                function()
                {   
                    $("body").html($("#HomePageTabs_cont_3").html());
                }
            );
        </script>
    </head>
    <body>
    <?php
        echo file_get_contents("http://www.bankasya.com.tr/index.jsp");
    ?>

    </body>
</html>

When I check my page with Firebug, it gives countless "missing file" errors (images, CSS files, JS files, etc.). I only want a part of the page, not all of it. This code does what I want, but I am wondering if there is a better way.

EDIT:

The code does what I need. I do not need all the contents, so an iframe is useless to me. I just want the raw data of the div #HomePageTabs_cont_3.

zkanoca
  • Seems like what you really want is a proxy server, or possibly a simple iframe. If you want to resolve the missing files you will need to either parse the HTML server side and convert all resource URLs to absolute, or do the same thing client side (server side would be easier but would likely appear slower to the user). – DaveRandom Apr 22 '13 at 09:35
  • I just want a part of the page. Does not `iframe` get all contents? I do not want `iframe`. – zkanoca Apr 22 '13 at 09:38
  • Yes, it will. Then probably you need to specify: which part of "all contents" do you want in your page.. – UltraInstinct Apr 22 '13 at 09:39
  • @OzkanOzlu You will want to load the HTML into a [`DOMDocument`](http://php.net/domdocument) and extract the content you want on the server side. A simple `file_get_contents()` won't cut it. – DaveRandom Apr 22 '13 at 09:46
  • @OzkanOzlu: just so you are aware (tagging you for it), I have just made an extensive modification to the code. It does not barf out on their (failed) HTML semantics anymore. – Sébastien Renauld Apr 22 '13 at 10:45
  • @OzkanOzlu: One more edit. Actually went all the way and reformatted the data to an instance of `stdClass` so you can do whatever you like with it :-) – Sébastien Renauld Apr 22 '13 at 11:15

3 Answers


Your best bet is server-side parsing with PHP. I have written a small snippet to show you how to do this using DOMDocument (and possibly tidy, if your server has it, to barf out all the malformed XHTML foos).

Caveat: outputs UTF-8. You can change this in the constructor of DOMDocument.

Caveat 2: WILL barf out if its input is neither utf-8 nor iso-8859-9. The current page's charset is iso-8859-9 and I see no reason why they would change this.

header("content-type: text/html; charset=utf-8");
$data = file_get_contents("http://www.bankasya.com.tr/index.jsp");
// Clean it up
if (class_exists("tidy")) {
    $dataTidy = new tidy();
    $dataTidy->parseString($data, array(
        "input-encoding"  => "iso-8859-9",
        "output-encoding" => "iso-8859-9",
        "clean"           => 1,
        "input-xml"       => true,
        "output-xml"      => true,
        "wrap"            => 0,
        "anchor-as-name"  => false
    ));
    $dataTidy->cleanRepair();
    $data = (string)$dataTidy;
} else {
    // No tidy available: strip every <script> block manually
    $do = true;
    while ($do) {
        $start = stripos($data, '<script');
        $stop  = stripos($data, '</script>');
        if (is_numeric($start) && is_numeric($stop)) {
            $data = substr($data, 0, $start).substr($data, $stop + strlen('</script>'));
        } else {
            $do = false;
        }
    }
    // nbsp breaks the XML parser
    $data = str_replace("&nbsp;", " ", $data);
    // Fixes for any element that requires a self-closing tag
    if (preg_match_all("/<(link|img)([^>]+)>/is", $data, $mt, PREG_SET_ORDER)) {
        foreach ($mt as $v) {
            if (substr($v[2], -1) != "/") {
                $data = str_replace($v[0], "<".$v[1].$v[2]."/>", $data);
            }
        }
    }
    // Barf out the inline JS
    $data = preg_replace("/javascript:[^;]+/is", "#", $data);
    // Barf out the noscripts
    $data = preg_replace("#<noscript>(.+?)</noscript>#is", "", $data);
    // Muppets. Malformed comment = one more regexp when they could just learn to write proper HTML...
    $data = preg_replace("#<!--(.*?)--!?>#is", "", $data);
}
$DOM = new \DOMDocument("1.0", "utf-8");
$DOM->recover = true;
function error_callback_xmlfunction($errno, $errstr) { throw new Exception($errstr); }
$old = set_error_handler("error_callback_xmlfunction");
// Throw out all the XML namespaces (if any)
$data = preg_replace("#xmlns=[\"\']?([^\"\']+)[\"\']?#is", "", (string)$data);
try {
    $DOM->loadXML(((substr($data, 0, 5) !== "<?xml") ? '<?xml version="1.0" encoding="utf-8"?>' : "").$data);
} catch (Exception $e) {
    $DOM->loadXML(((substr($data, 0, 5) !== "<?xml") ? '<?xml version="1.0" encoding="iso-8859-9"?>' : "").$data);
}
restore_error_handler();
error_reporting(E_ALL);
$DOM->substituteEntities = true;
$xpath = new \DOMXPath($DOM);
echo $DOM->saveXML($xpath->query("//div[@id=\"HomePageTabs_cont_3\"]")->item(0));

In order of appearance:

  • Fetch the data
  • If we have tidy, sanitize HTML with it
  • Create a new DOMDocument and load our document ((string)$dataTidy is a short-hand tidy getter)
  • Create an XPath request path
  • Use XPath to request all divs with the id we want, take the first item of the collection (->item(0), which will be a DOMElement) and ask the DOM to output its XML content (including the tag itself)

Hope it is what you're looking for... Though you might want to wrap it in a function.
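For what it's worth, on newer PHP installs the cleanup-plus-parse dance above can often be collapsed into one small helper: DOMDocument::loadHTML() recovers from malformed markup on its own once libxml warnings are silenced. A hedged sketch (the function name and arguments are mine, not part of the answer's code):

```php
<?php
// Minimal sketch: extract one element by id from possibly malformed HTML.
// loadHTML() is far more forgiving than loadXML(), so the tidy/regexp
// cleanup can usually be skipped entirely.
function extractDivById($html, $id)
{
    $doc = new DOMDocument("1.0", "utf-8");
    libxml_use_internal_errors(true);   // swallow warnings from bad markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    $node = $doc->getElementById($id);
    // Serialize the matched element (tag included), or null if not found
    return $node !== null ? $doc->saveHTML($node) : null;
}

// Usage against the page from the question would be:
// echo extractDivById(file_get_contents("http://www.bankasya.com.tr/index.jsp"),
//                     "HomePageTabs_cont_3");
```
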

Edit

Forgot to mention: http://rescrape.it/rs.php for the actual script output!

Edit 2

Correction, that site is not W3C-valid, and therefore, you'll either need to tidy it up or apply a set of regular expressions to the input before processing. I'm going to see if I can formulate a set to barf out the inconsistencies.

Edit 3

Added a fix for all those of us who do not have tidy.

Edit 4

Couldn't resist. If you'd actually like the values rather than the table, use this instead of the echo:

$d = new stdClass();
$rows = $xpath->query("//div[@id=\"HomePageTabs_cont_3\"]//tr");
$rc = $rows->length;
for ($i = 1; $i < $rc - 1; $i++) {
    $cols = $xpath->query($rows->item($i)->getNodePath()."/td");
    $d->{$cols->item(0)->textContent} = array(
        (float)$cols->item(1)->textContent,
        (float)$cols->item(2)->textContent
    );
}

I don't know about you, but for me, data works better than malformed tables.
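If you then want to ship that object to the jQuery/AJAX insert mentioned in the comments, json_encode() does the last step. A minimal, self-contained sketch, with made-up rates standing in for the scraped ones:

```php
<?php
// Illustrative only: $d here is hand-filled with invented currency rows,
// mirroring the shape the loop above builds from the scraped table.
$d = new stdClass();
$d->USD = array(1.785, 1.795);
$d->EUR = array(2.328, 2.342);

// json_encode() turns the object into a payload ready for an AJAX request
echo json_encode($d);
```
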

(Welp, that one took a while to write)

Sébastien Renauld
  • I was just researching to grab the data in the table to insert as a record using jQuery ajax. That will be good for me. Thank you so much. You read my mind. – zkanoca Apr 22 '13 at 11:27
  • I was trying to fly with a few feathers, you have given me a jet-plane. – zkanoca Apr 22 '13 at 11:59
  • Nothing wrong with cunning solutions :-) I'm considering writing a tutorial on `DOMDocument` usage and data scraping in general, by the way, seeing how many of my answers tend to gravitate towards this topic. – Sébastien Renauld Apr 22 '13 at 12:03

I'd get in touch with the remote site's owner and ask if there was a data feed I could use that would just return the content I wanted.

Gareth

Sébastien's answer is the best solution, but if you want to stick with jQuery you can add a `<base>` tag to the head section of your page to avoid "not found" errors on images:

<base href="http://www.bankasya.com.tr/">

Alternatively, you can rewrite the resource URLs to absolute paths.
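Server-side, the same trick can be applied by injecting the `<base>` tag into the fetched markup before echoing it, so the visitor's browser resolves all relative URLs against the remote host. A hedged sketch (the helper name and regexp approach are mine):

```php
<?php
// Sketch: inject a <base> tag right after the opening <head> so relative
// image/CSS/JS URLs resolve against the remote host instead of yours.
function addBaseTag($html, $base)
{
    return preg_replace(
        '#<head([^>]*)>#i',
        '<head$1><base href="'.$base.'">',
        $html,
        1   // only touch the first <head>
    );
}

// Usage against the page from the question:
// echo addBaseTag(file_get_contents("http://www.bankasya.com.tr/index.jsp"),
//                 "http://www.bankasya.com.tr/");
```
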

But use DOMDocument

Narek
  • Not necessarily a good idea. That page is just over a MiB in size due to three (three!) flash applets. So I don't even think the base hack is a solution in this case - so much extra wasted bandwidth. – Sébastien Renauld Apr 22 '13 at 10:28
  • @SébastienRenauld yes, this is a bad solution, but it still works... P.S. Grabbing content from another site is never a good idea :) – Narek Apr 22 '13 at 10:33
  • Sometimes you don't have a choice. Side note, I make part of my living from writing parsers for people ;-) – Sébastien Renauld Apr 22 '13 at 10:45
  • Now I get how you wrote this complete answer in a few minutes :) – Narek Apr 22 '13 at 10:51