
I want to create a universal website crawler using PHP.

Using my web application, a user will input any URL, specify what they need to get from the given site, and click the Start button.

Then my web application will begin to get data from the source website.

I am loading the page in an iframe and, using jQuery, I get the class and tag names of the specific area the user selects.

But when I load an external website like eBay or Amazon, it does not work, as these sites are restricted. Is there any way to resolve this issue so that I can load any site in an iframe? Or is there an alternative way to achieve what I want?

I am actually inspired by Mozenda, software developed in .NET: http://www.mozenda.com/video01-overview/.

They load a site in a browser control and it's almost the same thing.

Mike Beeler
Adnan
  • @mc10 Thank you. That's a very relevant link if this user continues to try to crawl this site via client-side functionality, but it doesn't really answer their core question: building a website crawler. So, I don't really think it's a duplicate for that reason. – Homer6 Jan 02 '14 at 06:25

3 Answers


You can't crawl a site on the client-side if the target website is returning the "X-Frame-Options: SAMEORIGIN" response header (see @mc10's duplicate link in the question comments). You must crawl the target site using server-side functionality.
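If you want to confirm that for a particular URL, one quick way (an illustration on my part, not part of the original answer; get_headers is a standard PHP function and the eBay URL is just an example) is to look at the response headers:

    <?php
    //illustrative check: fetch the response headers for a URL and look for
    //X-Frame-Options. Header-name casing can vary between servers, so the
    //comparison here is case-insensitive.
    $headers = get_headers('http://www.ebay.com');
    foreach ($headers as $header) {
        if (stripos($header, 'X-Frame-Options:') === 0) {
            echo "Framing restricted: " . $header . "\n";
        }
    }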

The following solution might be suitable if wget has all of the options that you need. wget -r will recursively crawl a site and download the documents. It has many useful options, like translating absolute embedded URLs into relative, local ones.

Note: wget must be installed on your system for this to work. I don't know which operating system you're running this on, but on Ubuntu you can install it with sudo apt-get install wget.

See: wget --help for additional options.

<?php

    $website_url = $_GET['user_input_url'];

    //doesn't work for ipv6 addresses
    //http://php.net/manual/en/function.filter-var.php
    if( filter_var($website_url, FILTER_VALIDATE_URL) !== false ){

        //note: PHP concatenates strings with ".", not "+"
        $command = "wget -r " . escapeshellarg( $website_url );
        system( $command );

        //iterate through downloaded files and folders

    }else{
        //handle invalid url        

    }
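Once the mirror has been downloaded, a minimal sketch for that "iterate through downloaded files" step might look like the following (my own addition, assuming wget created a directory named after the host in the current working directory):

    <?php
    //sketch only: walk the directory wget created (named after the host,
    //e.g. "www.example.com") and load each downloaded HTML document
    $mirror_dir = parse_url($website_url, PHP_URL_HOST);

    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($mirror_dir, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($iterator as $file) {
        if (strtolower($file->getExtension()) === 'html') {
            $html = file_get_contents($file->getPathname());
            //...parse $html for the elements the user asked for
        }
    }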
Homer6

Take a look at the file_get_contents function in PHP.

You may have better success retrieving the HTML for a given site like this:

$html = file_get_contents('http://www.ebay.com');
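Building on that, here is a rough sketch (my own illustration, not part of the original answer; the class name 'price' is only a placeholder for whatever class the user selected) of how you could then pull specific elements out of the fetched HTML with DOMDocument and DOMXPath:

    <?php
    //sketch: fetch the page server-side, then extract elements whose class
    //attribute contains a user-supplied class name ("price" is a placeholder)
    $html = file_get_contents('http://www.ebay.com');

    $doc = new DOMDocument();
    @$doc->loadHTML($html);   //suppress warnings from malformed markup
    $xpath = new DOMXPath($doc);

    foreach ($xpath->query("//*[contains(@class, 'price')]") as $node) {
        echo trim($node->textContent) . "\n";
    }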
duellsy

You can sub in whatever element you're looking for in the second foreach loop of the following script. As it is, the script will crawl up to 100 domains starting from CNN's homepage, check each one for the Google Analytics script, and write the domains where it was found to a text file named "cnnLinks.txt" in the same folder as this file.

Just change the $pre, $base, and $post variables to whatever URL you want to crawl! I separated them like that to switch between common websites faster.

<?php
    set_time_limit(0);
    $pre = "http://www.";
    $base = "cnn";
    $post = ".com";
    $domain = $pre.$base.$post;
    $content = "google-analytics.com/ga.js";
    $content_tag = "script";
    $output_file = "cnnLinks.txt";
    $max_urls_to_check = 100;
    $rounds = 0;
    $domain_stack = array();
    $max_size_domain_stack = 1000;
    $checked_domains = array();
    while ($domain != "" && $rounds < $max_urls_to_check) {
        $doc = new DOMDocument();
        @$doc->loadHTMLFile($domain);

        //check whether any <script> tag on this page contains the target string
        $found = false;
        foreach($doc->getElementsByTagName($content_tag) as $tag) {
            if (strpos($tag->nodeValue, $content) !== false) {
                $found = true;
                break;
            }
        }
        $checked_domains[$domain] = $found;

        //queue every external domain linked from this page that hasn't been seen yet
        foreach($doc->getElementsByTagName('a') as $link) {
            $href = $link->getAttribute('href');
            if (strpos($href, 'http://') !== false && strpos($href, $domain) === false) {
                $href_array = explode("/", $href);
                if (count($domain_stack) < $max_size_domain_stack &&
                    !isset($checked_domains["http://".$href_array[2]])) {
                    array_push($domain_stack, "http://".$href_array[2]);
                }
            }
        }

        //take the next domain off the front of the stack (stop when it's empty)
        $domain_stack = array_unique($domain_stack);
        $domain = isset($domain_stack[0]) ? $domain_stack[0] : "";
        unset($domain_stack[0]);
        $domain_stack = array_values($domain_stack);
        $rounds++;
    }

    //write out every domain on which the target content was found
    $found_domains = "";
    foreach ($checked_domains as $key => $value) {
        if ($value) {
            $found_domains .= $key."\n";
        }
    }
    file_put_contents($output_file, $found_domains);
?>
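For instance (my own illustration, not from the original script), to gather image URLs instead of outbound links, the second foreach inside the while loop could be replaced with something like this (with $found_images initialised to an empty array before the loop):

    //hypothetical substitution: gather image sources instead of outbound links
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $found_images[] = $src;
        }
    }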
les
  • i just want something more before crawling: get information from the user. Example: the user inputs a site URL and the site loads in an iframe; the user selects specific areas like the price and title of a product, so I will get the class name and tag name of that specified selection, save them in a file, and use that information to crawl the whole site... – Adnan Jan 02 '14 at 07:23