
I am trying to create a simple web crawler in PHP that crawls .edu domains, given the seed URLs of the parent domains.

I have used Simple HTML DOM to implement the crawler, while some of the core logic is implemented by me.

I am posting the code below and will try to explain the problems.

private function initiateChildCrawler($parent_Url_Html) {

    global $CFG;
    static $foundLink;
    static $parentID;
    static $urlToCrawl_InstanceOfChildren;

    $forEachCount = 0;
    foreach ($parent_Url_Html->getHTML()->find('a') as $foundLink)
    {
        $forEachCount++;
        if ($forEachCount < 500)
        {
            $foundLink->href = url_to_absolute($parent_Url_Html->getURL(), $foundLink->href);

            if ($this->validateEduDomain($foundLink->href))
            {
                //Implement else condition later on
                $parentID = $this->loadSaveInstance->parentExists_In_URL_DB_CRAWL($this->returnParentDomain($foundLink->href));
                if ($parentID != FALSE)
                {
                    if ($this->loadSaveInstance->checkUrlDuplication_In_URL_DB_CRAWL($foundLink->href) == FALSE)
                    {
                        $urlToCrawl_InstanceOfChildren = new urlToCrawl($foundLink->href);
                        if ($urlToCrawl_InstanceOfChildren->getSimpleDomSource($CFG->finalContext) != FALSE)
                        {
                            $this->loadSaveInstance->url_db_html($urlToCrawl_InstanceOfChildren->getURL(), $urlToCrawl_InstanceOfChildren->getHTML());
                            $this->loadSaveInstance->saveCrawled_To_URL_DB_CRAWL(NULL, $foundLink->href, "crawled", $parentID);

                            /*if($recursiveCount<1)
                            {
                                $this->initiateChildCrawler($urlToCrawl_InstanceOfChildren);
                            }*/
                        }
                    }
                }
            }
        }
    }
}

As you can see, initiateChildCrawler is called by the initiateParentCrawler function, which passes the parent link to the child crawler. An example parent link is www.berkeley.edu, for which the crawler will find all the links on its main page and return their HTML content. This continues until the seed URLs are exhausted.

For example:

1. harvard.edu: the crawler finds all the links and returns their HTML content (by calling childCrawler), then moves to the next parent in parentCrawler.
2. berkeley.edu: the crawler finds all the links and returns their HTML content (by calling childCrawler).

The other functions are self-explanatory.

Now the problem: after childCrawler completes the foreach loop for each link, the function is unable to exit properly. If I run the script from the CLI, the CLI crashes; running the script in the browser causes the script to terminate.

But if I set the limit on crawled child links to 10 or less (by altering the $forEachCount check), the crawler works fine.

Please help me in this regard.

Message from CLI:

Problem signature:
  Problem Event Name: APPCRASH
  Application Name: php-cgi.exe
  Application Version: 5.3.8.0
  Application Timestamp: 4e537939
  Fault Module Name: php5ts.dll
  Fault Module Version: 5.3.8.0
  Fault Module Timestamp: 4e537a04
  Exception Code: c0000005
  Exception Offset: 0000c793
  OS Version: 6.1.7601.2.1.0.256.48
  Locale ID: 1033
  Additional Information 1: 0a9e
  Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
  Additional Information 3: 0a9e
  Additional Information 4: 0a9e372d3b4ad19135b953a78882e789

  • Please post the error message from the CLI crash. – Johnny Graber Dec 31 '11 at 13:06
  • 3
    The problem is you are using too much memory without releasing it because your method can potentially recurse up to $forEachCount levels deep, meaning up to $forEachCount full documents in memory. You should convert this to a flat loop. Keep a master list of urls, as well as master list of "processed urls" then iterate over the unprocessed ones, loading each page once and adding links to the master list. Then you will only have one document in memory at once. Stop when the master list reaches a desired length. – Ben Lee Dec 31 '11 at 13:08
  • @Ben Lee I got your point... but can you explain the term "flat loop" to me? – Rafay Dec 31 '11 at 13:13
  • Well, what I am doing now is keeping an array of absolute URLs for which the HTML is to be retrieved. The problem now is that after retrieving and inserting about 50-60 URLs and their HTML content, I am getting the "MySQL server has gone away" error. But if I limit the number of URLs to be retrieved to a smaller number, say about 10-15, the error goes away. Please help me in this regard. – Rafay Dec 31 '11 at 17:36
  • When you change your code to use a flat list, try `sleep(1)` after parsing a single URL. Maybe your DB server is weak and can't handle many queries at once. Additionally, you could `echo` some debug info in various functions to see memory usage, the number of parsed URLs, etc. – piotrekkr Jan 01 '12 at 22:58

1 Answer


Flat Loop Example:

  1. You initiate the loop with a stack that contains all URLs you'd like to process first.
  2. Inside the loop:
    1. You shift the first URL (you obtain it and it's removed) from the stack.
    2. If you find new URLs, you add them at the end of the stack (push).

This will run until all URLs from the stack are processed, so you add a counter (as you already have with the foreach) to prevent it from running for too long:

$URLStack = (array) $parent_Url_Html->getHTML()->find('a');
$URLProcessedCount = 0;
while ($URLProcessedCount++ < 500) # this could run endlessly, so this saves us from processing too many URLs
{
    $url = array_shift($URLStack);
    if (!$url) break; # exit if the stack is empty

    # process URL

    # for each new URL:
    $URLStack[] = $newURL;
}

You can make this even more intelligent by not adding URLs to the stack that already exist in it; however, you then need to insert only absolute URLs into the stack. I highly suggest you do that, because there is no need to process a page you've already obtained again (e.g. each page probably contains a link to the homepage). If you want to do this, just increment $URLProcessedCount inside the loop so you keep previous entries as well, and check each new URL against what is already on the stack (see the sketch after the snippet below):

while ($URLProcessedCount < 500) # this could run endlessly, so this saves us from processing too many URLs
{
    $url = $URLStack[$URLProcessedCount++];
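
To make that concrete, here is a minimal sketch of the deduplicated flat loop under some assumptions: fetch_page() and extract_absolute_links() are hypothetical stand-ins for whatever you already use to download a page and resolve its links (e.g. your urlToCrawl class and url_to_absolute()), and the seed URL is only an example.

$URLStack          = array('http://www.berkeley.edu/'); // seed URL(s), example only
$URLProcessedCount = 0;

while ($URLProcessedCount < 500)                // hard limit so the loop cannot run forever
{
    if (!isset($URLStack[$URLProcessedCount])) {
        break;                                  // stack exhausted: every known URL has been processed
    }
    $url = $URLStack[$URLProcessedCount++];

    $html = fetch_page($url);                   // hypothetical fetcher, returns HTML or FALSE
    if ($html === FALSE) {
        continue;                               // skip pages that could not be loaded
    }

    // store/process the page here, then queue any new absolute links
    foreach (extract_absolute_links($html, $url) as $newURL) {
        if (!in_array($newURL, $URLStack, true)) {   // only queue URLs not already on the stack
            $URLStack[] = $newURL;
        }
    }
}

Using the URLs as array keys (e.g. isset($seen[$newURL])) would make the duplicate check cheaper than in_array() once the stack grows, but the idea is the same.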

Additionally, I suggest you use the PHP DOMDocument extension instead of Simple HTML DOM, as it is a much more versatile tool.
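
For illustration, here is a minimal sketch of link extraction with DOMDocument, assuming the HTML has already been fetched into $html and $baseUrl holds the page's URL (url_to_absolute() is the same helper used in the question):

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely well-formed; collect parse warnings quietly
$dom->loadHTML($html);
libxml_clear_errors();

$links = array();
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $href = trim($anchor->getAttribute('href'));
    if ($href !== '') {
        $links[] = url_to_absolute($baseUrl, $href);   // resolve relative links against the page URL
    }
}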

  • Well, what I am doing now is keeping an array of absolute URLs for which the HTML is to be retrieved. The problem now is that after retrieving and inserting about 50-60 URLs and their HTML content, I am getting the "MySQL server has gone away" error. But if I limit the number of URLs to be retrieved to a smaller number, say about 10-15, the error goes away. Please help me in this regard. – Rafay Dec 31 '11 at 17:36
  • Your database refuses to keep working with your script and has dropped the connection. That can have various reasons; you should contact your database administrator to find out exactly what is going on. You have probably hit some limit with your database provider. – hakre Dec 31 '11 at 17:56
  • I am the database administrator myself. The code and the database are deployed on my own PC, and I don't know what to do to make things right here. The irony is that the database works fine with a smaller number of links (about 40-50) but returns the error with greater numbers. If the script were the problem, the error would have appeared on the very first link. – Rafay Dec 31 '11 at 18:01
  • If you're the database admin, enable DB logging and check what's going on. Probably something invalid is happening that can be solved by re-configuring MySQL (or whichever database you're using). – hakre Dec 31 '11 at 18:06
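
As a concrete starting point for that, here is a small hedged sketch of the kind of check being suggested; it assumes a mysqli connection ($mysqli), which the question does not show, and only logs the two server settings that most commonly cause "MySQL server has gone away" when large HTML documents are inserted:

// Hypothetical helper: verify the connection is alive and log the relevant limits.
function inspectDbConnection(mysqli $mysqli) {
    if (!$mysqli->ping()) {                       // connection was dropped by the server
        error_log('MySQL connection lost: ' . $mysqli->error);
    }
    // max_allowed_packet limits the size of a single INSERT (a full HTML page can exceed it);
    // wait_timeout drops connections that sit idle between slow crawl steps.
    $result = $mysqli->query("SHOW VARIABLES WHERE Variable_name IN ('max_allowed_packet', 'wait_timeout')");
    while ($row = $result->fetch_assoc()) {
        error_log($row['Variable_name'] . ' = ' . $row['Value']);
    }
}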