
I have implemented a web crawler that crawls .edu domains and retrieves page content. The HTML source of each page is inserted into MySQL tables. When a large number of seed URLs are fed to the crawler, the script can run for hours on a decent internet connection. My problem is that the script halts after crawling a number of links without reporting any errors. I have used exception handling to deal with the "MySQL server has gone away" error, which has already eliminated a lot of problems, and I have added if conditions that echo errors when they are encountered. Still, I am not getting any errors; the script simply halts, whether I run it in the browser, in Eclipse PDT, or from the CLI. It is worth noting that the number of links crawled before the halt differs somewhat between the three methods of running the script. I have altered max_execution_time and other php.ini directives, but this has not helped in any way.
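For reference, the MySQL handling is roughly along the following lines. This is only a simplified sketch of the approach described above (shown with plain error checks rather than the actual exception handling); the connection details and the `pages` table are placeholders, not the real crawler code.

// Rough sketch of the "MySQL server has gone away" handling described above.
// Credentials and the `pages` table are placeholders, not the real schema.
function insert_page($url, $html)
{
    global $db; // mysqli connection created elsewhere

    $stmt = $db->prepare('INSERT INTO pages (url, source) VALUES (?, ?)');
    if ($stmt === false && $db->errno == 2006) {
        // Error 2006 = "MySQL server has gone away": reconnect once and retry.
        $db = new mysqli('localhost', 'user', 'pass', 'crawler');
        $stmt = $db->prepare('INSERT INTO pages (url, source) VALUES (?, ?)');
    }
    if ($stmt === false) {
        echo 'MySQL error: ' . $db->error . "\n"; // echo errors, as described above
        return false;
    }

    $stmt->bind_param('ss', $url, $html);
    $ok = $stmt->execute();
    if (!$ok) {
        echo 'MySQL error: ' . $stmt->error . "\n";
    }
    $stmt->close();
    return $ok;
}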

I have coded the script so that it resumes crawling from where it halted, but I want it to keep running without halting so that I don't have to keep checking whether it is still running.

Should I make changes to my Apache httpd.conf file? If so, which settings should I change?

The description of my web crawler in these links may help.

This is the code that retrieves the HTML from a URL; it is from simple_html_dom.

function file_get_html($url, $use_include_path = false, $context = null, $offset = -1, $maxLen = -1, $lowercase = true, $forceTagsClosed = true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN = true, $defaultBRText = DEFAULT_BR_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $defaultBRText);
    // For SourceForge users: uncomment the next line and comment out the retrieve_url_contents line two lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    // $contents = retrieve_url_contents($url);
    if (empty($contents))
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}
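As suggested in the comments below, one alternative to file_get_contents() is to fetch the page with cURL, which gives explicit control over timeouts and the proxy. The following is only a rough sketch of such a fetcher, not part of simple_html_dom; the function name and timeout values are illustrative, and the proxy address mirrors the stream context quoted in the comments.

// Illustrative cURL-based fetcher (not part of simple_html_dom). The proxy
// address mirrors the stream context quoted in the comments; the timeouts
// are placeholder values.
function fetch_url_contents($url, $proxy = '10.6.14.6:8080')
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_CONNECTTIMEOUT => 10,     // give up connecting after 10 seconds
        CURLOPT_TIMEOUT        => 30,     // give up on the whole transfer after 30 seconds
        CURLOPT_PROXY          => $proxy,
    ));
    $contents = curl_exec($ch);
    if ($contents === false) {
        error_log('cURL error for ' . $url . ': ' . curl_error($ch));
    }
    curl_close($ch);
    return $contents;
}

// Usage: replace the file_get_contents() call above with
// $contents = fetch_url_contents($url);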

Here is the error log for the following links:

And the crawler stopped after crawling this link:

[01-Jan-2012 22:54:39] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:54:39] PHP Warning: file_get_contents(http://lms.nust.edu.pk) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:54:41] PHP Warning: file_get_contents(http://www.nust.edu.pk/#) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

... (same error repeated twice) ...

[01-Jan-2012 22:55:58] PHP Warning: file_get_contents(http://www.nust.edu.pk/usr/oricdic.aspx#ipo) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:55:58] PHP Warning: file_get_contents(http://www.nust.edu.pk/usr/oricdic.aspx#tto) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:55:59] PHP Warning: file_get_contents(http://www.nust.edu.pk/usr/oricdic.aspx#ilo) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:55:59] PHP Warning: file_get_contents(http://www.nust.edu.pk/usr/oricdic.aspx#mco) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:56:05] PHP Warning: file_get_contents(http://www.nust.edu.pk/#) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

... (same error repeated 18 times) ...

[01-Jan-2012 22:57:33] PHP Warning: file_get_contents(http://www.nust.edu.pk/#ctl00_SiteMapPath1_SkipLink) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:57:33] PHP Notice: Undefined variable: parts in D:\wamp\www\crawler1\AbsoluteUrl\url_to_absolute.php on line 330

[01-Jan-2012 22:57:55] PHP Warning: file_get_contents(http://www.harvard.edu/#skip) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:58:21] PHP Warning: file_get_contents(http://www.harvard.edu/admissions-aid#undergrad) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:58:22] PHP Warning: file_get_contents(http://www.harvard.edu/admissions-aid#grad) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:58:24] PHP Warning: file_get_contents(http://www.harvard.edu/admissions-aid#continue) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:58:25] PHP Warning: file_get_contents(http://www.harvard.edu/admissions-aid#summer) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:00:04] PHP Warning: file_get_contents(http://www.harvard.edu/#) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

... (same error repeated 1 time) ...

[01-Jan-2012 23:00:11] PHP Notice: Undefined variable: parts in D:\wamp\www\crawler1\AbsoluteUrl\url_to_absolute.php on line 330

[01-Jan-2012 23:00:41] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:00:41] PHP Warning: file_get_contents(http://directory.berkeley.edu) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:00:47] PHP Notice: Undefined variable: parts in D:\wamp\www\crawler1\AbsoluteUrl\url_to_absolute.php on line 330

[01-Jan-2012 23:01:53] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:01:53] PHP Warning: file_get_contents(http://students.berkeley.edu/uga/) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:01:57] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:01:57] PHP Warning: file_get_contents(http://publicservice.berkeley.edu/) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:00] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:00] PHP Warning: file_get_contents(http://students.berkeley.edu/osl/leadprogs.asp) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:17] PHP Notice: Undefined variable: parts in D:\wamp\www\crawler1\AbsoluteUrl\url_to_absolute.php on line 330

[01-Jan-2012 23:02:25] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:25] PHP Warning: file_get_contents(http://bearfacts.berkeley.edu/bearfacts) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:28] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:28] PHP Warning: file_get_contents(http://career.berkeley.edu/) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

And this is the crash report from php-cgi.exe:

Problem signature:
  Problem Event Name:   APPCRASH
  Application Name: php-cgi.exe
  Application Version:  5.3.8.0
  Application Timestamp:    4e537939
  Fault Module Name:    php5ts.dll
  Fault Module Version: 5.3.8.0
  Fault Module Timestamp:   4e537a04
  Exception Code:   c0000005
  Exception Offset: 0000c793
  OS Version:   6.1.7601.2.1.0.256.48
  Locale ID:    1033
  Additional Information 1: 0a9e
  Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
  Additional Information 3: 0a9e
  Additional Information 4: 0a9e372d3b4ad19135b953a78882e789

Please help me in this regard.

Rafay
  • Did you set `error_reporting()` and `display_errors`? Did you change recursion to a flat list in your code? – piotrekkr Jan 01 '12 at 22:59
  • Yes, I did change it. And I have made sure that the MySQL server errors are eliminated. – Rafay Jan 01 '12 at 23:01
  • Don't use file_get_contents(); use curl to get webpage contents, because curl is better suited to things like that. Insert `error_log('what is going on')` calls inside your code and try again to see exactly where the script crashes. You could also dump memory usage into the error log. – piotrekkr Jan 01 '12 at 23:22
  • @piotrekkr How am I supposed to change the code at this point to use curl instead of simple_html_dom? – Rafay Jan 01 '12 at 23:29
  • @piotrekkr Yeah, this is obviously my code, but I am comfortable with simple_html_dom and have never used curl; I would need to alter a lot of code for that. If there are no other alternatives, can you explain what I need to know about curl? And looking at the log file, can you suggest what is causing this? – Rafay Jan 01 '12 at 23:35

1 Answer


You should check the call stack of the PHP process (if running as CGI or CLI) or of the Apache httpd process (if running as mod_php).

Then you will see in which module/procedure execution halted. You can also check the active TCP/IP connections made by your script; maybe there is some ongoing I/O operation that caused your script to halt.
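For example, a read timeout could be added to the stream context quoted in the comments below, so that file_get_contents() cannot block indefinitely on an unresponsive server or proxy. This is only a sketch; the proxy address comes from those comments and the timeout value is illustrative.

// Sketch: the stream context from the comments below, with a timeout added
// so file_get_contents() cannot hang forever (the value is illustrative).
$CFG = new stdClass();
$CFG->context = array(
    'http' => array(
        'proxy'           => '10.6.14.6:8080', // proxy as configured in the comments
        'request_fulluri' => true,
        'timeout'         => 15.0,             // seconds before the read is abandoned
    ),
);
$CFG->finalContext = stream_context_create($CFG->context);

// A global alternative is to lower default_socket_timeout:
// ini_set('default_socket_timeout', '15');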

I hope this helps.

rkosegi
  • Can you please elaborate further, especially on the TCP/IP connection part? – Rafay Jan 01 '12 at 22:49
  • This depends on which OS your server is running. Just a note: do you have this problem with all sites you try to crawl, or just specific ones? – rkosegi Jan 01 '12 at 22:54
  • Well, the trend of halting is somewhat constant, I must say: the crawling stops after a certain link, but when I refresh the script in the browser, it resumes. And there are one or more such links on nearly every site. However, no exception is thrown, otherwise I would have known which link caused it. I am still unclear whether there is a problem with the links or with the configuration. I am waiting for the log to complete and I will post it here. It would be very kind of you to check it out. – Rafay Jan 01 '12 at 23:00
  • You are connecting using an HTTP proxy? Which function are you using to contact the URL? fopen? Maybe you should set up shorter time-out values. – rkosegi Jan 01 '12 at 23:17
  • I am using simple_html_dom and have passed the context like this: `$CFG = new stdClass(); $CFG->context = array ( 'http'=>array ( 'proxy'=>'10.6.14.6:8080', 'request_fulluri'=>true, ), ); $CFG->finalContext = stream_context_create($CFG->context);` – Rafay Jan 01 '12 at 23:23
  • Whatever you are suggesting, can you please guide me a bit, keeping in mind that I am using simple_html_dom? – Rafay Jan 01 '12 at 23:29
  • I think you should check the documentation of simple_html_dom, especially about "context"; there should be a setting like a time-out. Try setting it to one second or something like that. By the way, did you try to debug it? You can download a Zend Studio trial (if you don't have it) and debug it there; then you can profile your code and find the bottleneck. I hope this helps you. – rkosegi Jan 01 '12 at 23:33
  • Yes, I have tried to debug it. But if the code gets stuck at the 500th link in a foreach loop and doesn't throw any exception, how would I do that? I am using Eclipse PDT. – Rafay Jan 01 '12 at 23:36
  • Please check the simple_html_dom function; I have included it in the updated question. – Rafay Jan 01 '12 at 23:44