
I am writing a script that will probably take half a day to run because it fetches data from about 14,000 pages on a website.

To find out whether it is making progress, is there any way to observe its execution, i.e. the outgoing connections to each of the scraped pages, from the macOS shell?

I am using curl to get the page contents, if that is of any help.

Thanks a lot! Charles

EDIT: The script is written in PHP and executed from localhost.

Dennis Hackethal
  • Hmm, this wouldn't happen to be aimed at... http://www.bandliste.de/, would it? I hope that this activity is sanctioned by the site that you're doing this to. – Jared Farrish Jun 10 '12 at 12:34
  • If you ran a website where you had a lot of information, would you be happy if someone suddenly hit it and tried to download the entire site, and you had no idea who it was or what they were up to? Or suddenly found it copied somewhere else? – Jared Farrish Jun 10 '12 at 12:42
  • Of course, your biggest issue in this endeavor is probably your choice of using PHP to do it. Anyway, [this answer](http://stackoverflow.com/a/2215188/451969) might point to something useful. – Jared Farrish Jun 10 '12 at 12:46

2 Answers


When writing custom scripts, it is very helpful to output some sort of status to stdout.

This can be done in a uniform way using printf: http://www.php.net/manual/en/function.sprintf.php

What you log to stdout depends on what information you need to see. For a curl request, I would perhaps log the URL, the response code, and maybe the start and end times. It's really up to you; just make sure you can verify its status/progress.

// Print a header row, then one aligned row per completed request.
printf("%40s | %5s\n", 'URL', 'Status Code');
printf("%40s | %5s\n", $the_url, $status_code);
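
For example, here is a minimal sketch of such a logging loop using PHP's cURL extension. The $urls array and the cURL options are illustrative assumptions, not taken from the asker's actual script:

<?php
// Hypothetical URL list standing in for the ~14,000 pages being scraped.
$urls = ['http://example.com/page1', 'http://example.com/page2'];

printf("%-40s | %-11s | %s\n", 'URL', 'Status Code', 'Seconds');

foreach ($urls as $url) {
    $start = microtime(true);

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    $body = curl_exec($ch);
    $status_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // One status line per request: URL, response code, elapsed seconds.
    printf("%-40s | %-11d | %.2f\n", $url, $status_code, microtime(true) - $start);
}

If you run the script from the command line and redirect stdout to a file (`php scrape.php > output.txt`), you can follow its progress from another Terminal window with `tail -f output.txt`.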
dm03514
  • Thank you - will this cause the script to take considerably more time to execute? – Dennis Hackethal Jun 10 '12 at 12:40
  • It will take longer, yes; how much longer is probably negligible compared to the value of the information it provides. The thing is, if you are scraping 14,000 URLs it is good to know what is going on and to have a log. Whether you save this output to a file (`> output.txt`) or check your db for links that have been completed, it is good to have a status on what went well and what failed. You might also be able to set `curl_setopt($session, CURLOPT_VERBOSE, true); // Display communication with server`, but I don't know if that information would be useful to you. – dm03514 Jun 10 '12 at 12:42

If you are running this via a web browser, output is not seen until the PHP script has finished executing. However, file_put_contents() can append data to a logfile, which you can watch while the script runs.

An example line of code would be: `file_put_contents("file name.txt", "\nWebsite abc was successfully scraped", FILE_APPEND);`. You must pass the FILE_APPEND flag, or PHP will just overwrite the file each time.
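
For instance, here is a sketch of what this could look like inside the scraping loop. The $urls list, the fetch call, and the scrape_log.txt filename are assumptions for illustration, not the asker's actual code:

<?php
// Hypothetical loop; $urls and the fetch step stand in for the real script.
$urls = ['http://example.com/a', 'http://example.com/b'];

foreach ($urls as $url) {
    $ok = @file_get_contents($url) !== false;

    // Append one line per page so progress is visible while the script runs.
    file_put_contents(
        'scrape_log.txt',
        date('H:i:s') . ($ok ? ' OK   ' : ' FAIL ') . $url . "\n",
        FILE_APPEND
    );
}

You can then watch the log grow from a Terminal window with `tail -f scrape_log.txt` while the script runs.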

php.net Reference

Scott Stevens