
I have a screen-scraping script in PHP that runs from the command line on a GoDaddy shared LAMP server.

The script scrapes, parses, and stores the required information in a database. The entire process takes about 1.5 seconds per page, and it needs to scrape close to 10,000 pages (and for each page, it fetches cookies from two others, for a total of 30k pages fetched with cURL).

The entire script will take about 5 hours to run. I have done some memory profiling, and memory consumption stays more or less constant throughout the run - it does not increase.

If I were to run the script overnight, would GoDaddy notice something abnormal about it? CPU consumption should not be too much but how bad would the bandwidth consumption of fetching 3 pages per 1.5 seconds for a duration of 5 hours be? Enough to raise alarms on GoDaddy's end?
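A back-of-the-envelope estimate helps put these numbers in perspective. The sketch below multiplies out the figures given in the question (10,000 pages, 3 requests per page, 1.5 seconds per page). The average page size of 50 KB is purely an assumption, since the question gives no sizes, and `estimateRun` is a hypothetical helper name.

```php
<?php
// Rough estimate only: the average page size is an assumed figure, not a measurement.
function estimateRun($pages, $requestsPerPage, $secondsPerPage, $avgPageKb)
{
    $totalRequests = $pages * $requestsPerPage;   // every page costs 3 fetches
    $totalSeconds  = $pages * $secondsPerPage;    // 1.5 s covers all fetches for one page
    $totalKb       = $totalRequests * $avgPageKb; // total download volume in KB

    return array(
        'requests'   => $totalRequests,
        'hours'      => $totalSeconds / 3600,
        'megabytes'  => $totalKb / 1024,
        'kb_per_sec' => $totalKb / $totalSeconds,
    );
}

// 50 KB per fetched page is hypothetical; substitute a measured average.
$estimate = estimateRun(10000, 3, 1.5, 50);

printf(
    "%d requests, %.1f hours, %.0f MB total, %.0f KB/s average\n",
    $estimate['requests'],
    $estimate['hours'],
    $estimate['megabytes'],
    $estimate['kb_per_sec']
);
```

With the assumed 50 KB per page, this works out to 30,000 requests over roughly 4.2 hours, about 1,465 MB in total, at an average of 100 KB/s. The per-page timing alone accounts for a little over 4 hours; the rest of the quoted 5 hours is presumably overhead not captured by the 1.5-second figure.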

If yes, I suppose I could break up the script to run through 1500 pages, and then halt for one hour and then resume. Should I do that?
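The batch-and-pause idea could be sketched roughly as follows. This is a minimal illustration, not the script from the question: `runInBatches`, the handler, and the pause callable are all hypothetical names, and the pause is injectable so the demo records the pauses instead of actually sleeping for an hour.

```php
<?php
/**
 * Process items in fixed-size batches, pausing between batches.
 * $handler does the per-page work; $pause defaults to PHP's sleep().
 * Returns the number of batches processed.
 */
function runInBatches(array $items, $batchSize, $pauseSeconds, $handler, $pause = 'sleep')
{
    $batches = array_chunk($items, $batchSize);
    $last    = count($batches) - 1;

    foreach ($batches as $i => $batch) {
        foreach ($batch as $item) {
            call_user_func($handler, $item);
        }
        // No need to pause after the final batch.
        if ($i < $last) {
            call_user_func($pause, $pauseSeconds);
        }
    }

    return count($batches);
}

// Demo with 10 stand-in page IDs in batches of 4.
$processed = array();
$pauses    = array();

$batchCount = runInBatches(
    range(1, 10),
    4,
    3600,
    function ($pageId) use (&$processed) {
        $processed[] = $pageId; // stand-in for scrape, parse, store
    },
    function ($seconds) use (&$pauses) {
        $pauses[] = $seconds;   // record the pause instead of sleeping
    }
);
```

For the scenario in the question, the call would use a batch size of 1500 and a pause of 3600 seconds, with the default `sleep` left in place. Whether such pauses matter to the host at all is exactly what the question asks, so treat this as one option rather than a recommendation.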

Ayush
  • 1
    Try it out, Godaddy will tell you. You can then make up your mind later on how to solve the concrete (not imaginary) issue. – hakre Jan 12 '12 at 10:05
  • "If I were to run the script overnight, would GoDaddy notice something abnormal about it? CPU consumption should not be too much but how bad would the bandwidth consumption of fetching 3 pages per 1.5 seconds for a duration of 5 hours be? Enough to raise alarms on GoDaddy's end?" GoDaddy doesn't have to wait for you to run it. They can see you're going to do so by reading about it here. Screen scraping is a very poor way to gather information. If you are going to run against 30K pages, don't you think it'd be better to look for an API or data source from the sites? – the Tin Man Jan 13 '12 at 10:02
  • 1
    @theTinMan I doubt they scan every question on StackOverflow on the off chance that someone might mention something. Also, I'm not worried they'd shut down my script because it is "unethical". I was worried it might consume resources that raise alarms. Anyway, I ran it last night and had no issues, so it's all good. – Ayush Jan 13 '12 at 10:41
  • 1
    PS - There is no API. I was scraping my Univ's course catalog to provide a RESTful API to others. – Ayush Jan 13 '12 at 10:41

1 Answer


For the sake of not leaving the question unanswered, I'll post the answer:

I ran the script overnight. It took about 5 hours to run and it was neither terminated by GoDaddy nor did I receive any notice, so I guess it was fine with them.

Initially I was having memory issues where the script would run out of the memory allocated to me, but apparently that was a pre-PHP 5.3 bug (more details on that here). Once fixed, it hovered at 32-34MB RAM usage the entire time. No clue about CPU consumption or bandwidth usage.
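One way to confirm that memory stays flat during a long run is to log PHP's own counters every so often. The sketch below is an assumption about how that could be done, not the profiling method actually used here; `memorySnapshot` and `logMemoryEvery` are hypothetical helper names, while `memory_get_usage()` and `memory_get_peak_usage()` are built-in PHP functions.

```php
<?php
// Return current and peak memory in MB as reported by PHP.
// Passing true asks for the real size allocated from the system.
function memorySnapshot()
{
    return array(
        'current_mb' => memory_get_usage(true) / 1048576,
        'peak_mb'    => memory_get_peak_usage(true) / 1048576,
    );
}

// Build a log line on every $every-th iteration; return null otherwise.
function logMemoryEvery($iteration, $every)
{
    if ($iteration % $every !== 0) {
        return null;
    }

    $snapshot = memorySnapshot();

    return sprintf(
        'iteration %d: %.1f MB current, %.1f MB peak',
        $iteration,
        $snapshot['current_mb'],
        $snapshot['peak_mb']
    );
}

// Inside the scraping loop (loop body is hypothetical), one might write:
// if (($line = logMemoryEvery($pageIndex, 500)) !== null) { error_log($line); }
```

If the logged figures stay in a narrow band over thousands of iterations, as the 32-34MB reported above did, the script is not leaking memory in any way that matters for a run of this length.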

Ayush
  • I would like to know why @theTinMan feels that "Screen scraping is a very poor way to gather information." Capturing existing data and using it in a new way for your specific users' benefit seems beneficial and certainly a productive way of obtaining and presenting data. Is the assumption that you are stealing and therefore the only proper course is to go through the company's legal counsel and ask for permission (if in fact no API exists)? Sure, if you're profiting from their work you might need to pay, but before then it's akin to asking permission to play a record at your party. No? – Ricalsin Jan 16 '12 at 21:45