
I am having a problem. Here is what I have to do, and the code I have is taking extremely long to run:
There is one website I need to collect data from, and to do so my algorithm needs to visit over 15,000 subsections of it (i.e. www.website.com/item.php?rid=$_id), where $_id is the current iteration of a for loop.
Here are the problems:

  1. The method I am currently using to get the source code of each page is file_get_contents, and, as you can imagine, calling file_get_contents on 15,000+ pages takes super long.
  2. Each page contains over 900 lines of code, but all I need to extract is about 5 lines' worth, so the algorithm seems to waste a lot of time retrieving content it never uses.
  3. Some of the pages do not exist (e.g. www.website.com/item.php?rid=2 might exist while www.website.com/item.php?rid=3 does not), so I need a way to skip over these pages quickly before the algorithm tries to fetch their contents and wastes a bunch of time.

In short, I need a method of extracting a small portion of each of 15,000+ webpages as quickly and efficiently as possible.
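On point 2, if the cURL extension is available, one way to avoid downloading all 900 lines is to abort the transfer as soon as the part you need has arrived: cURL cancels a download when the write callback returns fewer bytes than it was handed. The sketch below is illustrative only (the function name, the 1 KB slack, and the byte cap are assumptions, not anything from the question):

```php
// Hedged sketch: download only a prefix of the page. Once $marker has been
// seen (plus ~1 KB of slack so the value after it and its closing tag are
// included), the write callback returns 0, which makes cURL abort.
function fetch_prefix(string $url, string $marker, int $max_bytes = 262144): string {
    $buffer = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION,
        function ($ch, $chunk) use (&$buffer, $marker, $max_bytes) {
            $buffer .= $chunk;
            $pos = stripos($buffer, $marker);
            $have_enough = ($pos !== false
                && strlen($buffer) >= $pos + strlen($marker) + 1024);
            if ($have_enough || strlen($buffer) >= $max_bytes) {
                return 0; // deliberate abort; curl_exec() will return false
            }
            return strlen($chunk);
        });
    curl_exec($ch); // false after the abort is expected and harmless here
    curl_close($ch);
    return $buffer;
}
```

The returned prefix can then be searched for the price needles exactly as before; the rest of the page is never transferred.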
Here is my current code.

for ($_id = 0; $_id < 15392; $_id++){
    //****************************************************** Locating page
    $_location = "http://www.website.com/item.php?rid=".$_id;
    $_headers = @get_headers($_location);
    if($_headers === FALSE || strpos($_headers[0],"200") === FALSE){
        continue; // page missing or unreachable
    } // end if
    $_source = file_get_contents($_location);
    if($_source === FALSE){
        continue; // fetch failed despite the 200
    }
    //****************************************************** Extracting price
    $_needle_initial = "<td align=\"center\" colspan=\"4\" style=\"font-weight: bold\">Current Price:";
    $_needle_terminal = "</td>";
    $_position_initial = stripos($_source,$_needle_initial);
    if($_position_initial === FALSE){
        continue; // price marker not on this page
    }
    $_position_initial += strlen($_needle_initial);
    // search for the closing </td> *after* the marker, not from the start of the page
    $_position_terminal = stripos($_source,$_needle_terminal,$_position_initial);
    $_length = $_position_terminal-$_position_initial;
    $_current_price = strip_tags(trim(substr($_source,$_position_initial,$_length)));
} // end for
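A side note on the loop above: get_headers() issues its own HTTP request, so every existing page is fetched twice. A single cURL request can do both jobs, checking the status code and returning the body, and reusing one handle across iterations lets the connection stay alive. A sketch, assuming the cURL extension is available (fetch_page is an illustrative name, not an existing function):

```php
// One request per page: returns the body on HTTP 200, null otherwise.
// The handle is created once by the caller and reused, so keep-alive works.
function fetch_page($ch, string $url): ?string {
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($body === false || $code !== 200) {
        return null; // missing page (e.g. 404) or transport error: skip it
    }
    return $body;
}

// Usage in the loop (network calls, so not executed here):
//   $ch = curl_init();
//   for ($_id = 0; $_id < 15392; $_id++) {
//       $source = fetch_page($ch, "http://www.website.com/item.php?rid=" . $_id);
//       if ($source === null) continue;
//       // ... extract the price from $source as before ...
//   }
//   curl_close($ch);
```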

Any help at all is greatly appreciated since I really need a solution to this!
Thank you in advance for your help!

Pilgerstorfer Franz
Alec
  • Unless you can configure the remote server to just give you those 5 lines each time, you'll need to download the whole file and extract what you need. No getting around that. You can [test for its existence](http://stackoverflow.com/questions/981954/how-can-one-check-to-see-if-a-remote-file-exists-using-php) each time to avoid having to download non-existent pages, though – Clive Jan 11 '14 at 10:31
  • are those lines located after a particular byte offset in each file? – sanjeev Jan 11 '14 at 10:35
  • you can use RollingCurl. RollingCurl allows you to process multiple HTTP requests in parallel using the cURL PHP library. [link](https://github.com/takinbo/rolling-curl) – jingyu Jan 11 '14 at 11:15
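The RollingCurl suggestion in the last comment boils down to curl_multi, which issues requests in parallel and is usually the biggest win for a crawl like this. A rough sketch, assuming the cURL extension (the function name is illustrative, and the window size of 10 is an arbitrary politeness limit, not a library requirement):

```php
// Fetch many URLs in parallel with curl_multi, in windows of $window at a
// time. $urls maps ids to URLs; the result maps each id to the response
// body (missing entries mean that fetch failed or returned nothing).
function fetch_parallel(array $urls, int $window = 10): array {
    $results = [];
    foreach (array_chunk($urls, $window, true) as $batch) {
        $mh = curl_multi_init();
        $handles = [];
        foreach ($batch as $id => $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_multi_add_handle($mh, $ch);
            $handles[$id] = $ch;
        }
        do {
            curl_multi_exec($mh, $running);
            if ($running > 0 && curl_multi_select($mh, 0.1) === -1) {
                usleep(1000); // select unavailable here; brief back-off
            }
        } while ($running > 0);
        foreach ($handles as $id => $ch) {
            $body = curl_multi_getcontent($ch);
            if ($body !== null && $body !== false && $body !== '') {
                $results[$id] = $body;
            }
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
    }
    return $results;
}
```

Each returned body can then be run through the same needle extraction as before; failed ids are simply absent from the result.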

1 Answer


The short of it: don't.

Longer: if you want to do this much work, you shouldn't do it on demand. Do it in the background! You can use the code you have here, or any other method you're comfortable with, but instead of showing the result to a user, save it in a database or a local file. Call that script with a cron job every x minutes (depending on the interval you need), and just show the latest content from your local cache (be it a database or a file).
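A minimal sketch of that idea, with the results cached in a local JSON file (the file path, the function names, and the cron interval below are all illustrative assumptions):

```php
// Background-job sketch: the crawler writes its results to a local cache
// file, and the user-facing page only ever reads that file.
function save_cache(string $path, array $prices): void {
    file_put_contents($path, json_encode($prices), LOCK_EX);
}

function load_cache(string $path): array {
    $raw = @file_get_contents($path);
    return $raw === false ? [] : (json_decode($raw, true) ?: []);
}

// In crontab, run the crawler every 30 minutes, e.g.:
//   */30 * * * * php /path/to/crawl_prices.php
// crawl_prices.php would loop over the ids, extract the prices, and call
// save_cache(); the user-facing page calls load_cache() and renders
// instantly, regardless of how long the crawl itself takes.
```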

Nanne