Fetching large chunks of data from online content.

Question

There is a public yellow-pages website that I would like to download content from. Each of the site's 20,000 entries is indexed by a number after the URL, e.g. foo.com/22 would fetch the entry with ID 22, and so on.

The contents of each page are wrapped in some divs. The problem I am facing is how to iterate through all the numbers and fetch the contents from the site.

I am sure the first step would be like this.

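for ($i = 0; $i < 20000; $i++) {
    // file_get_contents() returns false when the request fails
    $get = file_get_contents('http://foo.com/' . $i);
    if ($get === false) {
        continue;
    }
    // Placeholder pattern; preg_match() returns 1 on a match and fills $matches
    if (preg_match('[^title$.]', $get, $matches)) {
        $title = $matches[0];
        // $title/...
    }
}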

The problem that I am having is with the code itself. I need to find a suitable regex pattern to read the HTML and find contents inside say ... div tags: <div class='title'> </div>

Now I know that regex is not effective for parsing HTML, so I tried a library called simple_html_dom, which was not very effective either. So I would like to know if there are any alternatives.

The second problem is that my computer often crashes while iterating through all those URIs. I suppose this is due to a memory cap. So I would like to know if there is a way in PHP to iterate in cycles: do some task, sleep/wait until things stabilize, and then do the next task.

I am open to any ideas on how to fetch similar content online in general.

Look at DOMDocument and XPath - http://stackoverflow.com/questions/21552254/pull-recent-news-items-from-external-site-with-no-rss-feed-preg-match/21552781#21552781 — Jake N, May 30 '14 at 18:37
But `simple_html_dom` is also similar to DOMDocument and XPath, and it didn't do me any good. — An_roid, May 30 '14 at 18:43
You will run into problems when trying to run all these jobs in a single script/loop. You need a more complex setup. I suggest you separate the jobs _now_, while it is still easy: execute a single job in a single script, which is executed by some controller instance. That controller instance must use some means of persistent storage, a file or a single database, to track progress. That way you can always pick up work where you stopped, for whatever reason. — arkascha, May 30 '14 at 18:43
Use `DOMDocument` or `simple html dom parser`. Regex is out of the question, unless you know the page _isn't going to change_ and what you want is a tiny bit of information, in which case it avoids the performance hit these parsers cause. — , May 30 '14 at 18:44
The page structure does not change. It is like Stack Overflow: under `stackoverflow.com/questions/` only the content differs, while the divs and the HTML/CSS structure stay the same. So regex might be a good option. — An_roid, May 30 '14 at 18:48
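
The two suggestions in the comments above (DOMDocument with XPath, and small jobs that track progress in persistent storage) could be combined roughly as in the sketch below. It is only an illustration under stated assumptions: the function names `extractTitle` and `runBatch`, the progress file, and the batch size are hypothetical, and the `title` class is taken from the question's example markup rather than from the real site.

```php
<?php
// Sketch only: function names, the progress file, and the 'title' class are assumptions.

/** Return the text of the first <div> whose class list contains "title", or null. */
function extractTitle(string $html): ?string
{
    if ($html === '') {
        return null; // nothing to parse
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' title ')]"
    );
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

/**
 * Process one batch of IDs, starting at the ID stored in $progressFile (0 if the
 * file is missing or empty). The next ID is written back after every entry, so an
 * interrupted run resumes where it stopped. $fetch is any callable that takes an
 * ID and returns the page HTML, or false on failure.
 */
function runBatch(callable $fetch, string $progressFile, int $lastId, int $batchSize): array
{
    $start = is_file($progressFile) ? (int) file_get_contents($progressFile) : 0;
    $end = min($start + $batchSize - 1, $lastId);
    $titles = [];
    for ($id = $start; $id <= $end; $id++) {
        $html = $fetch($id);
        if ($html !== false) {
            $titles[$id] = extractTitle($html);
        }
        file_put_contents($progressFile, (string) ($id + 1));
    }
    return $titles;
}
```

A controller (a cron job, or a small loop that calls `sleep()` between batches) would then call `runBatch()` repeatedly with a fetcher such as `fn($id) => @file_get_contents('http://foo.com/' . $id)` until the stored ID passes the last entry. Because each call handles only a few pages and returns, memory use stays bounded, and a crash loses at most the entry that was in flight.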
