There is a publicly available yellow-pages-style website that I would like to download content from. Every one of the site's roughly 20,000 entries is indexed by a number appended to the URL, e.g. foo.com/20 would fetch the entry with ID 20, and so on.
The contents of each page are wrapped in some divs. The problem I am facing is how to iterate through all the numbers and fetch the contents from the site.
I am sure the first step would be something like this:
for ($i = 1; $i <= 20000; $i++) {
    $html = file_get_contents('http://foo.com/' . $i);
    if ($html === false) {
        continue; // skip IDs that fail to load
    }
    if (preg_match("/<div class='title'>(.*?)<\/div>/s", $html, $matches)) {
        $title = $matches[1];
        // ... do something with $title ...
    }
}
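Would swapping file_get_contents for cURL with a timeout make this more robust? A sketch of what I mean (the 10-second timeout is just a guess on my part):

$ch = curl_init('http://foo.com/' . $i);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 10); // assumed cap so one slow entry cannot hang the loop
$html = curl_exec($ch);
curl_close($ch);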
The problem that I am having is with the code itself. I need a suitable regex pattern to read the HTML and find the contents inside, say, div tags: <div class='title'> </div>
Now, I know that regex is not effective for parsing HTML, so I tried the simple_html_dom library, but that was not very effective either. So I would like to know if there are any alternatives.
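Would something built on PHP's native DOMDocument and DOMXPath be a reasonable alternative? A minimal sketch of what I have in mind, assuming the title really sits in <div class='title'> (the XPath would need adjusting to the actual markup):

$dom = new DOMDocument();
libxml_use_internal_errors(true); // sloppy real-world HTML triggers warnings otherwise
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//div[@class='title']");
if ($nodes->length > 0) {
    $title = trim($nodes->item(0)->textContent);
}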
The second problem is that, while iterating through all those URIs, my computer crashes often. I suppose this is due to hitting a memory cap, so I would like to know if there is a way in PHP to iterate in batches: do some of the work, sleep/wait until things stabilize, and then move on to the next batch.
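Something like the following is what I am imagining; the batch size and sleep duration here are just guesses:

// Process the IDs in batches, freeing memory and pausing between batches.
$batchSize = 100; // assumed batch size
for ($i = 1; $i <= 20000; $i++) {
    $html = file_get_contents('http://foo.com/' . $i);
    if ($html !== false) {
        // ... parse and store the entry ...
    }
    unset($html); // release the fetched page's memory immediately
    if ($i % $batchSize === 0) {
        gc_collect_cycles(); // reclaim any cyclic garbage between batches
        sleep(5); // assumed pause so memory (and the remote server) can recover
    }
}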
I am open to any ideas on how to fetch content like this in general.