I have a fairly basic script that scrapes a list of URLs on a website and then processes each page to extract data. I am using the following:
use Guzzle\Http\Client;
use Guzzle\Plugin\Cookie\CookiePlugin;
use Guzzle\Plugin\Cookie\CookieJar\ArrayCookieJar;
use Symfony\Component\DomCrawler\Crawler;
I'm unable to post the code because it's proprietary. The company I am working for would fire me if I did.
I have one Guzzle Client in the script. I reuse this object for the HTTP requests.
$client = new Client();
I use DomCrawler to scrape the needed data from the page. I loop through a long list of URLs, scraping the data from each one.
Around the 50th URL, the script exhausts its 32 MB memory limit. Rather than raise the limit, I'd like to find out what is actually causing this.
Is there any way to force PHP to clear the memory of a Crawler object? And how can I track the memory usage to see where the memory is being used?
Update
I decided to print out the memory usage using:
memory_get_usage(true)
... within the loop, before and after processing the data. The memory usage increases on every iteration and never goes back down.
Here's the output. Each block is a single iteration of the loop. Again, sorry, I've removed the site URLs. I'm not allowed to post them here because of the company I work for.
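Since I can't post the real loop, here is a minimal sketch of the kind of measurement described above. The function name processWithMemoryLog, the $work callback, and the dummy workload are all hypothetical stand-ins for the proprietary scrape-and-extract step. Note that memory_get_usage(true) reports memory allocated from the system, including unused pages, while memory_get_usage(false) reports only the memory actually in use, so the sketch records both.

```php
<?php
/**
 * Runs $work for each URL and prints memory usage before and after it,
 * using the same labels as the output in this post.
 * $work is a hypothetical placeholder for the real processing step.
 */
function processWithMemoryLog(array $urls, $work)
{
    $samples = array();

    foreach ($urls as $url) {
        echo "Scraped: {$url}\n";

        // Memory allocated from the system before processing.
        $before = memory_get_usage(true);
        echo "Processing page: {$before}\n";

        call_user_func($work, $url);

        // Memory allocated from the system after processing.
        $after = memory_get_usage(true);
        echo "Processed page: {$after}\n";

        $samples[] = array(
            'url'    => $url,
            'before' => $before,
            'after'  => $after,
            // Memory actually in use, without unused allocated pages.
            'in_use' => memory_get_usage(false),
            // Highest system allocation seen so far in this run.
            'peak'   => memory_get_peak_usage(true),
        );
    }

    return $samples;
}

// Dummy workload (assumption): allocate and release a 1 MiB string.
$samples = processWithMemoryLog(
    array('site.com/page', 'site.com/page'),
    function ($url) {
        $buffer = str_repeat('x', 1024 * 1024);
        unset($buffer);
    }
);
```

Comparing 'in_use' against 'after' across iterations should help show whether the growth is live data held by the script or just pages the allocator has not returned.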
Scraped: site.com/page
Processing page: 4194304
Processed page: 4980736
Scraped: site.com/page
Processing page: 4980736
Processed page: 5505024
Scraped: site.com/page
Processing page: 5505024
Processed page: 6029312
Scraped: site.com/page
Processing page: 6029312
Processed page: 6815744
Scraped: site.com/page
Processing page: 6815744
Processed page: 7340032
Scraped: site.com/page
Processing page: 7340032
Processed page: 7864320
Scraped: site.com/page
Processing page: 7864320
Processed page: 8388608
Scraped: site.com/page
Processing page: 8388608
Processed page: 9175040
Scraped: site.com/page
Processing page: 9175040
Processed page: 9699328
Scraped: site.com/page
Processing page: 9699328
Processed page: 10223616
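To put a number on the growth, here is a small sketch that works out the average increase per iteration from the output above and extrapolates to the 32 MB limit. The variable names are mine, and the extrapolation assumes the growth stays roughly linear, which may not hold for the real script.

```php
<?php
// "Processed page" values copied from the ten iterations above.
$processed = array(
    4980736, 5505024, 6029312, 6815744, 7340032,
    7864320, 8388608, 9175040, 9699328, 10223616,
);

// First "Processing page" value, i.e. usage before the first iteration.
$start = 4194304;

// The 32 MB limit mentioned in the question, in bytes.
$limit = 32 * 1024 * 1024;

// Average growth in bytes per iteration over the logged run.
$last = $processed[count($processed) - 1];
$growthPerIteration = ($last - $start) / count($processed);

// Number of iterations needed to reach the limit at that rate (assumes linear growth).
$iterationsToLimit = (int) ceil(($limit - $start) / $growthPerIteration);

printf("Average growth per iteration: %.1f KiB\n", $growthPerIteration / 1024);
printf("Iterations until the limit is reached: %d\n", $iterationsToLimit);
```

That works out to roughly 589 KiB per iteration and about 49 iterations, which is consistent with the script failing at around the 50th URL.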