
I have a pretty basic script that scrapes a website for a list of URLs and then does some processing on each of those URLs to extract the data. I am using the following:

use Guzzle\Http\Client;
use Guzzle\Plugin\Cookie\CookiePlugin;
use Guzzle\Plugin\Cookie\CookieJar\ArrayCookieJar;
use Symfony\Component\DomCrawler\Crawler;

I'm unable to post the code because it's proprietary. The company I am working for would fire me if I did.

I have one Guzzle Client in the script. I reuse this object for the HTTP requests.

$client = new Client();  

I use DomCrawler to scrape the needed data from each page. I loop through a long list of URLs, scraping the data from each one.
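
Since I can't post the real code, here's a rough sketch of the shape of the loop; the URL list, the CSS selector, and the process() call are placeholders for the proprietary parts:

$client = new Client();
$client->addSubscriber(new CookiePlugin(new ArrayCookieJar()));

// Placeholder list; the real one is built from the scraped site
$urls = array('http://site.com/page-1', 'http://site.com/page-2');

foreach ($urls as $url) {
    // Reuse the single Guzzle client for every request
    $html = $client->get($url)->send()->getBody(true);

    // Build a Crawler for this page and pull out the needed data
    $crawler = new Crawler($html);
    $data = $crawler->filter('.some-selector')->text(); // placeholder selector

    process($data); // stands in for the proprietary processing step
}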

I get to about the 50th URL and the script exhausts its 32MB memory limit. Rather than increase the memory limit, I'd like to actually find out what's causing this.

Is there any way to force PHP to clear the memory of a Crawler object? And how can I track the memory usage to see where the memory is being used?
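
For example, is something along these lines at the end of each iteration the right way to release the Crawler, or is there a better approach? (Sketch only; the variable names are the placeholders from above.)

// At the end of each loop iteration
$crawler->clear();              // drop all nodes held by the Crawler
unset($crawler, $html, $data);  // release the references themselves
gc_collect_cycles();            // force collection of any circular references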

Update

I decided to print out the memory usage using:

memory_get_usage(true)

... within the loop, before and after processing the data. The memory usage seems to increment constantly and never goes down.
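
The check inside the loop looks roughly like this (again with placeholder names):

echo 'Scraped: ', $url, PHP_EOL;
echo 'Processing page: ', memory_get_usage(true), PHP_EOL;
process($data);
echo 'Processed page: ', memory_get_usage(true), PHP_EOL;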

Here's the output. Each block is a single iteration of the loop. Again, sorry, I've removed the site URLs; I'm not allowed to post them here due to the company I work for.

Scraped: site.com/page
Processing page: 4194304
Processed page: 4980736

Scraped: site.com/page
Processing page: 4980736
Processed page: 5505024

Scraped: site.com/page
Processing page: 5505024
Processed page: 6029312

Scraped: site.com/page
Processing page: 6029312
Processed page: 6815744

Scraped: site.com/page
Processing page: 6815744
Processed page: 7340032

Scraped: site.com/page
Processing page: 7340032
Processed page: 7864320

Scraped: site.com/page
Processing page: 7864320
Processed page: 8388608

Scraped: site.com/page
Processing page: 8388608
Processed page: 9175040

Scraped: site.com/page
Processing page: 9175040
Processed page: 9699328

Scraped: site.com/page
Processing page: 9699328
Processed page: 10223616
  • Take a look at this answer: http://stackoverflow.com/questions/880458/php-memory-profiling – Gianpaolo Di Nino Jul 08 '13 at 19:17
  • Make sure you aren't holding on to resources by keeping references to them. You can also try running gc_collect_cycles() in your loop to force garbage collection of circular references. – Michael Dowling Jul 09 '13 at 16:37

1 Answer


Increase the memory limit in your php.ini file.

search for

; Maximum amount of memory a script may consume (128MB)
; http://php.net/memory-limit
memory_limit = 128M

in your php.ini file and increase it to 512M.
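
If you'd rather not change php.ini globally, you can raise the limit for just this one script, e.g. at the top of the script:

ini_set('memory_limit', '512M');

or when invoking it from the command line (scraper.php is just a placeholder for your script's filename):

php -d memory_limit=512M scraper.php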

  • I'm aware of this, but I'm not sure a simple web scraper should be exhausting 32MB. – James Jeffery Jul 08 '13 at 19:23
  • I'm not sure how to track the exact memory usage. If you're concerned that insufficient memory might not be the cause of this problem, check to see if there's an unclosed loop anywhere. – Lance Jul 08 '13 at 19:25
  • The only loop in the code is the one that loops through the list of URLs. There can be any number of URLs to scrape, but the list I'm using at the moment has 148 lines. It exhausts at the 50th line. – James Jeffery Jul 08 '13 at 19:26
  • What HTML parser are you using? I've run into a similar problem before using the Simple HTML DOM parser. Check out this question for more help: http://stackoverflow.com/questions/16627637/fatal-error-allowed-memory-size-of-33554432-bytes-exhausted – Lance Jul 08 '13 at 19:28
  • I'm using Symfony's DomCrawler. – James Jeffery Jul 08 '13 at 19:49