
I'm data-mining millions of old log entries for someone and really want to use PHP for this, so I can present the materials and link them easily to the existing PHP system.

I run this code in PHP 5.4.4 in the Terminal (OS X 10.8):

// Settings
ini_set('error_reporting', E_ALL); // Show all feedback from the parser for debugging
ini_set('max_execution_time', 0); // Remove the default 30-second execution limit
ini_set('memory_limit', '512M'); // Set the memory limit to 512 megabytes


echo 'Start memory usage: '.(memory_get_usage(TRUE) / 1024)."\n";

$x = array();
for ($i = 0; $i < 1e7; $i++) {
    $x[$i] = 1 * rand(0, 10);
    //unset($x[$i]);
}

echo 'End memory usage: '.(memory_get_usage(TRUE) / 1024)."\n";
echo 'Peak memory usage: '.(memory_get_peak_usage(TRUE) / 1024)."\n";

This is a simple test with ten million cycles. The memory overhead is really bad compared to using dictionaries in Python :(

When I uncomment the unset() call to test the usage, it's instantly all unicorns and rainbows. So forcing the release of the memory seems to work well.

Is there any way I can still maintain 10-50 million array entries within that 512M memory limit?

I can't imagine what will happen when I run some regex in these kinds of loops either.
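
A minimal sketch of the streaming fallback I could use instead (one line in memory at a time, so nothing accumulates; the file name is hypothetical):

// Hypothetical streaming variant: only one log line is held in memory at a time.
$fh = fopen('old_entries.log', 'r'); // hypothetical file name
$count = 0;
while (($line = fgets($fh)) !== false) {
    // ... run the per-line regex / tallying here ...
    $count++;
}
fclose($fh);
echo "Processed $count lines\n";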

  • If by log entries you mean files, then stop here and read the grep/zgrep man page. –  Feb 14 '13 at 17:45
  • *"Is there any way I can still maintain 10-50 million array entries within that 512M memory limit?"* Yeah, we call those databases. – Waleed Khan Feb 14 '13 at 17:52
  • Do you actually require holding all X million rows in the array at once? Do you store every single bit of data, or only what you actually need? Also, +1 to @WaleedKhan's comment (see the sketch after these comments). – Sammitch Feb 14 '13 at 18:36
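
As a hedged illustration of the "use a database" suggestion (the file path and table name are hypothetical), entries can be batch-inserted into SQLite via PDO instead of being held in a PHP array:

// Hypothetical sketch: store the entries in SQLite via PDO instead of in a PHP array.
$db = new PDO('sqlite:/tmp/logmine.db'); // hypothetical file path
$db->exec('CREATE TABLE IF NOT EXISTS entries (id INTEGER PRIMARY KEY, val INTEGER)');
$db->beginTransaction(); // a single transaction keeps bulk inserts fast
$stmt = $db->prepare('INSERT INTO entries (id, val) VALUES (?, ?)');
for ($i = 0; $i < 1e6; $i++) {
    $stmt->execute(array($i, rand(0, 10)));
}
$db->commit();
echo $db->query('SELECT COUNT(*) FROM entries')->fetchColumn() . " rows stored\n";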

2 Answers


Use SplFixedArray. You really need to see How big are PHP arrays (and values) really? (Hint: BIG!) for why.

$t = 1e6;
$x = array();
for ($i = 0; $i < $t; $i++) {
    $x[$i] = 1 * rand(0, 10);
}

Output (in KB):

Start memory usage: 256
End memory usage: 82688
Peak memory usage: 82688

and

$t = 1e6;
$x = new SplFixedArray($t);
for ($i = 0; $i < $t; $i++) {
    $x[$i] = 1 * rand(0, 10);
}

Output (in KB):

Start memory usage: 256
End memory usage: 35584
Peak memory usage: 35584
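
A hedged side note, not from the original answer: since the values here fit in a single byte, a plain PHP string can even serve as a byte array, so ten million entries take roughly 10 MB:

// Hypothetical variant: use a string as a byte array, one byte per value.
$t = (int) 1e7;
$x = str_repeat("\0", $t); // pre-allocate 10,000,000 bytes
for ($i = 0; $i < $t; $i++) {
    $x[$i] = chr(rand(0, 10)); // write one byte per entry
}
echo 'Value at 12345: ' . ord($x[12345]) . "\n"; // read one byte back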

But better still, I think you should consider a memory-based database like Redis.
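
A minimal sketch of that idea using the Predis client (assuming a Redis server on localhost and Predis installed via Composer; the key name is hypothetical):

// Hypothetical sketch: keep the values in a Redis hash instead of a PHP array.
require 'vendor/autoload.php';
$redis = new Predis\Client(); // defaults to 127.0.0.1:6379
for ($i = 0; $i < 1e6; $i++) {
    $redis->hset('log:values', $i, rand(0, 10)); // hypothetical hash key
}
echo 'Entries stored: ' . $redis->hlen('log:values') . "\n";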

– Baba
  • When run with 1 million items, I got a peak memory of 86,784KB used for SplFixedArray. – Alister Bulman Feb 14 '13 at 17:58
  • I can only get up to about 5.9 million cycles with this method. I guess that's the end of the line as far as PHP-side optimization goes? :) –  Feb 14 '13 at 18:03
  • I'll accept this as a viable alternative to the problem. I decided to shell out from PHP to process the data through Python and pipe it back to PHP. I already gained tons of memory, and it seems faster than pure PHP. –  Feb 14 '13 at 18:10
  • Good idea, but I am still running some tests... I will update you if I make any headway. – Baba Feb 14 '13 at 18:16
  • @Allendar I think I found a better solution: https://github.com/ircmaxell/php-ndata#benchmarks – Baba Feb 26 '13 at 17:20

If SplFixedArray doesn't work for you, I would strongly recommend RabbitMQ: http://www.rabbitmq.com/tutorials/tutorial-one-php.html

RabbitMQ is simpler to configure and use than people usually think, and it has a good library for PHP.

With RabbitMQ your script can be ten, twenty, or a hundred times faster (depending on the number of consumers you set up), and you can also handle any amount of data.

I have used RabbitMQ to import millions of rows to retrieve information about all cars registered in Denmark; imagine how big that can be.
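
A minimal producer sketch along the lines of the linked tutorial, using php-amqplib (the queue name and connection defaults are assumptions):

// Hypothetical producer sketch based on the linked RabbitMQ PHP tutorial:
// push each log entry onto a queue so several consumers can process them in parallel.
require 'vendor/autoload.php';
use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();
$channel->queue_declare('log_entries', false, true, false, false); // durable queue
foreach (array('entry 1', 'entry 2') as $line) { // stand-in for real log lines
    $channel->basic_publish(new AMQPMessage($line), '', 'log_entries');
}
$channel->close();
$connection->close();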