
The situation:

  • I have multiple arrays containing multiple complex objects, each storing different data, but in the same format.
  • These arrays (containing objects) are too complex to be stored in an SQL table, so I serialize them and store each array in a separate file.
  • I use the PHP function file_get_contents() to read the data, and then call unserialize() on it.
  • I have to load one file (max 100 MB) per client request, unserialize() it, and process it (see the sketch after this list).
  • This data is not the same for every client.
  • All data in total is around 3 GB.
  • This data is updated every 24 hours, and the total size increases with each update.
  • Maximum data per file is 100 MB.
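
Roughly, the per-request loading path described above looks like this (the file path here is a placeholder, not my actual code):

// per-request load: read the whole file, then unserialize it
$raw  = file_get_contents('/data/client-xyz.ser'); // file can be up to 100 MB
$data = unserialize($raw);                         // this call takes ~33 s at 40 MB
// ... process $data and build the response ...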

The problem:

  • The method I am currently using works fine for small files (up to 5 MB).
  • But for larger files, it takes too much time.
  • unserialize() takes about 33 seconds to execute when I load a file of around 40 MB.
  • So the main problem with my current method is unserialize().

The main question:

  • How can I store my very complex objects without serializing them, or how can I make my unserialization faster?
Peeyush Kushwaha
  • Can you try to explain the situation better. What are you serializing, and why? – Gordon Oct 27 '12 at 12:02
  • This is an interesting question, but as Gordon said, can you give us more information? Are you deserializing the data you fetch via file_get_contents()? Are the files on the same machine, or do you fetch them remotely via file_get_contents()? – N.B. Oct 27 '12 at 12:04
  • If the serialization is your bottleneck, switch to [igbinary serialization](http://pecl.php.net/package/igbinary); it is more speed-optimized. Otherwise, take a file format that does not need to pull all data into PHP memory space (e.g. SQLite or maybe XML). But that's just what @Gordon asks for. Handle with care. – hakre Oct 27 '12 at 12:04
  • 3 GB is not that large for a database. Plus, you will be able to store data in columns if you can make a nice schema and avoid serialize. Also have a look at MongoDB. It is not a 1:1 replacement for MySQL, though. – AKS Oct 27 '12 at 12:29
  • @Gordon I have now updated the question. I am serializing the objects because they are too complicated to be stored in SQL. I have an array containing objects, which then contain variables and even more arrays... – Peeyush Kushwaha Oct 27 '12 at 12:36
  • @N.B. Yes, I am unserializing the data I get from file_get_contents(); in fact, that is the main culprit! – Peeyush Kushwaha Oct 27 '12 at 12:37
  • @PeeyushKushwaha I find your setup to be … well … weird and voted to close as too localized, since I don't think many people will have this problem. But hakre's suggestion to use igbinary is likely what you are looking for. – Gordon Oct 27 '12 at 12:38
  • You might need to take a step back and look at how your extremely (?) complex objects came to be. You might be better off making an object that can create itself from data that you have in the database. So instead of saving your "way too complicated" objects, you should save the base information in the database and write a constructor that can read that data and create the object. – Nanne Oct 27 '12 at 13:43

2 Answers


How can I store my very complex objects without serializing them, or how can I make my unserialization faster?

If you need PHP objects that are not stdClass (i.e. you have class definitions alongside the data members), you need to use some kind of PHP-compatible serialization.

Independent of the language, serialization comes with a price because it is data transformation and mapping. If you have a large amount of data that needs to be transformed from and into string (binary) form, it takes processing time and memory.

By default you use PHP's built-in serialization via serialize() and unserialize(). Other extensions offer something similar.

As you've said you need some kind of serialization, and unserializing is the bottleneck, you could consider choosing another serializer such as igbinary.
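
If the igbinary extension is installed, it is essentially a drop-in replacement for serialize()/unserialize(). A minimal sketch, assuming the extension is available (the file name is a placeholder):

// storing: igbinary_serialize() produces a compact binary string
file_put_contents('name-of-data.bin', igbinary_serialize($data));

// reading: igbinary_unserialize() is typically faster than unserialize()
$data = igbinary_unserialize(file_get_contents('name-of-data.bin'));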

However, storing PHP in flat files works, too. See var_export:

// storing
file_put_contents(
    'name-of-data.php', '<?php return ' . var_export($data, true) . ';'
);

This example stores data in a format that PHP can read back in. It is useful for structured data in the form of stdClass objects and arrays. Reading it back in is pretty straightforward:

// reading
$data = include('name-of-data.php');

If you put the PHP code into the database, you don't need the <?php prefix:

// storing
$string = 'return ' . var_export($data, true) . ';';
$db->store($key, $string);

// reading
$string = $db->read($key);
$data = eval($string);

The benefit of using var_export is that you let PHP itself parse the data. It's normally faster than serialize / unserialize, but for your case you need to measure that anyway.

I suggest you test how var_export behaves in terms of file size and speed, and igbinary as well. Then compare. Add the results to your question once you have gathered them, so additional suggestions can be given in case this does not solve your issue.
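
A minimal sketch of such a measurement, assuming $data holds one of your arrays (naive wall-clock timing; run it several times and average):

// time the var_export write and the include read
$t = microtime(true);
file_put_contents('test-data.php', '<?php return ' . var_export($data, true) . ';');
printf("write: %.3f s, %d bytes\n", microtime(true) - $t, filesize('test-data.php'));

$t = microtime(true);
$copy = include 'test-data.php';
printf("read: %.3f s\n", microtime(true) - $t);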

Another thing that comes to mind is the JSON format. Some data stores are optimized for it, so you can query the store directly. The map-reduce methodology can also be used with many of these data stores, so you can spread out processing of the data. That's something you won't get easily with serialize/unserialize, which always processes one big chunk of data at a time and cannot be split up.
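
To illustrate the chunking point: if each object is stored as one JSON document per line, records can be decoded one at a time instead of loading everything at once. A sketch under the assumption that your objects survive a round-trip through arrays/stdClass (the file name is a placeholder):

// storing: one JSON document per line
$fh = fopen('name-of-data.jsonl', 'w');
foreach ($records as $record) {
    fwrite($fh, json_encode($record) . "\n");
}
fclose($fh);

// reading: decode record by record, so memory stays bounded
$fh = fopen('name-of-data.jsonl', 'r');
while (($line = fgets($fh)) !== false) {
    $record = json_decode($line);
    // ... process $record ...
}
fclose($fh);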

hakre

A better alternative to PHP's internal serialization is MessagePack: http://msgpack.org/

It is faster, smaller and has support for almost any language.

You can find the PHP extension on PECL (pecl install msgpack-beta).
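
Usage mirrors serialize()/unserialize(). A minimal sketch using the extension's pack/unpack functions (the file name is a placeholder):

// storing: msgpack_pack() returns a compact binary string
file_put_contents('name-of-data.msgpack', msgpack_pack($data));

// reading: msgpack_unpack() restores the original value
$data = msgpack_unpack(file_get_contents('name-of-data.msgpack'));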

I did a simple benchmark comparing PHP's internal serialization, MessagePack, and JSON for a fairly big object (75 MB when serialized with PHP). JSON is out of the discussion because it cannot serialize objects, but at least it fails fast :). The results are shown below (in seconds):

___________
msgpack
___________
Serialization: 0.203326
File size 33976K
Deserialization: 0.787364

___________
serialize
___________
Serialization: 1.912351
File size 75971K
Deserialization: 0.861699

___________
json
___________
Serialization: 0.000019
File size 0K
Deserialization: 0.000023

Here is the gist with the source code for this benchmark https://gist.github.com/3964906#comments

As you can see, MessagePack clearly performs better than PHP's serialization, but even so, unserialize() didn't perform as badly as you describe.

catalin.costache
  • Interesting, didn't know about msgpack. But deserialization seems to be similarly costly. Serialization looks impressive. – hakre Oct 27 '12 at 15:04