17

I'm storing some "unstructured" data (a keyed array) in one field of my table, and I'm currently using serialize() / unserialize() to "convert" back and forth between array and string.

Every now and then, however, I get errors when unserializing the data. I believe these errors happen because of Unicode data in the strings inside the array I'm serializing, although there are some records with Unicode data that work just fine. (The DB field is UTF-8.)

I'm wondering whether using json_encode instead of serialize will make a difference / make this more resilient. This is not trivial for me to test, since in my dev environment everything works well, but in production, every now and then (about 1% of records) I get an error.

Btw, I know I'm weaseling out of finding an actual explanation for the problem and just blindly trying something; I'm kind of hoping I can get rid of this without spending too much time on it.

Do you think using json_encode instead of serialize will make this more resilient to "serialization errors"? The data format does look more "forgiving" to me...

UPDATE: The actual error I'm getting is:

 Notice: unserialize(): Error at offset 401 of 569 bytes in C:\blah.php on line 20

Thanks! Daniel

Daniel Magliola
  • Strikes me as quite an inefficient process to convert the string to/from an array/object every database access. – Brian Mar 18 '11 at 12:09
  • If your data is a keyed array, then it isn't unstructured and should be stored in a correctly normalized table in the database – Mark Baker Mar 18 '11 at 12:10
  • 1
    If it's UTF-8 that causes the problem with unserialize(), that implies that you probably didn't set PHP's internal encoding to UTF-8. I know this isn't a direct answer to your question (json_encode() vs. serialize()), but have you tried calling mb_internal_encoding("UTF-8"); and then unserialize()? – Furicane Mar 18 '11 at 12:12
  • It always amazes me when someone mentions an error but gives **not the slightest hint of what the particular error is**. And of course there is no trace of the actual buggy data. Everything is virtual. Everyone has to guess. – Your Common Sense Mar 18 '11 at 12:17
  • @Brian: Storing this in a structured way would be *way* more inefficient. @Mark: Agreed, but it's *much* more convenient to store it this way, since I can have different structures and I never need to "query" these fields; this is much more flexible and simple – Daniel Magliola Mar 18 '11 at 12:19
  • @Col.Shrapnel: Just added the error. It's not particularly informative, which is why I left it out. Unserialize always gives you the same error, and the offset is not telling me much when I look at the string. – Daniel Magliola Mar 18 '11 at 12:22
  • That's why you have to post the actual data as well. Because it's not telling you much, it needs to be examined by more experienced people. – Your Common Sense Mar 18 '11 at 12:23
  • Yeah, I know, and I would love to, but that actual data contains sensitive information I can't show here, and obviously, masking that out will probably hide the problem. – Daniel Magliola Mar 18 '11 at 12:25
  • OMG, "sensitive information". "I wanna ask you a question but i provide no information cause its top seeeeeeeecret so you have to waste your time guessing important details" – Your Common Sense Mar 18 '11 at 12:28
  • @Furicane: calling mb_internal_encoding before unserialize didn't make any difference. I'm getting the same error on the same offset. Not sure whether that means the problem is not with UTF-8, or not, to be honest. – Daniel Magliola Mar 18 '11 at 12:29
  • @Col. Shrapnel: I'm really not screwing around with you, you think I don't want your help? Thank you for your help, by the way. – Daniel Magliola Mar 18 '11 at 12:30
  • It's the community I'm speaking of. You're sponging on it, trying to save your time by making someone else waste theirs. – Your Common Sense Mar 18 '11 at 12:32
  • 1
    The PHP serialize format is not immune to string length changes caused by multibyte encoding variations. That could be the problem here if there are charset bugs. With JSON you will likewise have to rely on a correct UTF-8 representation, so the resiliency advantage is mostly theoretical. -- Anyway, if this is a serious issue but not debuggable, then use a binary field or base64/hex marshalling for the whole blob. (This could be undone in the DB if there is a need.) – mario Mar 18 '11 at 12:34
  • 1
    @Daniel - at php.net/unserialize people have left many useful comments on unserializing UTF-8 encoded data. You might want to try out their code before moving on to a change of approach. – Furicane Mar 18 '11 at 12:35
  • @mario: Thank you for your answer. The reason I'm trying to avoid base64 is so I'd be able to easily read the contents of the DB table directly (which I've had to do several times). I'm not sure I understand your comment correctly. Since PHP's serialize format is not immune to string length changes, and assuming everything is correctly set to UTF-8 (DB connection, DB field, PHP, etc.), does that mean I could still have problems with serialize that I wouldn't have with JSON? Or am I completely misunderstanding? Thanks! – Daniel Magliola Mar 18 '11 at 12:38
  • 1
    Can't tell without a hexdump. But if you get a corrupt UTF-8 sequence, then the DB might return it stripped or replaced with U+DCxx (I don't know exactly). Then the string length stored inside the serialize format will be off, corrupting the whole blob. -- So JSON might work better, except that PHP's `json_decode()` just as readily refuses to operate when it encounters invalid UTF-8 **or** invalid JS string escape sequences. -- Regarding base64: there must certainly be stored procedures to decode it on the fly. – mario Mar 18 '11 at 12:43

7 Answers

15

JSON has one main advantage:

  • compatibility with languages other than PHP.

PHP's serialize has one main advantage:

  • it's specifically designed to store PHP data -- most notably, it can store serialized objects (instances of classes), which are re-instantiated as the right class when the string is unserialized.

(Yes, those advantages are the exact opposite of each other.)


In your case, as you are storing data that's not really structured, both formats should work pretty well.

And the encoding problem you have should not be related to serialize itself: as long as everything (DB, connection to the DB, PHP files, ...) is in UTF-8, serialization should work too.
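
To make the trade-off above concrete, here is a minimal sketch. The `Point` class is hypothetical and exists only for this illustration; it is not from the question.

```php
<?php
// Hypothetical class used only for this illustration.
class Point
{
    public $x;
    public $y;

    public function __construct($x, $y)
    {
        $this->x = $x;
        $this->y = $y;
    }
}

$p = new Point(1, 2);

// serialize() records the class name, so unserialize() rebuilds a Point.
$fromSerialize = unserialize(serialize($p));

// json_encode() keeps only the public properties; json_decode() returns a stdClass.
$fromJson = json_decode(json_encode($p));

var_dump($fromSerialize instanceof Point); // bool(true)
var_dump($fromJson instanceof Point);      // bool(false)
var_dump(get_class($fromJson));            // string(8) "stdClass"
```

For a plain keyed array like the one in the question, neither difference matters much, which is why both formats are usable here.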

Pascal MARTIN
  • The connection to DB is UTF-8. The PHP file is too. I honestly don't know enough about how PHP handles UTF-8 to know where the problem could be. I don't even know whether it's related to UTF-8, but it's the main thing I can think of, since I understand that PHP's handling of it is not exactly stellar. Any ideas of what other problem I might be having? Thanks! – Daniel Magliola Mar 18 '11 at 12:32
2

By default, json_encode() escapes non-ASCII characters as \uXXXX sequences (e.g., “Schrödinger” becomes “Schr\u00f6dinger”), so its output is plain ASCII; serialize() stores the bytes unchanged.

Source: https://www.toptal.com/php/10-most-common-mistakes-php-programmers-make#common-mistake-6--ignoring-unicodeutf-8-issues


To leave UTF-8 characters untouched, you can use the option JSON_UNESCAPED_UNICODE as of PHP 5.4.

Source: https://stackoverflow.com/a/804089/1438029
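
A short sketch of the difference, assuming PHP 5.4 or later. The string is written with explicit byte escapes so the result does not depend on how this file is saved.

```php
<?php
$name = "Schr\xC3\xB6dinger"; // "Schrödinger" as UTF-8 bytes

$escaped   = json_encode($name);
$unescaped = json_encode($name, JSON_UNESCAPED_UNICODE);

echo $escaped, "\n";   // "Schr\u00f6dinger"
echo $unescaped, "\n"; // "Schrödinger"

// Both forms decode back to the same string.
var_dump(json_decode($escaped) === json_decode($unescaped)); // bool(true)

// serialize() stores the raw bytes plus a byte count: ö takes 2 bytes in UTF-8.
echo serialize($name), "\n"; // s:12:"Schrödinger";
```

The escaped form is pure ASCII, so it survives a misconfigured connection charset; the unescaped form is easier to read directly in the table but depends on UTF-8 being handled correctly end to end.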

SKisby
Geoffrey Hale
2

I think that, unless you absolutely need to preserve PHP-specific types, json_encode() is the way to go for storing structured data in a single field in MySQL. Here's why:

https://dev.mysql.com/doc/refman/5.7/en/json.html

As of MySQL 5.7.8, MySQL supports a native JSON data type defined by RFC 7159 that enables efficient access to data in JSON (JavaScript Object Notation) documents

If you are using a version of MySQL that supports the new JSON data type you can benefit from that feature.

Another important consideration is the ability to make changes to those encoded strings. Suppose you have a URL stored in encoded strings all over your database. WordPress users who've ever tried to migrate an existing database to a new domain name may sympathize here. If the data is serialized, a search-and-replace is going to break things, because every serialized string carries its byte length as a prefix. If it's JSON, you can simply run a query using REPLACE() and everything will be fine, provided the slashes were stored unescaped (json_encode() writes / as \/ unless JSON_UNESCAPED_SLASHES is passed). Example:

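$arr = ['url' => 'http://example.com'];
$ser = serialize($arr);
// JSON_UNESCAPED_SLASHES keeps "http://" literal; by default json_encode() writes "http:\/\/"
$jsn = json_encode($arr, JSON_UNESCAPED_SLASHES);

// Naive search-and-replace on both stored strings
$ser = str_replace('http://', 'https://', $ser);
$jsn = str_replace('http://', 'https://', $jsn);

// The s:18 length prefix no longer matches the 19-byte URL, so this fails:
print_r(unserialize($ser));
// PHP Notice:  unserialize(): Error at offset 39 of 43 bytes in /root/sandbox/encoding.php on line 10

// JSON has no length prefixes, so the edited string still decodes:
print_r(json_decode($jsn, true));
// Array ( [url] => https://example.com )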
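
For data that is already serialized, the replacement has to happen on the decoded values so that the length prefixes are regenerated. A minimal sketch of that approach (not part of the original answer):

```php
<?php
$ser = serialize(['url' => 'http://example.com']);

// Decode, edit the values, then re-encode so the length prefixes are recomputed.
$data = unserialize($ser);
array_walk_recursive($data, function (&$value) {
    if (is_string($value)) {
        $value = str_replace('http://', 'https://', $value);
    }
});
$ser = serialize($data);

print_r(unserialize($ser)); // Array ( [url] => https://example.com )
echo $ser, "\n";            // a:1:{s:3:"url";s:19:"https://example.com";}
```

This cannot be done with a single SQL REPLACE(), which is the practical advantage the JSON version has.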

1

As I'm going through this I'll give my opinion: both serialize and json_encode are good for storing data in a DB. For those looking for performance, I tested it, and in my results json_encode was a few microseconds faster than serialize. I used this script to measure the time difference.

$bounced = array();
for ($i = 0; $i < 9999; ++$i) {
    $bounced[$i] = $i;
}

// Time one serialize/unserialize round trip (nothing is printed inside the timed region)
$timeStart = microtime(true);
unserialize(serialize($bounced));
print timer_diff($timeStart) . " sec.\n";

// Time one json_encode/json_decode round trip
$timeStart = microtime(true);
json_decode(json_encode($bounced));
print timer_diff($timeStart) . " sec.\n";

// Elapsed seconds since $timeStart, formatted to microsecond precision
function timer_diff($timeStart)
{
    return number_format(microtime(true) - $timeStart, 6);
}
khalid
1

If the problem is (and I believe it is) in UTF-8 encoding, there is no difference between json_encode and serialize. Both will leave the character encoding unchanged.

You should make sure your database/connection is properly set up to handle all UTF-8 characters, or encode the whole record into a supported encoding before inserting it into the DB.
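
One way to catch bad input before it reaches the DB is to check the encoder's result at write time. This is only a sketch: `encode_for_storage()` is a hypothetical helper name, and it assumes PHP 5.5 or later, where json_encode() returns false for strings that are not valid UTF-8.

```php
<?php
// Hypothetical helper: encode an array for storage and fail loudly on bad input.
function encode_for_storage(array $data)
{
    $json = json_encode($data);
    if ($json === false) {
        // json_encode() returns false when a string is not valid UTF-8.
        throw new RuntimeException('Cannot encode: ' . json_last_error_msg());
    }
    return $json;
}

$good = ['name' => "Schr\xC3\xB6dinger"]; // valid UTF-8
$bad  = ['name' => "Schr\xF6dinger"];     // ISO-8859-1 byte, invalid as UTF-8

echo encode_for_storage($good), "\n";

try {
    encode_for_storage($bad);
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```

serialize() would accept both arrays without complaint, so with it an encoding problem only shows up later, when the stored bytes have been altered and unserialize() fails.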

Also please specify what "I get an error" means.

Petr Peller
  • @Col. Shrapnel: The OP had not provided enough information at the time I wrote this post, so believing was the only option :) – Petr Peller Mar 29 '11 at 14:48
1

Found this in the PHP docs...

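function mb_unserialize($serial_str) {
    // Recompute every string's byte-length prefix before unserializing.
    // The original snippet used the /e modifier, which was removed in PHP 7,
    // so a callback does the same rewrite here.
    $out = preg_replace_callback('!s:(\d+):"(.*?)";!s', function ($m) {
        return 's:' . strlen($m[2]) . ':"' . $m[2] . '";';
    }, $serial_str);
    return unserialize($out);
}

I don't quite understand it, but it worked to unserialize the data that I couldn't unserialize before. I've moved to JSON now; I'll report in a couple of weeks whether this solved the problem of randomly getting some records "corrupted".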
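
As far as I can tell, the function works because serialize() prefixes every string with its length in bytes, and the regex rewrites that prefix to match the bytes actually present. A minimal sketch of the failure and the repair follows; the charset conversion is simulated with str_replace(), which is an assumption about what happened to the real data.

```php
<?php
$utf8 = "caf\xC3\xA9";    // "café" in UTF-8: 5 bytes
$ser  = serialize($utf8); // s:5:"café";

// Simulated charset conversion (assumption): é becomes the single
// ISO-8859-1 byte 0xE9, but the s:5 prefix is left unchanged.
$mangled = str_replace("\xC3\xA9", "\xE9", $ser);

var_dump(@unserialize($mangled)); // bool(false)

// Rewriting the prefix to the actual byte length makes it parseable again.
$fixed = preg_replace_callback('!s:(\d+):"(.*?)";!s', function ($m) {
    return 's:' . strlen($m[2]) . ':"' . $m[2] . '";';
}, $mangled);

var_dump(unserialize($fixed) === "caf\xE9"); // bool(true)
```

Note that the repaired value is the converted string, not the original UTF-8 one, so this recovers the structure but not necessarily the exact characters.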

Daniel Magliola
0

As a design decision, I'd opt for storing JSON because it can only represent a data structure, whereas serialization is bound to a PHP data object's signature.

The advantages I see are:

  • you are forced to separate the data storage from any logic layer on top.
  • you are independent of changes to the data object's class (say, for example, that you want to add a field).

Sorin Mocanu