Suppose you have two datasets and need to make sure they have not changed. For example, you have an array of objects in one hand and another array in the other, and you need to verify that both arrays are exactly the same.

Each array can contain any type of data: booleans, strings, objects, arrays, NULL, etc.

When compared, the contents of both arrays should be exactly the same: same data types and same order.

Instead of iterating over the array contents with code that can compare different types of data, possibly recursively, I came up with a solution, and I would be grateful if you could shed some light on whether it has any downsides. PHP is the language, but I'm more interested in a language-neutral answer.

I serialized both datasets separately and calculated their md5 hashes. I chose md5 because it is available without external extensions or libraries and is quite fast. I am aware of the chance of a collision, and that md5 hashes are nowhere near cryptographically secure.
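For readers who want to see the approach concretely, here is a minimal sketch of it. The helper name `fingerprint()` and the sample arrays are made up for illustration; only `serialize()` and `md5()` come from the description above.

```php
<?php
// Hypothetical helper (not from the original post): reduce any
// serializable value to an md5 hash of its serialized form.
function fingerprint($data): string
{
    // serialize() encodes types, keys, and element order, so a change
    // in any of them changes the string that gets hashed.
    return md5(serialize($data));
}

// Illustrative sample data only.
$a = [true, 'text', null, ['nested' => 1]];
$b = [true, 'text', null, ['nested' => 1]];
$c = [true, 'text', null, ['nested' => '1']]; // string '1' instead of int 1

$sameAsB = fingerprint($a) === fingerprint($b); // identical content and types
$sameAsC = fingerprint($a) === fingerprint($c); // type of one value differs
```

Because the serialized form records the type of each value, `$sameAsC` should come out false even though `1` and `'1'` compare equal under a loose `==`.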

My questions are:

  • Is this a widely used method to validate arbitrary types of data? Checking file checksums makes sense, but I have not personally used it to compare variables like this.
  • I'm mainly doing this to keep my code simple. A direct comparison is probably faster because it can stop as soon as it finds the first mismatch. In my case, the data is fairly small: about 5 KB as a serialized string.
  • Are there any other downsides that I should know of?

Thanks in advance.

AKS
  • Starred - one interesting find is that `json_encode` [seems to be much faster](http://stackoverflow.com/questions/2254220/php-best-way-to-md5-multi-dimensional-array) than `serialize` – JimL Apr 10 '16 at 21:16
  • sounds like a good plan, if you can afford the odd failure –  Apr 10 '16 at 21:23

1 Answer

If you're looking for changes in an array, I would actually recommend using `crc32()`. Like `md5()`, this function has been available in PHP since version 4 and requires no extra libraries. However, `crc32()` is meant for error checking and is quicker than `md5()`, which was designed as a cryptographic hash and does considerably more work per byte.

Especially since you want a language-agnostic answer, I would always choose `crc32()` over `md5()`: it is much simpler to find libraries for, and it is much less computationally expensive, which makes it suitable for almost every application, even embedded devices.
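As a rough illustration of this suggestion, the sketch below swaps the hash for PHP's built-in `crc32()`. The helper name `checksum()` and the sample arrays are hypothetical. Note that a CRC32 value is only 32 bits wide, so two different inputs are more likely to share a checksum than they are to share a 128-bit md5 hash.

```php
<?php
// Hypothetical helper (name is illustrative): a CRC32 checksum of the
// serialized value. crc32() returns an integer.
function checksum($data): int
{
    return crc32(serialize($data));
}

// Illustrative sample data only.
$before  = ['id' => 7, 'tags' => ['a', 'b'], 'active' => true];
$after   = ['id' => 7, 'tags' => ['a', 'b'], 'active' => true];
$changed = ['id' => 7, 'tags' => ['b', 'a'], 'active' => true]; // order swapped

$unchanged = checksum($before) === checksum($after);
$detected  = checksum($before) !== checksum($changed);
```

Matching checksums suggest the data is unchanged but do not prove it; if a false match would be costly, a full comparison can be run whenever the checksums agree.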

Garry Welding