
I'm writing a command line application in PHP that accepts a path to a local input file as an argument. The input file will contain one of the following things:

  • A JSON encoded associative array
  • A serialize()'d version of the associative array
  • A base64 encoded version of the serialize()'d associative array
  • A base64 encoded JSON encoded associative array
  • A plain old PHP associative array
  • Rubbish

In short, there are several dissimilar programs that I have no control over that will be writing to this file, in a uniform way that I can understand, once I actually figure out the format. Once I figure out how to ingest the data, I can just run with it.

What I'm considering is:

  • If the first byte of the file is {, try json_decode() and see if it fails.
  • If the first byte of the file is < or $, try include() and see if it fails.
  • If the first three bytes of the file match a:[0-9], try unserialize().
  • If none of the above match, try base64_decode() and see if it fails. If it succeeds:
    • Check the first bytes of the decoded data, again.
    • If all of that fails, it's rubbish.
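
In code, that cascade would look roughly like this (a sketch, untested; the sniff() helper is a made-up name):

    <?php
    // Sketch of the cascade above (untested).
    function sniff($s, $tryBase64 = true)
    {
        $s = ltrim($s);
        if ($s === '') return false;

        if ($s[0] === '{') {                      // smells like JSON
            $data = json_decode($s, true);
            if (is_array($data)) return $data;
        }

        if (preg_match('/\Aa:\d+:\{/', $s)) {     // smells like serialize() output
            $data = @unserialize($s);
            if (is_array($data)) return $data;
        }

        if ($tryBase64) {
            $decoded = base64_decode($s, true);   // strict mode: false on bad input
            if ($decoded !== false) {
                return sniff($decoded, false);    // re-check the decoded bytes once
            }
        }

        return false;                             // rubbish
    }
    // The include() case ('<' or '$' as first byte) has to execute the file,
    // so it stays separate and is only sane if the writer is trusted.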

That just seems quite expensive for quite a simple task. Could I be doing it in a better way? If so, how?

Tim Post
  • This is a perfect example where convention does the job. If you know that *all* JSON files end with .json, then you don't need to parse. If you have no control over the environment though, it's rather unsafe to run untrusted code (with `include`). – rid May 24 '11 at 18:43
  • I would have the user simply indicate what type of file it is, then sanity check it. This automated method seems like it has too much potential to be hit by weird edge cases. – onteria_ May 24 '11 at 18:43
  • In what way will `include` fail? – lonesomeday May 24 '11 at 18:43
  • @rdineiu Unfortunately, the extension isn't possible. I'm reading a file named 'dump' while stitching together some dissimilar systems into one coherent front end. @lonesomeday `include()` fails if you can't get to a member after using it; thankfully, the members of the array are all the same, it's just the format that differs. – Tim Post May 24 '11 at 18:47
  • When you write the file to disk, is it possible to insert an extra byte at the beginning of the file? If so, you could make that byte determine the type of file, then strip it. – rid May 24 '11 at 18:49
  • @rdineiu I'm not _always_ the writer; other programs that I have no control over might also change the file. – Tim Post May 24 '11 at 18:52

4 Answers


There isn't much to optimize here. The magic-bytes approach is already the way to go. But the actual deserialization functions can of course be avoided. It's feasible to use a verification regex for each format instead (which, despite the meme, is often faster than having PHP actually unpack a nested array).

base64 is easy enough to probe for.
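
Something like this would do as a first pass ($input is an assumed, already-trimmed string):

    // Quick base64 probe (sketch): only alphabet characters plus
    // optional trailing '=' padding.
    $mightBeBase64 = (bool) preg_match('%\A[A-Za-z0-9+/]+={0,2}\z%', $input);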

json can be checked with a regex. The one in "Fastest way to check if a string is JSON in PHP?" is the RFC version for securing it in JS. But it would be feasible to write a complete json (?R) match rule.
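
Such a recursive rule might look like this (a sketch adapted from the widely circulated (?(DEFINE)...) pattern; treat it as a filter, not RFC-grade validation):

    $jsonRe = '/
      (?(DEFINE)
        (?<number>  -? (?: 0 | [1-9]\d* ) (?: \.\d+ )? (?: [eE][+-]?\d+ )? )
        (?<boolean> true | false | null )
        (?<string>  " (?: [^"\\\\]+ | \\\\ ["\\\\bfnrt\/] | \\\\ u[0-9A-Fa-f]{4} )* " )
        (?<array>   \[ \s* (?: (?&json) (?: \s* , \s* (?&json) )* )? \s* \] )
        (?<pair>    (?&string) \s* : \s* (?&json) )
        (?<object>  \{ \s* (?: (?&pair) (?: \s* , \s* (?&pair) )* )? \s* \} )
        (?<json>    (?&number) | (?&boolean) | (?&string) | (?&array) | (?&object) )
      )
      \A \s* (?&json) \s* \z
    /x';

    $looksLikeJson = (bool) preg_match($jsonRe, $input);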

serialize is a bit more difficult without a proper unpack function. But with some heuristics you can already assert that it's a serialize blob.
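
For instance (a sketch; serialize() always prefixes an array with its element count):

    // serialize() output for an array starts 'a:<count>:{', e.g. a:2:{...}.
    $mightBeSerialized = (bool) preg_match('/\Aa:\d+:\{/', $input);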

php array scripts can be probed a bit faster with token_get_all. Or if the format and data is constrained enough, again with a regex.
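
A token-level probe could be as simple as this (a sketch; it assumes the script has an opening <?php tag and uses the array(...) literal syntax):

    // Tokenize the candidate script and look for an array literal.
    $mightBePhpArray = false;
    foreach (token_get_all($input) as $token) {
        if (is_array($token) && $token[0] === T_ARRAY) {
            $mightBePhpArray = true;
            break;
        }
    }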

The more important question here is, do you need reliability - or simplicity and speed?

mario
  • Regex would be cheaper than just examining the first few bytes? Speed is paramount, but I also need a low false alarm rate. – Tim Post May 24 '11 at 19:17
  • You should still do a manual $string[0] comparison at least. But PCRE is generally faster for verification. – mario May 24 '11 at 19:56
  • Ah, I see what you mean, I'm not seeing the forest, just lots of trees. Thanks, yes, a short match and then comparing $string[0] to validate rubbish would make better sense than what I was considering. Thank you for your help! – Tim Post May 24 '11 at 20:00

For speed, you could use the file(1) utility and add "magic numbers" in /usr/share/file/magic. It should be faster than a pure PHP alternative.
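
For example (a sketch; it assumes file(1) is on the PATH and shell_exec() is permitted — the magic entries and the /tmp path are made up):

    // Write a tiny custom magic database, then let file(1) classify the dump.
    $magic = <<<'MAGIC'
    0   string  a:      PHP serialized data
    0   string  {       JSON text (guess)
    0   string  <?php   PHP script
    MAGIC;
    file_put_contents('/tmp/dump.magic', $magic);

    $type = trim(shell_exec('file -b -m /tmp/dump.magic ' . escapeshellarg($path)));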

rid
  • That would probably work, but this has to be portable. It's one of those "WTF?" moments when you get the requirements, but you have to do it anyway. Upvoted because if you're only dealing with a GNU system, problem solved. – Tim Post May 24 '11 at 20:44

You can try json_decode(), which returns NULL if it fails, and unserialize(), which returns FALSE (and raises a notice) if it fails; then base64_decode() and run both again on the result. It's not fast, but it's infinitely less error prone than hand parsing them...
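
Roughly (a sketch, untested; $raw is the assumed file contents):

    // Try each decoder in turn; fall back to base64 and retry.
    $data = json_decode($raw, true);
    if ($data === null) {
        $data = @unserialize($raw);            // FALSE on failure, plus a notice
    }
    if ($data === null || $data === false) {
        $decoded = base64_decode($raw, true);  // strict: FALSE on invalid input
        if ($decoded !== false) {
            $data = json_decode($decoded, true);
            if ($data === null) {
                $data = @unserialize($decoded);
            }
        }
    }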

rid
  • I was trying to come up with a way to make a 'best guess' hoping to avoid just going down the list of possibilities every time. This has to conceivably load and compare thousands of files in an hourly cron, speed really matters. – Tim Post May 24 '11 at 18:51

The issue here is that if you have no idea which it can be, you will need to develop a detection algorithm. Conventions should be set with an extension (check the extension, and if that fails, tell whoever put the file there to use the correct extension); otherwise you will need to check the contents yourself. Most algorithms that detect what type a file actually is use heuristics to determine its contents (exe, jpg, etc.), because such files generally have some sort of signature that identifies them. So if you have no idea what the content will be for certain, it's best to look for features that are specific to those contents. This does sometimes mean reading more than a couple of bytes.

Jase