17

I am designing a binary file format from scratch, and I would like to include some magic bytes at the beginning so that it can be identified easily. How do I go about choosing which bytes? I am not aware of any central registry of magic numbers, so is it just a matter of picking something fairly random that isn't already identified by, say, the file command on a nearby UNIX box?

JasonMArcher
jl6
  • Have a look at this question, it mentions a database of magic numbers: http://stackoverflow.com/questions/55869/determine-file-type-of-an-image –  Aug 25 '10 at 12:58
  • FILE SIGNATURES TABLE: http://www.garykessler.net/library/file_sigs.html –  Aug 25 '10 at 12:59
  • Dated (as in "expired draft RFC"), but interesting: https://tools.ietf.org/html/draft-main-magic-00 – Roger Lipscombe Jul 14 '16 at 09:14

2 Answers

22

Stay away from super-short magic numbers. Just because you're designing a binary format doesn't mean you can't use a text string as the identifier. Follow that with an EOF character, and as an added bonus, people who cat or type your binary file won't get a mangled terminal.
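
As a rough sketch of this advice, the snippet below builds an 8-byte signature from a readable tag plus a few non-text bytes, including the `0x1A` substitute character. The tag `MYFMT` and the helper names `add_magic` and `has_magic` are made up for illustration; the layout is loosely modelled on the PNG signature (`\x89PNG\r\n\x1a\n`).

```python
# Hypothetical 8-byte signature: a non-ASCII lead byte, a readable tag,
# then the 0x1A "substitute" byte and a trailing newline.
MAGIC = b"\x89MYFMT\x1a\n"


def add_magic(payload: bytes) -> bytes:
    """Prefix the payload with the signature."""
    return MAGIC + payload


def has_magic(data: bytes) -> bool:
    """Return True if the data starts with the signature."""
    return data.startswith(MAGIC)
```

The readable tag is easy to spot in a hex editor, while the surrounding non-text bytes make it unlikely that an ordinary text file starts with the same sequence.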

snemarch
  • 3
    There's no character that'll make `cat` stop reading prematurely (that I can find), so someone who `cat`s a binary format is going to have a mangled terminal whatever you do. The "substitute" character (`1A`) is the one you want for `type` though. – Daisy Leigh Brenecki Feb 18 '17 at 01:37
1

There is no universally correct way. Best practices can be suggested, but these are often situational. For example, if you're checking the integrity of volatile memory, which has an undefined initial state when power is applied, it may be beneficial to incorporate long runs of 0s or 1s (e.g. FFF0 00FF F000), which can stand out against random noise.

If the file is mostly binary, a popular choice is an ASCII text string, which stands out among the binary data in a hex editor. For example, GIF uses GIF89a and FLAC uses fLaC. On the other hand, a plain text identifier may be falsely detected in an arbitrary text file, so non-printable or control characters can be incorporated into the signature.

In general, it does not matter that much what they are, even a bunch of NULL bytes can be used for file detection. But ideally you want the longest unique identifier you can afford, and at minimum 4 bytes long. Any identifier under 4 bytes will show up more often in random data. The longer it is, the less likely it will ever be detected as a false positive. Some known examples are as long as 40 bytes. In a way, it's like a password.
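
To put a rough number on "less likely", the sketch below computes the chance that a block of uniformly random bytes happens to equal one fixed signature of a given length. The function name is hypothetical, and real-world data is not uniformly random, so treat the result as an order-of-magnitude estimate only.

```python
def false_positive_probability(magic_len: int) -> float:
    """Chance that magic_len uniformly random bytes equal one fixed signature."""
    return 256.0 ** -magic_len


for n in (2, 4, 8):
    print(n, false_positive_probability(n))
```

Under that assumption, a 4-byte signature matches random data about once in 2^32 (roughly 4.3 billion) tries, and every extra byte divides the chance by another 256.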

Also, the signature doesn't have to be at offset 0. It has conventionally been placed there because it makes sense to store first whatever will be processed first.
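
One well-known case of a signature away from offset 0 is the tar format, whose `ustar` magic sits at byte offset 257 of the header. The sketch below builds a tiny archive in memory with Python's standard `tarfile` module and checks for it; `has_signature_at` is a hypothetical helper, not part of any library.

```python
import io
import tarfile


def has_signature_at(data: bytes, magic: bytes, offset: int) -> bool:
    """Return True if magic appears in data exactly at the given offset."""
    return data[offset:offset + len(magic)] == magic


# Build a one-file tar archive in memory so the example is self-contained.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    content = b"hi"
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(content)
    tar.addfile(info, io.BytesIO(content))
archive = buf.getvalue()

print(has_signature_at(archive, b"ustar", 257))
```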

That said, a single file signature should not be the only line of defense. The actual parsing process itself should be able to verify integrity and weed out invalid files even if the signature matches. This can be done with additional file signatures, using length-sensitive data, value/range checking, and especially, hash/checksum values.
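
A minimal sketch of such layered checks, assuming a made-up header layout (8-byte signature, 16-bit version, 32-bit payload length, 32-bit CRC-32, all little-endian). The signature, field order, and function names are illustrative assumptions, not an existing format.

```python
import struct
import zlib

MAGIC = b"\x89MYFMT\x1a\n"        # hypothetical signature
HEADER = struct.Struct("<8sHII")  # magic, version, payload length, CRC-32
MAX_VERSION = 1


def pack(payload: bytes, version: int = 1) -> bytes:
    """Build a file image: header followed by the payload."""
    header = HEADER.pack(MAGIC, version, len(payload), zlib.crc32(payload))
    return header + payload


def parse(data: bytes) -> bytes:
    """Validate the header step by step and return the payload."""
    if len(data) < HEADER.size:
        raise ValueError("file too short for header")
    magic, version, length, crc = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("bad signature")
    if not 1 <= version <= MAX_VERSION:      # value/range check
        raise ValueError("unsupported version")
    payload = data[HEADER.size:]
    if len(payload) != length:               # length-sensitive check
        raise ValueError("length mismatch")
    if zlib.crc32(payload) != crc:           # checksum check
        raise ValueError("checksum mismatch")
    return payload
```

With this layout, a file that merely happens to start with the right bytes still has to pass the version, length, and checksum checks before it is accepted.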

bryc