1

I plan to build a unique file name based on its content. For example, by its SHA256 hash. Files with the same content must have the same name.

The easiest way is to convert hash to a hex string. A file name will be 32 bytes length * 2 = 64 characters. This is pretty long name to operate with. How to make it shorter?

I implemented a sort of "Base32" coding - a vocabulary string that includes digits and 22 letters. I use only five bits of every byte to build file name with 32 characters. Much better.

I am looking for a balance between file name length and low collision probability. If the number of files is expected to be less than 500K, how long should the filename be? 8? 16? 24? 32?

Is there any recommended method to build short unique filenames at all?

wmlab
  • 13
  • 3
  • if the filenames don’t need to be legible, we can generate them by bruteforce, for 500k files it should be fairly fast and the length of names would be between 1-5(using 26 alphabets). We can make it shorter by including non-alphabetic characters too. – sharpnife Jan 17 '23 at 13:49
  • 1
    Using A-Z,a-z,0-9, plus `-` and `_`, you can write the hashes in base-64 instead of base-16 or base-32. – Stef Jan 17 '23 at 15:48
  • 1
    Actually that's what `base64.urlsafe_b64encode` does in python: see the answer to [Short Python alphanumeric hash with minimal collisions](https://stackoverflow.com/questions/2510716/short-python-alphanumeric-hash-with-minimal-collisions) – Stef Jan 17 '23 at 16:01
  • 1
    For your question about reasonable filename lengths, see [Probability of collision with truncated SHA-256 hash?](https://stackoverflow.com/questions/19962424/probability-of-collision-with-truncated-sha-256-hash) or [Wikipedia: Birthday problem # Probability table](https://en.wikipedia.org/wiki/Birthday_problem#Probability_table). – Stef Jan 17 '23 at 16:12
  • With 500.000 files and assuming you keep the first 8 characters of the SHA256 in base-64 encoding, you get approximately 0.1% chance of a collision, which can be considered high for any serious system, but this probability drops very fast if you add a few more characters. – Stef Jan 17 '23 at 16:23
  • Stef, NTFS is case-insensitive file system, so I implemented Base32, not Base64 encoding. – wmlab Jan 17 '23 at 18:52
  • Taking the SHA256 hash of a file is a pretty expensive operation. Unless there's some other reason for the filename depending solely on the SHA256 hash (for example, you want the ability to verify authenticity later by generating the SHA256 and comparing it against the filename) you might be better off with something much simpler. A GUID, for example, that you compress with base-64? – Jim Mischel Jan 17 '23 at 18:53
  • Stef, yes. I calculated probability for twelve Base32 characters below. It is going to be 1e-5% for 500K hashes, which is very good to me. Thank you! – wmlab Jan 17 '23 at 19:54

2 Answers2

0

Number of collisions depend on the content of the files, the hash-algorithm and the length of the hash.

In general: The longer the hash-value is the less likely are collisions (if your content does not especially provoke collisions).

You cannot avoid the possibility of collisions unless you use the content as file-name (or a lossless compression of it).

To shorten the filenames you could allow more different characters for the file-name. (But we aware what characters your OS allows and which you are willing to use).

I would go for a kind of base32 encoding to avoid problems with filesystems that do not distinguish between upper and lower case character.

MrSmith42
  • 9,961
  • 6
  • 38
  • 49
0

If you use an N-bit cryptographic hash on M files, you can estimate the probability of at least one collision to be M2/2N+1

For 500K files, that's about 1/2N-37

Using base32, 16 chars gives probability of collision 1/243 -- a few trillion to 1 odds.

If that won't do, then 24 chars gives 1/283.

If you're willing to check them all and re-generate on collision, then 8 chars is fine.

Matt Timmermans
  • 53,709
  • 3
  • 46
  • 87
  • Thank you for the formula, Matt! Eight characters in Base32 are 8*5=40 bits, so the probability to get a collision within 500K hashes is 11%, which is not good. Nine chars make 0.35%, ten - 0.01%, twelve - 1e-5%. I believe 11 or 12 Base32 chars make pretty robust file name in my case. Thank you! – wmlab Jan 17 '23 at 19:32
  • If you're going to check and regenerate, then 11% chance of collision means only 12% extra time required to generate a unique set on average -- maybe a small price to pay for short names. – Matt Timmermans Jan 17 '23 at 22:17