
I have this situation where I have files on the HDD and I want to cache information about them in a database - information that would otherwise take a long time to parse, given that some of these files can run into gigabytes.

My first intuition is to use the file path as the unique identifier for a file: use it as the key (TEXT/VARCHAR) and store the information as the value in a database table.

Given that under some file systems (especially *nix), file paths can be of effectively unlimited length, it seems like a bad idea to use the file path as a primary key in a database. Indexing on a string field is going to be much slower, not to mention the memory/space constraints.

I thought that, maybe, I could generate a SHA-256 hash from the full file path (/usr/xxx/1/2/../abc.xyz) and use that as the primary key (fixed width) in my database. Another thought would be to generate the SHA-256 hash from the file contents; however, that could also become quite time-consuming.
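The path-hashing idea can be sketched in a few lines; this is a minimal illustration (the `path_key` name is mine, not from any library), showing that the key is a fixed 32 bytes no matter how long the path is:

```python
import hashlib

def path_key(path: str) -> bytes:
    # SHA-256 of the UTF-8 encoded path gives a fixed 32-byte key,
    # regardless of how long the path itself is.
    return hashlib.sha256(path.encode("utf-8")).digest()

key = path_key("/usr/xxx/1/2/../abc.xyz")
print(len(key))  # 32
```

Note that the same path string always yields the same key, so the mapping is deterministic; the open question is only how likely two *different* paths are to collide.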

My question is - in this scenario, are hash collisions just as unlikely as described in the answer on this excellent thread?

Code Poet
  • This is a duplicate of http://stackoverflow.com/q/4014090/1445356 – arserbin3 May 13 '14 at 20:37
  • @arserbin3 I disagree. It is very similar, but not really a duplicate except at the technical level. Under your definition of duplicate, anything asking about collisions for hashes would be a duplicate of the one you have given us a link to. The question you linked hashes file contents and this question hashes file paths. – Michael J. Gray May 14 '14 at 03:27
  • 1
    @Micharel And the question linked has an answer with 12 upvotes clearly answering 'The possibility of a collision does not depend on the size of the files, only on their number.' .. A hash is a hash. If it's files, file paths, dates,names,whatever - it only depends on the number. Its redundant to say 'what about hashing oranges, not apples?'. This is a clear duplicate. – arserbin3 May 14 '14 at 04:23
  • 2
    @arserbin3 Well, a hash is not really a hash. Cryptographers don't actually know what the true output space of any SHA-2 variant is. We know the output size is 256, 384, or 512 bits but that doesn't correspond to 2^256, 2^384, or 2^512 possible outputs. The number could be far lower and it very well could depend on the length of the input. I would argue that you are far more likely to find a hash using 1 MB inputs than you are 20 byte inputs, due to the fact that more iterations of the same function are performed against a larger data set, thus cascading unknown defects in the algorithm. – Michael J. Gray Jun 21 '14 at 19:04

1 Answer


To answer your question: you would only be likely to have issues with hash collisions if you approached 2^128 files in your table. This assumes that inputs may be of any length, that the hash algorithm you are using is perfect (SHA-256 is considered perfect in practice but not proven so in theory), and that the output size is exactly 256 bits.

If you have under a few billion files, you should be fine.
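The "2^128 files" figure comes from the birthday bound: for an ideal 256-bit hash, the probability of any collision among n keys is approximately n^2 / 2^257. A quick back-of-the-envelope check (the function name is mine, for illustration):

```python
# Birthday-bound estimate: with n uniformly random b-bit hashes, the
# probability of at least one collision is roughly n^2 / 2^(b+1).
def collision_probability(n: int, bits: int = 256) -> float:
    return (n * n) / 2 ** (bits + 1)

# Ten billion files -- far more than "a few billion":
p = collision_probability(10**10)
print(p)  # on the order of 1e-58, i.e. negligible
```

At n = 2^128 the estimate reaches 1/2, which is why 2^128 is the usual "danger" threshold; at any realistic file count the probability is vanishingly small.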

Now for my recommendation. I would say that you need to tell us more about your intended use, but your first thought is closer to correct than your hashing approach.

I would use a table like this (T-SQL Syntax for SQL Server):

CREATE TABLE [File]
(
    [Id] BIGINT IDENTITY NOT NULL,
    [Path] VARCHAR(MAX) NOT NULL,

    PRIMARY KEY ([Id])
);

-- Note: SQL Server will not accept a VARCHAR(MAX) column as an index key;
-- cap the column (e.g. VARCHAR(450)) or index a computed prefix instead.
CREATE NONCLUSTERED INDEX [File_Path_IX] ON [File] ([Path]);

Then, you should let your database take care of indexing and making the searches fast. If and only if you experience a major performance issue later down the road, demonstrated by profiling, should you consider changing to a hashing approach. Hashing will impose a massive computational penalty on your preprocessing and will introduce complicated scenarios such as hash collisions and having to resolve them if and when they happen.
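The same let-the-database-index-it approach works in any engine. Here is a minimal sketch using Python's built-in sqlite3 (table and column names are illustrative, not from the question):

```python
import sqlite3

# Cache parsed file information keyed by the path itself; the UNIQUE
# constraint makes the database build a B-tree index on path for us.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE file_cache (
        id   INTEGER PRIMARY KEY,
        path TEXT NOT NULL UNIQUE,
        info TEXT
    )
""")
conn.execute("INSERT INTO file_cache (path, info) VALUES (?, ?)",
             ("/var/data/big.bin", "parsed-summary"))

# Lookup by path is an indexed search -- no hashing on our side needed.
row = conn.execute("SELECT info FROM file_cache WHERE path = ?",
                   ("/var/data/big.bin",)).fetchone()
print(row[0])  # parsed-summary
```

Only if profiling later shows the string index to be the bottleneck would it be worth switching the key to a hash of the path.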

Michael J. Gray
  • 1
    To the individual that has down voted this question, please provide some kind of information as to why you chose to do that. It's not helpful to anyone if you just tick it down without allowing anyone to improve the quality of their answers. – Michael J. Gray May 14 '14 at 16:52
  • 2
    I didn't downvote, but whoever did might have thought you were underestimating SHA-256. 2^128 is much larger than a few billion (and SHA-256 theoretically has 2^256 possibilities if it is perfect). People rely on there not being one collision even if you were to hash every 1MB chunk of data in the world, I have read. Nobody should be worried about a collision. As you said, the thing to worry about is that it's slower. – sudo Jul 15 '14 at 02:07
  • @9000 People incorrectly rely on there not being one single collision. If you were to hash 1 MB chunks of all of the data in the world, well then you might experience a collision on the first and second chunks, or never. It's not able to be determined easily (that's why it's good!), but mathematically it only makes sense that there would be a collision at some point. Do remember that the cryptography community once thought MD5 was as secure as we think SHA-256 is now; 10 years changed everything. – Michael J. Gray Jul 15 '14 at 19:24
  • Doing the math you find out that finding a collision amounts to multiple lottery wins in a row. No practical project, not even the Mars Rover, requires this tiny risk to be mitigated. – usr Jul 31 '17 at 14:57