I have files on the HDD and I want to cache information about them in a database: information that would otherwise take a long time to parse, given that some of these files can run into gigabytes.
My first intuition is to use the file path as a unique identifier for each file: the path would be the key (TEXT/VARCHAR) and the cached information would be the value in a database table.
However, under some file systems (especially on *nix), file paths can be of effectively unlimited length, so using the file path as a primary key in a database seems like a bad idea. Indexing on a variable-length string field is likely to be much slower than indexing on a fixed-width key, not to mention the memory/space constraints.
I thought I could generate a SHA-256 hash from the full file path (e.g. /usr/xxx/1/2/../abc.xyz) and use that as the fixed-width primary key in my database. Another thought would be to generate the SHA-256 hash from the file contents. However, that could become quite time-consuming for files of this size.
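To make the two options concrete, here is a minimal Python sketch of what I have in mind. The function names path_key and content_key are placeholders of my own, and I am assuming the path should be normalized first (so that /usr/xxx/1/2/../abc.xyz and /usr/xxx/1/abc.xyz map to the same key); symlinks are not resolved here.

```python
import hashlib
import os


def path_key(path: str) -> str:
    """Return a fixed-width (64 hex chars) SHA-256 key derived from the file path."""
    # Assumption: normalize first so "a/b/../c" and "a/c" yield the same key.
    # os.path.abspath collapses "." and ".." segments but does not resolve symlinks.
    normalized = os.path.abspath(path)
    return hashlib.sha256(os.fsencode(normalized)).hexdigest()


def content_key(path: str, chunk_size: int = 1 << 20) -> str:
    """Return a SHA-256 key derived from the file contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read 1 MiB at a time so multi-GB files are never loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The path-based key only hashes a short string, so it is cheap, but it changes whenever the file is moved or renamed. The content-based key survives renames, but it has to read every byte of the file, which is the cost I am worried about.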
My question is: in this scenario, are hash collisions just as unlikely as described in the answer provided on this excellent thread?