I am designing cloud storage software on top of a LAMP stack.
Files could have an internal ID, but there would be many advantages to storing them in the servers' filesystems under a hash as the filename rather than an incrementing ID.
Hashes as identifiers in the database would also have a lot of advantages if the currently centralized database were ever sharded or decentralized, or if some sort of master-master high-availability environment were set up. But I am not sure about that yet.
Clients can store files under any string (usually some sort of path and filename).
This string is guaranteed to be unique, because the first level is something like a "bucket" that users have to register, as in Amazon S3 and Google Storage.
My plan is to store each file under the hash of its client-defined path.
This way the storage server can serve the file directly, without asking the database for its ID, because it can compute the hash, and thus the filename, on the fly.
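
To make the idea concrete, here is a minimal sketch of the lookup I have in mind, written in Python for brevity rather than PHP. The names `STORAGE_ROOT` and `storage_path`, the two-level directory fan-out, and the assumption that bucket names never contain "/" are all mine for illustration, not part of any existing system.

```python
import hashlib
from pathlib import PurePosixPath

# Hypothetical root directory of the storage server.
STORAGE_ROOT = PurePosixPath("/srv/storage")


def storage_path(bucket: str, key: str) -> PurePosixPath:
    """Map a client-defined (bucket, key) pair to an on-disk location.

    Assumes bucket names contain no "/", so "bucket/key" is unambiguous.
    """
    identifier = f"{bucket}/{key}".encode("utf-8")
    digest = hashlib.sha1(identifier).hexdigest()  # 40 hex characters
    # Fan out into two directory levels so that no single directory
    # ends up holding every file (an illustrative choice, not a requirement).
    return STORAGE_ROOT / digest[:2] / digest[2:4] / digest


if __name__ == "__main__":
    print(storage_path("photos", "2024/cat.jpg"))
```

The same input always yields the same location, so no database lookup is needed to find the file. The price is that two different paths producing the same digest would silently map to the same file.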
But I am afraid of collisions. I am currently thinking about using SHA-1 hashes.
I have heard that Git also uses hashes as revision identifiers.
I know that the chance of a collision is very low, but it is not zero.
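
To put a rough number on "very low", the usual birthday approximation can be sketched as follows. It assumes the hash behaves like a uniformly random 160-bit function, so it only covers accidental collisions. Deliberately constructed SHA-1 collisions have been demonstrated publicly, which is a separate concern. The function name `collision_probability` is hypothetical.

```python
import math


def collision_probability(num_files: int, hash_bits: int = 160) -> float:
    """Approximate chance that any two of num_files inputs share a digest.

    Birthday approximation: p ~= 1 - exp(-n * (n - 1) / 2^(bits + 1)),
    assuming digests are uniformly distributed.
    """
    exponent = num_files * (num_files - 1) / 2 ** (hash_bits + 1)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    for n in (10**6, 10**9, 10**12):
        print(f"{n:>16,d} files: p ~ {collision_probability(n):.2e}")
```

Under that assumption, even a trillion stored paths give a probability on the order of 1e-25.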
I just cannot judge this. Would you or would you not rely on a hash for this purpose?
I could also use some normalized encoding of the path, maybe Base64, as the filename, but I really do not want that, because it could get messy, paths could get too long, and there may be other complications.
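
As a sketch of why the Base64 alternative worries me: the encoded name is about a third longer than the path itself, while a SHA-1 hex digest is always 40 characters. The limit `NAME_MAX = 255` is an assumption (it is the usual per-filename limit on Linux filesystems such as ext4), and the helper names are hypothetical. The URL-safe alphabet is used because standard Base64 can emit "/", which cannot appear in a filename.

```python
import base64

# Assumed maximum filename length in bytes (typical for ext4 and similar).
NAME_MAX = 255


def base64_filename(path: str) -> str:
    """Encode a client-defined path with the URL-safe Base64 alphabet."""
    return base64.urlsafe_b64encode(path.encode("utf-8")).decode("ascii")


def fits_in_filename(path: str) -> bool:
    """True if the encoded path stays within the assumed filename limit."""
    return len(base64_filename(path)) <= NAME_MAX


if __name__ == "__main__":
    for path in ("bucket/short.txt", "bucket/" + "x" * 200):
        name = base64_filename(path)
        print(len(path), len(name), fits_in_filename(path))
```

With these assumptions, any path longer than roughly 190 bytes no longer fits into a single filename.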