After reading this, it sounds like a great idea to store files using the SHA-1 for the directory.

I have no idea what this means, however. All I know is that SHA-1 and MD5 are hashing algorithms. If I calculate the SHA-1 hash using this ruby script and then change the file's content (which changes the hash), how do I know where the file is stored?

My question is then, what are the basics of implementing a SHA-1/file-storage system?

If all of the files are changing content all the time, is there a better solution for storing them, or do you just have to keep updating the hash?

I'm just thinking about how to create a generic file storing system like GoogleDocs, Flickr, Youtube, DropBox, etc., something that you could reuse in different environments (such as storing PubMed journal articles or Cramster homework assignments and tests, or just images like on Flickr). I'd probably store them on Amazon EC2. Just some system so I can say "this is how I'll 99% of the time do file storing from now on", so I can stop thinking about building a solid/consistent way to store files and get onto some real problems.

Lance
  • i think the idea is to compute the hash just before storing at the path (after you did change the content of the file). that way you should have no problems? – tosh Nov 22 '09 at 17:23
  • so if all of the files are changing content all the time, is there a better solution for storing them, or do you just have to keep updating the hash? – Lance Nov 22 '09 at 17:25
  • You just use the initial hash. You don't need to keep updating. Save the hash in the database. – Kenji Kina Nov 22 '09 at 17:27
  • hashing solves 2 problems: picking names for files that make collisions unlikely, and avoiding performance issues on some filesystems that handle large numbers of files in one directory poorly (by splitting the hash into pieces that form the path). this has become sort of best practice and some frameworks help you with it. though you might get more precise/applicable answers for your situation if you specify what you intend to do with the files :) – tosh Nov 22 '09 at 17:48

4 Answers

First of all, if the contents of the files change, naming files after their SHA digest is not a very good fit, because the name and location of a file in the filesystem must change whenever its contents change.


Basically you first compute a SHA-1 or MD5 digest (= hash value) from the contents of the file.

Once you have a digest, for example 00e4f56c0de1c61fdb926e79e8a0a65bd12930c9, you derive the file's location and name from it. One way is to use the first few characters of the digest for the directory structure and the remaining characters for the file name:

 00e4f56c0de1c61fdb926e79e8a0a65bd12930c9 => some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt

This way you only need to store the SHA-1 digest of the file in the database. From the digest you can always work out the location and name of the file.
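
The mapping described above can be sketched in a few lines of Python. This is only an illustration: the function names `sha1_of_file` and `path_for_digest`, and the `levels`, `width` and `suffix` parameters, are made up for this example and do not come from any particular framework.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file's contents, reading in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def path_for_digest(root, digest, levels=3, width=2, suffix=""):
    """Map a hex digest to root/00/e4/f5/<rest of digest><suffix>.

    levels and width are illustrative defaults: three directory levels
    of two hex characters each, as in the example above.
    """
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    rest = digest[levels * width:]
    return Path(root, *parts, rest + suffix)


print(path_for_digest("some/path",
                      "00e4f56c0de1c61fdb926e79e8a0a65bd12930c9",
                      suffix=".txt").as_posix())
# some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt
```

Because the path is a pure function of the digest, the database only has to hold the digest (plus whatever human-readable metadata you want alongside it).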

Directories usually also have a maximum number of entries they can contain, for example a maximum of 32,000 subdirectories and files per directory. A directory structure based on this kind of hashing makes it unlikely that you store too many files in the same directory. Hashing like this also ensures that every directory holds about the same number of files, so you won't end up with all your files in one directory.
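
If you want to convince yourself of the even spread, a rough experiment is to hash a batch of inputs and count how many land in each top-level directory. The inputs `file-0`, `file-1`, ... below are made-up stand-ins for real file contents, and the sample size is arbitrary.

```python
import hashlib
from collections import Counter

# Count how many of 100,000 synthetic "files" fall into each two-character
# top-level directory (00 .. ff) when named after their SHA-1 digest.
counts = Counter(
    hashlib.sha1(f"file-{i}".encode()).hexdigest()[:2]
    for i in range(100_000)
)

print("directories used:", len(counts))
print("fewest files in a directory:", min(counts.values()))
print("most files in a directory:", max(counts.values()))
```

With two hex characters per level there are 256 possible directories at each level, so the smallest and largest counts should both sit fairly close to 100,000 / 256, which is about 390.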

Juha Syrjälä
  • okay, then you'd store the file name and some tags with the hash and whatever else in the database, and you could get the file from the filesystem with the hash. that way you could store some human readable info with the file reference, but not have the file itself. the hash is just for making it uniform, optimized, and easy to program, it doesn't need to be human readable. got it, thanks! – Lance Nov 22 '09 at 17:32
  • @viatropos, yep, thats about it. You could also give every file an unique number from sequence and use that instead of the SHA-1 digest. – Juha Syrjälä Nov 22 '09 at 17:40
  • if you intend to replace the old file with the new version using the same path, make sure the operation is atomic. else you might get into trouble if someone requests the file while you are still writing down the new one. imho it would not hurt to save the new file to another location. and think about a way to remove the old/outdated versions from time to time if you run into storage-space problems :) – tosh Nov 22 '09 at 18:02
  • But what about deleting? Example: 2 users uploaded the same file, so only one file will exist because the hash (path) is the same. When one of them deletes the photo, the second user will lose it too. Am I right? – binball Mar 20 '13 at 12:23
  • You must then keep a counter for each file that tells how many "copies" there are of given file. On each update, increment counter, on each delete, decrement counter. If the counter goes to zero, then it is safe to remove file. You can keep the counters e.g. in database. – Juha Syrjälä Mar 20 '13 at 18:43
  • There is a potential issue with hash collisions if you only use the digest as a file name. I think solving it is beyond the scope of the question, but definitely you might want to add some note about it in your answer. – To마SE Jun 10 '14 at 09:40

The idea is not to change the file content, but rather its name (and path), by using a hash value.

Changing the content with a hash would be disastrous since a hash is normally not reversible.

I'm not sure of the motivation for using a hash rather than the file name (or even a long random number), but here are a few advantages of the hash approach:

  • the file names on the disk are uniform
  • the upper or lower parts of the hash value can be used to name the directories and hence distribute the files relatively uniformly
  • the name becomes a code, making it difficult for someone to a) guess a file name b) categorize pictures (should someone steal the hard drive contents)
  • the filename and location can be derived from the file contents themselves, assuming the hash is computed from those contents (I'm not quite sure which use case would need this; it seems a bit contrived)

The general point of using a hash is that, unlike a file name, a hash is meaningless, so the database is needed to relate images to "bibliographic" data (name of uploader, date of upload, tags, ...).

In thinking about it, re-reading the referenced SO response, I don't really see much of an advantage of a hash, as compared to, say, a random number...

Furthermore, hashes produce a numeric value that is typically expressed in hexadecimal (as seen in the referenced SO question). This could be seen as wasteful, since it makes the file names longer than they need to be and hence puts more stress on the file system (bigger directories...).
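
To put a number on the length concern, here is a small comparison of the same 20-byte SHA-1 digest written in hexadecimal, base32 and URL-safe base64. The input bytes are an arbitrary placeholder, and the choice of encodings is just an illustration, not something the referenced answer prescribes.

```python
import base64
import hashlib

raw = hashlib.sha1(b"example file contents").digest()  # 20 raw bytes

hex_name = raw.hex()
# Lowercased base32 uses only a-z and 2-7, so it survives
# case-insensitive file systems.
base32_name = base64.b32encode(raw).decode("ascii").lower()
# URL-safe base64 avoids "/" in names; the trailing "=" padding is dropped.
base64_name = base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")

print(len(hex_name), len(base32_name), len(base64_name))
# 40 32 27
```

Note that base64 mixes upper and lower case, so two different names could clash on a case-insensitive file system. The shorter name is therefore a trade-off rather than a free win.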

mjv
  • If you use hash then several identical copies of the same files are stored to same location. With random number, the files will be stored to different locations. This may be an advantage or disadvantage, depending on your case. – Juha Syrjälä Nov 22 '09 at 17:44
  • what does someone like flickr or facebook do? – Lance Nov 22 '09 at 17:52
  • here is some interesting information about facebook's photo storage infrastructure http://www.facebook.com/note.php?note_id=76191543919 – tosh Nov 22 '09 at 18:28
  • thanks! http://stackoverflow.com/questions/1779609/how-big-of-a-team-does-it-take-to-make-this-huge-of-a-file-uploading-architecture – Lance Nov 22 '09 at 19:03

One advantage I see with storing files using their hash is that the file data only needs to be stored once and can then be referenced multiple times within your database. This will save you space if you have different users uploading the exact same file.

However, the downside is that when a user deletes what they think is their file from your app, you can't just physically delete the file from disk, because other users who uploaded the exact same file may still be using it.
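
One way to handle this, as suggested in the comments on another answer, is to keep a counter per stored file. Below is a minimal sketch of that idea. `ContentStore` is a hypothetical class written for this example, the in-memory dict stands in for what would be a database table in a real system, and nothing here deals with concurrent access.

```python
import hashlib
from pathlib import Path


class ContentStore:
    """Toy content-addressed store with a reference count per digest."""

    def __init__(self, root):
        self.root = Path(root)
        self.refcounts = {}  # digest -> number of uploads (a DB table in practice)

    def path_for(self, digest):
        # Two directory levels of two hex characters each (illustrative choice).
        return self.root / digest[:2] / digest[2:4] / digest[4:]

    def add(self, data):
        """Store data once per distinct content and return its digest."""
        digest = hashlib.sha1(data).hexdigest()
        if digest not in self.refcounts:
            path = self.path_for(digest)
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(data)
        self.refcounts[digest] = self.refcounts.get(digest, 0) + 1
        return digest

    def delete(self, digest):
        """Drop one reference; remove the file only when none are left."""
        self.refcounts[digest] -= 1
        if self.refcounts[digest] == 0:
            del self.refcounts[digest]
            self.path_for(digest).unlink()
```

If two users upload the same bytes, `add` writes the file once and the count becomes 2. The first `delete` only decrements the count, and the file is removed from disk on the second.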

Cosmin
Techlands

The idea is that you need to come up with a name for the photo, and you probably want to scatter the files among a number of directories. One easy way to come up with a unique name is to use the hash.

So the beginning of the hash is peeled off to form a multi-level directory structure, and the rest of the hash is used as the filename for the jpg.

This has the additional benefit of detecting duplicate uploads.

DigitalRoss