2

I've got a file cache, with files downloaded from different URLs. I'd like to save each file under the name of its URL. These names can be quite long though, and I'm on a device using a FAT32 file system, so the long names are eating up resources well before I run out of actual disk space.

I'm looking for a way to shorten the filenames, and have gotten suggestions to hash the strings. But I'm not sure whether the hashes are guaranteed to be unique for two different strings. It would be bad if I accidentally fetched the wrong image because two different URLs came up with the same hash value.

Asclepius
user291701
  • I think you're going to run into trouble hashing filenames: hashes (IMHO) can produce duplicate entries... – Marco Sep 01 '11 at 18:49
  • When you say "long names are eating up resources well before I run out of actual disk space" , I'm feeling slightly skeptical. Not sure why though. But isn't storage pretty cheap anyway? – Caffeinated Sep 01 '11 at 18:50
  • @Marco, Agreed, hashes can produce duplicates ("collisions"). You should make some collision handler that tries a new hash if a collision occurs... – poplitea Sep 01 '11 at 18:51
  • Hashes are not guaranteed to be unique -- and some hashes (e.g. md5 or sha1) get their "uniqueness" simply based on *how large a space they encompass*. But then they might not be shorter at all ;-) –  Sep 01 '11 at 18:53
  • @Marco, poplitea: Hashes *can* produce collisions but the probability is so small as to be completely ignorable. Even if you're using MD5 (output size 128 bits) then with a billion entries the chance of a collision is still less than 10^-18. – Cameron Skinner Sep 01 '11 at 18:55
  • check this http://stackoverflow.com/questions/742013/how-to-code-a-url-shortener – lctr30 Sep 01 '11 at 18:55
  • @Cameron, you're right; but I couldn't sleep with even one possibility I can have a duplicate!! :-) I'm joking, anyway there are so easy and fast solutions that having a collision risk is not a good idea ;) Look at my answer... simple solution without sweat – Marco Sep 01 '11 at 19:00
  • @Marco: Well, the OPs concern is about disk usage. Storing the full path eats up space and isn't going to solve the disk usage problem. – Cameron Skinner Sep 01 '11 at 19:03
  • @Cameron, I think the OP is concerned about the fact that FAT32 cannot have too many files in a single dir... If this is true, hashing filenames is not a solution. If he's worried about "long" filenames, my solution is probably shorter than hashing... Am I wrong? :) I think the problem is FAT occupation, not disk occupation... – Marco Sep 01 '11 at 19:07
  • @Marco: You can store files in prefix directories to reduce the number of files in a single directory. That's a pretty common approach - take a look at the way Git, for example, stores its blobs. It uses 2-character prefix directories. – Cameron Skinner Sep 01 '11 at 19:11
  • @Cameron: you're right. The point is: using hashes is risky (very little risk, but risky) so OP could want to find another solution... we proposed some, everyone good. He has to understand what he really needs I think... – Marco Sep 01 '11 at 19:15

7 Answers

5

You could generate a UUID for each URL and use it as the file name.

UUIDs are unique (or "practically unique") and are 36 characters long, so I guess the file name wouldn't be a problem.

As of version 5, the JDK ships with a class to generate UUIDs (java.util.UUID). You could use randomly generated UUIDs if there's a way to associate them with the URLs, or you could use name-based UUIDs. Name-based UUIDs are always the same for the same input, so the following is always true:

String url = ...
UUID urlUuid = UUID.nameUUIDFromBytes(url.getBytes());
assertTrue(urlUuid.equals(UUID.nameUUIDFromBytes(url.getBytes())));
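For context, a minimal self-contained sketch of the idea (the URL and cache directory are hypothetical, and getBytes() with the platform default charset simply mirrors the snippet above; a fixed charset such as UTF-8 would be more robust):

import java.io.File;
import java.util.UUID;

public class UuidCacheName {
    public static void main(String[] args) {
        // Hypothetical URL, just for illustration.
        String url = "http://example.com/images/some-very-long-picture-name.png";

        // Name-based (type 3) UUID: the same URL always yields the same UUID.
        UUID urlUuid = UUID.nameUUIDFromBytes(url.getBytes());

        // 36-character file name, well within FAT32 limits.
        File cached = new File("/path/to/cache", urlUuid.toString());
        System.out.println(cached);
    }
}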
Andre Rodrigues
3

There's no (shortening) hash which can guarantee different hashes for each input. It's simply not possible.

The way I usually do it is by saving the original name at the beginning (e.g., the first line) of the cache file. To find a file in the cache you then do this:

  • Hash the URL
  • Find the file corresponding to that hash
  • Check its first line. If it matches the full URL, the rest of the file (from line two onward) is the cached content (see the sketch below).
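A rough sketch of that layout (MD5 as the hash and the byte-by-byte first-line read are just placeholder choices; it assumes the URL itself is ASCII-safe, e.g. percent-encoded):

import java.io.*;
import java.security.MessageDigest;

public class UrlFirstLineCache {
    private final File dir;

    public UrlFirstLineCache(File dir) { this.dir = dir; }

    // Hex-encoded MD5 of the URL; any reasonably distributed hash works here.
    private static String hash(String url) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(url.getBytes("UTF-8"));
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Store: first line is the full URL, the rest is the payload.
    public void put(String url, byte[] data) throws Exception {
        OutputStream out = new FileOutputStream(new File(dir, hash(url)));
        out.write((url + "\n").getBytes("UTF-8"));
        out.write(data);
        out.close();
    }

    // Retrieve: only trust the file if its first line really is our URL.
    public InputStream get(String url) throws Exception {
        File f = new File(dir, hash(url));
        if (!f.exists()) return null;
        InputStream in = new BufferedInputStream(new FileInputStream(f));
        StringBuilder firstLine = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '\n') firstLine.append((char) c);
        if (!firstLine.toString().equals(url)) { in.close(); return null; } // hash collision
        return in; // positioned at the start of the payload
    }
}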

You can also consider saving the URL->file mapping in a database.

Emil Vikström
2

But I'm not sure if the hashes are guaranteed to be unique for two different strings.

They very much aren't (and cannot be, due to the pigeonhole principle). But if the hash is sufficiently long (at least 64 bit) and well-distributed (ideally a cryptographic hash), then the likelihood of a collision becomes so small that it's not worth worrying about.

As a rough guideline, collisions will become likely once the number of files approaches the square root of the number of possible different hashes (birthday paradox). So for a 64 bit hash (10 character filenames), you have about a 50% chance of one single collision if you have 4 billion files.

You'll have to decide whether that is an acceptable risk. You can reduce the chance of collision by making the hash longer, but of course at some point that will mean the opposite of what you want.
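To get a feel for those numbers, here is a small sketch using the usual birthday approximation p ≈ 1 - e^(-n²/2N), where n is the number of files and N the number of possible hash values (the bit lengths and file counts below are just example figures):

public class CollisionOdds {
    // Birthday approximation: probability of at least one collision among
    // n random values drawn uniformly from 2^bits possibilities.
    static double collisionProbability(double n, int bits) {
        double space = Math.pow(2, bits);
        return 1 - Math.exp(-(n * n) / (2 * space));
    }

    public static void main(String[] args) {
        System.out.println(collisionProbability(5e9, 64));   // ~0.49: billions of files get risky with 64 bits
        System.out.println(collisionProbability(1e6, 64));   // ~3e-8: a million files is comfortably safe
        System.out.println(collisionProbability(1e6, 128));  // prints 0.0 (true value is around 1.5e-27)
    }
}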

Michael Borgwardt
1

Currently, the SHA-1 algorithm is recommended. There are no known practical ways to intentionally provoke collisions for this algorithm, so you should be safe. Provoking collisions with two pieces of data that have common structure (such as the http:// prefix) is even harder. If you save this stuff after you get an HTTP 200 response, then the URL obviously fetched something, so getting two distinct, valid URLs with the same SHA-1 hash really should not be a concern.
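A sketch of deriving such a name (the hex encoding and the example URL are just illustrative choices; the 40-character result is short and FAT32-friendly):

import java.security.MessageDigest;

public class Sha1FileName {
    // 40-character hex SHA-1 of the URL, usable directly as a file name.
    static String sha1Hex(String url) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(url.getBytes("UTF-8"));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical URL, just for demonstration.
        System.out.println(sha1Hex("http://example.com/some/very/long/image/url.png"));
    }
}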

If it's of any reassurance, Git uses it to identify all objects, commits and folders in the source code repository. I've yet to hear of someone with a collision in the object store.

André Caron
1

Hashes are not guaranteed to be unique, but the chance of a collision is vanishingly small.

If your hash is, say, 128 bits then the chance of a collision for any pair of entries is 1 in 2^128. By the birthday paradox, if you had 10^18 entries in your table then the chance of a collision is only 1%, so you don't really need to worry about it. If you are extra paranoid then increase the size of the hash by using SHA256 or SHA512.

Obviously you need to make sure that the hashed representation actually takes up less space than the original filename. Base-64 encoded strings represent 6 bits per character so you can do the math to find out if it's even worth doing the hash in the first place.

If your file system barfs because the names are too long then you can create prefix subdirectories for the actual storage. For example, if a file maps to the hash ABCDE then you can store it as /path/to/A/B/CDE, or maybe /path/to/ABC/DE depending on what works best for your file system.
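A rough sketch of such a layout (the two-character, two-level split and the cache root path are arbitrary choices; the /A/B/CDE split above works the same way):

import java.io.File;

public class PrefixDirs {
    // Spread files across subdirectories keyed by the leading characters of the hash,
    // e.g. hash "abcdef0123456789" -> <root>/ab/cd/ef0123456789
    static File locate(File cacheRoot, String hash) {
        File dir = new File(new File(cacheRoot, hash.substring(0, 2)), hash.substring(2, 4));
        dir.mkdirs(); // make sure the prefix directories exist
        return new File(dir, hash.substring(4));
    }

    public static void main(String[] args) {
        System.out.println(locate(new File("/path/to/cache"), "abcdef0123456789"));
    }
}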

Git is a good example of this technique in practice.

Cameron Skinner
  • Even a 128 bit hash probably defeats the original intent of making filenames shorter. – Michael Borgwardt Sep 01 '11 at 18:58
  • Base64-encoded is only 22 characters. If that's still too big for FAT32 then using a different file system is probably a better solution. Seriously, FAT32 is still in use? – Cameron Skinner Sep 01 '11 at 19:01
  • FAT32 can have much longer filenames. The concern seems to be with a very large *number* of long filenames. If the filenames are based on URLs, then using 22 character hashes probably still results in a decrease of the average length. But with two or four times that, probably not. – Michael Borgwardt Sep 01 '11 at 19:05
1

What you can do is save the files under an index number and use an index file to find the location of the actual file.

In the directory you have:

index.txt
file1
file2
...
etc.

and in index.txt you use some data structure to find the filenames efficiently (or replace it with a DB).
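A minimal sketch of such an index, here kept as url=fileN pairs in a java.util.Properties file (the index format and the "fileN" naming scheme are just one possible choice):

import java.io.*;
import java.util.Properties;

public class FileIndex {
    private final File dir;
    private final File indexFile;
    private final Properties index = new Properties(); // url -> short file name

    public FileIndex(File dir) throws IOException {
        this.dir = dir;
        this.indexFile = new File(dir, "index.txt");
        if (indexFile.exists()) {
            InputStream in = new FileInputStream(indexFile);
            index.load(in);
            in.close();
        }
    }

    // Returns the short file for this URL, assigning "file<N>" on first use.
    public File fileFor(String url) throws IOException {
        String name = index.getProperty(url);
        if (name == null) {
            name = "file" + (index.size() + 1);
            index.setProperty(url, name);
            OutputStream out = new FileOutputStream(indexFile);
            index.store(out, "url -> cache file");
            out.close();
        }
        return new File(dir, name);
    }
}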

ratchet freak
0

Look at my comments above.
One possible solution (there are a lot) is creating a local store (SQLite? XML? TXT?) in which you keep (file_id - file_name) pairs, so you can save each downloaded file with its unique ID as the filename.
Just an idea, not necessarily the best...
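For illustration, a sketch of that mapping kept in SQLite over JDBC (this assumes a SQLite JDBC driver such as xerial's sqlite-jdbc on the classpath; the table name and paths are made up):

import java.sql.*;

public class SqliteUrlIndex {
    public static void main(String[] args) throws Exception {
        Class.forName("org.sqlite.JDBC"); // register the driver (needed on older JDBC setups)
        Connection db = DriverManager.getConnection("jdbc:sqlite:/path/to/cache/index.db");
        Statement s = db.createStatement();
        s.executeUpdate("CREATE TABLE IF NOT EXISTS files (" +
                "file_id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT UNIQUE)");

        // Map a (hypothetical) URL to a numeric id; the downloaded file is then
        // stored on disk under that short id instead of the long URL.
        String url = "http://example.com/some/very/long/image/url.png";
        PreparedStatement ins = db.prepareStatement("INSERT OR IGNORE INTO files (url) VALUES (?)");
        ins.setString(1, url);
        ins.executeUpdate();

        PreparedStatement sel = db.prepareStatement("SELECT file_id FROM files WHERE url = ?");
        sel.setString(1, url);
        ResultSet rs = sel.executeQuery();
        if (rs.next()) {
            System.out.println("store as: /path/to/cache/" + rs.getInt("file_id"));
        }
        db.close();
    }
}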

Marco