19

Given these two images from Twitter:

http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg
http://a1.twimg.com/profile_images/58079916/lowres_profilepic.jpg

I want to download them to the local filesystem and store them in a single directory. How can I avoid name conflicts?

In the example above, I cannot store both as lowres_profilepic.jpg. My design idea is to treat the URLs as opaque strings except for the last segment. What algorithms (implemented as f) can I use to hash the prefixes into unique strings?

f( "http://a3.twimg.com/profile_images/130500759/" ) = 6tgjsdjfjdhgf
f( "http://a1.twimg.com/profile_images/58079916/" )  = iuhd87ysdfhdk

That way, I can save the files as:

6tgjsdjfjdhgf_lowres_profilepic.jpg
iuhd87ysdfhdk_lowres_profilepic.jpg

I don't want a cryptographic algorithm as this needs to be a performant operation.

Jacques René Mesrine
  • Have you actually benchmarked cryptographic hashes on your platform? Unless you're using 20 year old hardware, it's highly unlikely that hashing a short string is going to be in the same ballpark as, say, fetching the image in the first place. – Nick Johnson Oct 28 '09 at 10:27

12 Answers

18

Irrespective of how you do it (hashing, encoding, database lookup), I recommend that you don't try to map a huge number of URLs to files in one big flat directory.

The reason is that file lookup on most file systems involves a linear scan through the filenames in a directory. So if all N of your files are in one directory, a lookup will involve N/2 comparisons on average; i.e. O(N). (Note that ReiserFS organizes the names in a directory as a B-tree. However, ReiserFS seems to be the exception rather than the rule.)

Instead of one big flat directory, it would be better to map the URIs to a tree of directories. Depending on the shape of the tree, lookup can be as good as O(log N). For example, if you organized the tree so that it had 3 levels of directory with at most 100 entries in each directory, you could accommodate 1 million URLs. If you designed the mapping to use 2-character filenames, each directory should easily fit into a single disk block, and a pathname lookup (assuming that the required directories are already cached) should take a few microseconds.
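As an illustration of such a layout (this sketch and its names are my own, not part of the answer), the Java snippet below spreads files across a fixed-depth tree using two hex characters of the hash per directory level:

import java.nio.file.Path;
import java.nio.file.Paths;

public class HashedDirectoryLayout {

    // Map a hex hash and a base filename onto a 3-level directory tree,
    // using 2 hex characters (up to 256 entries) per directory level.
    static Path pathFor(String hexHash, String baseName) {
        return Paths.get(
                hexHash.substring(0, 2),    // level 1
                hexHash.substring(2, 4),    // level 2
                hexHash.substring(4, 6),    // level 3
                hexHash + "_" + baseName);  // the actual file
    }

    public static void main(String[] args) {
        // "6a3f..." stands in for whatever hash of the URL prefix you choose.
        System.out.println(pathFor("6a3f9b21c4d07e85", "lowres_profilepic.jpg"));
        // prints a path like 6a/3f/9b/6a3f9b21c4d07e85_lowres_profilepic.jpg
    }
}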

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
  • 3
    Nowadays filesystems usually use trees for their file structure. – Gumbo Dec 12 '09 at 10:03
  • 1
    There are other reasons why big flat directories can lead to performance problems; e.g. programs that read and sort directory entries. – Stephen C Sep 28 '12 at 08:34
  • 1
    even today Windows still falls over itself if you have a bazillion files in one folder. plus if you had a number of folders you could delete them one folder at a time to prevent your entire cache expiring at once. – Simon_Weaver May 18 '19 at 08:11
10

It seems what you really want is to have a legal filename that won't collide with others.

  • Any encoding of the URL will work, even base64: e.g. filename = base64(url) - see the sketch below
  • A crypto hash will give you what you want - although you claim this will be a performance bottleneck, don't assume that until you've benchmarked it
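A minimal Java sketch of the base64 option (the class and method names are mine; the URL-safe alphabet is used so the result contains no '/' characters):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class UrlToFilename {
    // Encode the whole URL with the URL-safe base64 alphabet ('-' and '_'
    // instead of '+' and '/') so the result is a legal, reversible filename.
    static String filenameFor(String url) {
        return Base64.getUrlEncoder()
                     .withoutPadding()
                     .encodeToString(url.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(filenameFor(
            "http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg"));
    }
}

Because base64 is reversible, you can also recover the original URL from the filename with Base64.getUrlDecoder().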
orip
5

A very simple approach:

f( "http://a3.twimg.com/profile_images/130500759/" ) = a3_130500759.jpg
f( "http://a1.twimg.com/profile_images/58079916/" )  = a1_58079916.jpg

As the other parts of this URL are constant, you can use the subdomain plus the last part of the query path as a unique filename.

I don't know of any problems with this solution.
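A rough Java sketch of that idea, assuming the URLs always follow the http://<subdomain>.twimg.com/profile_images/<id>/<name> pattern (the regex and names are mine):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwimgFilename {
    // Capture the subdomain and the numeric id from a twimg profile-image URL.
    private static final Pattern TWIMG = Pattern.compile(
        "http://(a\\d+)\\.twimg\\.com/profile_images/(\\d+)/(.+)");

    static String filenameFor(String url) {
        Matcher m = TWIMG.matcher(url);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected URL format: " + url);
        }
        // e.g. a3_130500759_lowres_profilepic.jpg
        return m.group(1) + "_" + m.group(2) + "_" + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(filenameFor(
            "http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg"));
    }
}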

guerda
4

The nature of a hash is that it may result in collisions. How about one of these alternatives:

  1. Use a directory tree. Literally create subdirectories for each component of the URL.
  2. Generate a unique id. The problem here is how to keep the mapping between the real name and the saved id. You could use a database that maps a URL to a generated unique id: simply insert a record into a table that generates unique ids, and then use that id as the filename (see the sketch below).
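For option 2, a minimal JDBC sketch (the url_map table, its columns, and the helper name are assumptions; any database that can generate keys would do):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class UrlIdStore {
    // Insert the URL into a table with an auto-increment primary key and
    // use the generated key as the filename prefix.
    static long idFor(Connection conn, String url) throws SQLException {
        String sql = "INSERT INTO url_map (url) VALUES (?)";
        try (PreparedStatement ps =
                 conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, url);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                return keys.getLong(1);  // e.g. 42 -> save as "42_lowres_profilepic.jpg"
            }
        }
    }
}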
djna
  • I did think about using the database for this. – Jacques René Mesrine Oct 27 '09 at 08:57
  • All performance is relative - slipping one extra record to a local database probably compares pretty well with downloading an image. Sure it's not the fastest possible thing that could be done, but I would favour the simplest thing that could work until it's proven too slow. – djna Oct 27 '09 at 11:07
4

One of the key concepts of a URL is that it is unique. Why not use it?

Every algorithm that shortens the information can produce collisions - maybe unlikely, but possible nevertheless.
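If you do use the URL itself, it still has to be a legal filename; one way (my own suggestion, not part of this answer) is to percent-encode it so it cannot contain path separators:

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UrlAsFilename {
    // Percent-encode the full URL so '/' and other unsafe characters cannot
    // appear in the filename (this overload requires Java 10 or later).
    static String filenameFor(String url) {
        return URLEncoder.encode(url, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(filenameFor(
            "http://a1.twimg.com/profile_images/58079916/lowres_profilepic.jpg"));
    }
}

Keep in mind the path-length limits mentioned in the comments below; very long URLs may still need to be truncated or hashed.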

Peter
  • It looks like he's using something correlated with Twitter – guerda Oct 27 '09 at 08:36
  • This is the simplest approach, but he would need to watch out for the 255 character path limit on some OSes (e.g. XP). Note the actual URL may be less than 255, but combined with the parent folder(s) it may be longer and this is painful. Some URLs can be ridiculously long! – Ash Oct 27 '09 at 08:44
  • The _path_ limit on XP is 32767. Not all filesystems support it (e.g. CD-ROMs generally don't), individual _names_ in the path are limited to 255 chars, and you may need to use the full internal path name with the `\\?\` prefix with some APIs. – MSalters Jun 16 '11 at 10:14
3

You can use the UUID class in Java to generate a UUID from any bytes; it is unique, and you won't have a problem with file lookup:

String url = "http://www.google.com";
String shortUrl = UUID.nameUUIDFromBytes(url.getBytes()).toString();
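Note that UUID.nameUUIDFromBytes produces a version-3 (MD5-based) name UUID, so the same URL always maps to the same string - deterministic, which is what you need for a cache, though it is a cryptographic hash under the hood.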
Vasumithra
2

While CRC32 produces a maximum of 2^32 values regardless of your input and so will not avoid conflicts entirely, it is still a viable option for this scenario.

It is fast, so if you generate a filename that conflicts, just add/change a character in your URL and simply re-calculate the CRC.

4.3 billion possible checksums mean the likelihood of a filename conflict, when combined with the original filename, is going to be so low as to be unimportant in normal situations.

I've used this approach myself for something similar and was pleased with the performance. See Fast CRC32 in Software.
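A minimal Java sketch using the JDK's built-in java.util.zip.CRC32 (the helper name and the prefix/basename split are my own illustration):

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class Crc32Filename {
    // Checksum the URL prefix and combine it with the original filename.
    static String filenameFor(String prefix, String baseName) {
        CRC32 crc = new CRC32();
        crc.update(prefix.getBytes(StandardCharsets.UTF_8));
        return String.format("%08x_%s", crc.getValue(), baseName);
    }

    public static void main(String[] args) {
        System.out.println(filenameFor(
            "http://a3.twimg.com/profile_images/130500759/",
            "lowres_profilepic.jpg"));
    }
}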

Ash
1

I see your question is really about which hash algorithm is best for this purpose. You might want to check this: Best hashing algorithm in terms of hash collisions and performance for strings

Svetlozar Angelov
1

The git content management system is based on SHA-1 because it has a very minimal chance of collision.

If it is good enough for git, it should be good enough for you.
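If you decide it is fast enough after measuring, a SHA-1 based filename is only a few lines with the standard library (the hex formatting and method name are my own choices):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Filename {
    // Hash the URL prefix with SHA-1 and render it as lowercase hex,
    // ready to prepend to the original filename.
    static String hashPrefix(String prefix) throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(prefix.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hashPrefix("http://a3.twimg.com/profile_images/130500759/"));
    }
}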

Vereb
  • No cryptographic algos, see the question. – guerda Oct 27 '09 at 08:38
  • This is 2009, I can't imagine it is slow for short URLs. – Vereb Oct 27 '09 at 08:49
  • I know, but if the question opener doesn't want cryptographic algos, your answer does not help. – guerda Oct 27 '09 at 08:54
  • This app runs on mobile phones. – Jacques René Mesrine Oct 27 '09 at 08:58
  • I agree with you that the problem description says not to use it. Now I see it will be on a mobile phone, so I guess you don't need to parse thousands of URLs/sec. A mobile can play videos, which needs more computation. I just say it is worth measuring the performance of SHA-1 with an existing library instead of a difficult method. – Vereb Oct 27 '09 at 09:08
  • Unfortunately floating point calculation is still expensive in managed environments on phones (it is normally emulated using integer arithmetic). Video playback is normally handled by native code. There is a difference. – Jacques René Mesrine Oct 27 '09 at 09:11
1

I'm playing with thumbalizr using a modified version of their caching script, and it has a few good solutions, I think. The code is at github.com/mptre/thumbalizr, but the short version is that it uses MD5 to build the file names, and it takes the first two characters of the hash and uses them to create a folder with exactly that name. This means that it is easy to split up the folders, and fast to find the corresponding folder without a database. Kind of blew my mind with its simplicity.

It generates file names like this: http://pappmaskin.no/opensource/delicious_snapcasa/mptre-thumbalizr/cache/fc/fcc3a328e0f4c1b51bf5e13747614e7a_1280_1024_8_90_250.png

The last part, _1280_1024_8_90_250, matches the different settings that the script uses when talking to the thumbalizr API, but I guess fcc3a328e0f4c1b51bf5e13747614e7a is a straight MD5 of the URL, in this case for thumbalizr.com.

I tried changing the config to generate images 200px wide, and those images go in the same folder, but instead of _250.png they are called _200.png.

I haven't had time to dig that much into the code, but I'm sure it could be pulled apart from the thumbalizr logic and made more generic.
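The same scheme is straightforward to reproduce in Java; this is my own rough translation of the idea, not the thumbalizr code:

import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5CachePath {
    // MD5 the URL, use the first two hex characters as the folder name,
    // and the full hash as the file name - no database needed to find it again.
    static Path cachePathFor(String url) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                                     .digest(url.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        String hash = hex.toString();
        return Paths.get(hash.substring(0, 2), hash + ".png");
    }

    public static void main(String[] args) throws Exception {
        // Prints a path like ab/ab3f...png, where the folder name is
        // the first two characters of the hash.
        System.out.println(cachePathFor("http://thumbalizr.com"));
    }
}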

Morten Skogly
0

You said:

I don't want a cryptographic algorithm as this needs to be a performant operation.

Well, I understand your need for speed, but I think you need to consider the drawbacks of your approach. If you just need to create a hash of URLs, you should stick with an existing one and not write a new algorithm, where you'd have to deal with collisions yourself, for instance.

So you could have a Dictionary<string, string> to work as a cache for your URLs. When you get a new address, first do a lookup in that dictionary and, if you don't find a match, hash it and store it for future use.

Following this line, you could give MD5 a try:

using System;
using System.Security.Cryptography;
using System.Text;

public static class UrlHasher
{
    public static void Main(string[] args)
    {
        foreach (string url in new string[] {
            "http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg",
            "http://a1.twimg.com/profile_images/58079916/lowres_profilepic.jpg" })
        {
            Console.WriteLine(HashIt(url));
        }
    }

    // Hash only the URL prefix (everything up to the last path segment).
    private static string HashIt(string url)
    {
        Uri path = new Uri(new Uri(url), ".");
        MD5CryptoServiceProvider md5 = new MD5CryptoServiceProvider();
        byte[] data = md5.ComputeHash(
            Encoding.ASCII.GetBytes(path.OriginalString));
        return Convert.ToBase64String(data);
    }
}

You'll get:

rEoztCAXVyy0AP/6H7w3TQ==
0idVyXLs6sCP/XLBXwtCXA==
Rubens Farias
0

It appears that the numerical part of twimg.com URLs is already a unique value for each image. My research indicates that the number is sequential (i.e. the example URL below is for the 433,484,366th profile image ever uploaded - which just happens to be mine). Thus, this number is unique. My solution would be to simply use the numerical part of the filename as the "hash value", with no fear of ever finding a non-unique value.

  • URL: http://a2.twimg.com/profile_images/433484366/terrorbite-industries-256.png
  • Filename: 433484366.terrorbite-industries-256.png
  • Unique ID: 433484366

I already use this system in a Python script that displays notifications for new tweets, and as part of its operation it caches profile image thumbnails to reduce unnecessary downloads.

P.S. It makes no difference which subdomain the image is downloaded from; all images are available from all subdomains.

TerrorBite