I'm downloading a lot of files whose URLs are listed in a text file.

When saving a file to disk, I use the MD5 checksum of its URL as the new filename. This is to avoid file name conflicts and invalid characters in the original file name.

But I also need a way to find the original URL from a downloaded file's name. If I use MD5, I'll have to keep a mapping table, and it would be huge.

Is there any algorithm I can use instead that allows me to simply decode the original URL from the file name?

Note that I also don't want the lengths of the file names to vary too much.

satoru
  • To avoid invalid characters, escape them as their hex values, e.g. `/ -> %2F, % -> %%`. Hashing and encrypting won't help you avoid name conflicts. Instead, you could add a serial number to each file name. – n. m. could be an AI Jul 06 '16 at 05:44
  • Why do you need a fixed-length string as a result? – zerkms Jul 06 '16 at 05:49
  • Do you need the url to be a secret? – printfmyname Jul 06 '16 at 05:52
  • @printfmyname No, I just want the file name to contain only valid characters, and each URL should result in a unique name. – satoru Jul 06 '16 at 05:58
  • @zerkms You may consider that as a "nice-to-have" feature. – satoru Jul 06 '16 at 06:04
  • You already have the list of URLs in a text file. So couldn't you just prefix your filenames with the position of the original URL in this file? 0000-_HASH0_, 0001-_HASH1_, etc. The prefix also guarantees uniqueness in the case of a hash collision (even if it's very unlikely). Not sure if you still need the hash at all if you're using a counter, though. – Arnauld Jul 06 '16 at 06:26
  • @Arnauld Yes, I can use that. But the problem is that I may have a large amount of links in a big file, I don't want to have to scan the file to find the URL for file No.123123112 – satoru Jul 06 '16 at 06:30
  • That makes sense. I'd vote for base62 too, granted that your URLs are not too long. For long URLs, you may consider using something like lzbase62. You can test it online [here](http://polygonplanet.github.io/lzbase62/demo/) – Arnauld Jul 06 '16 at 06:35
  • @satoru given that URL lengths are not restricted by anything - what "the same" length are you expecting to have? – zerkms Jul 06 '16 at 06:44
  • @satoru Apparently, there is also a 'known maximum length' for these specific URLs. So, what is this maximum length? – Arnauld Jul 06 '16 at 07:25
  • 1
    So why not use the *offset* of the URL in the text file? Then you can just seek to it if you want to decode the URL. – Niklas B. Jul 06 '16 at 08:20
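
A minimal sketch of the offset idea from the last comment, assuming the list is a plain text file (called `urls.txt` here as a placeholder) with one URL per line. Zero-padding the offset also keeps the name lengths nearly constant:

```python
def filenames_by_offset(list_path):
    """Yield (filename, url) pairs, naming each file after the byte
    offset of its URL line in the list file."""
    with open(list_path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:  # end of file
                break
            url = line.rstrip(b"\r\n").decode("utf-8")
            if url:
                # Zero-pad so names sort naturally and stay the same length.
                yield "%012d" % offset, url

def url_from_filename(list_path, name):
    """Recover the original URL by seeking straight to the stored offset."""
    with open(list_path, "rb") as f:
        f.seek(int(name))
        return f.readline().rstrip(b"\r\n").decode("utf-8")

# Usage sketch (`download` is hypothetical):
# for name, url in filenames_by_offset("urls.txt"):
#     download(url, to=name)
# url_from_filename("urls.txt", "000000000042")
```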

2 Answers

You can use base62, which is file-system friendly and can be encoded and decoded. But you can't avoid file name collisions. If you want to avoid them too, you could append an MD5 of the file to the encoded filename, and remove the MD5 when decoding.
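
One possible sketch of that scheme in Python (the alphabet order here is an assumption; any fixed 62-character alphabet works). Note the encoded name is reversible but not fixed-length: it grows to roughly 1.35 characters per URL byte:

```python
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 chars

def base62_encode(url):
    """Encode the URL's UTF-8 bytes as one big integer in base 62."""
    n = int.from_bytes(url.encode("utf-8"), "big")
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def base62_decode(name):
    """Reverse of base62_encode; safe because URLs never start with a NUL byte."""
    n = 0
    for ch in name:
        n = n * 62 + ALPHABET.index(ch)
    return n.to_bytes((n.bit_length() + 7) // 8, "big").decode("utf-8")
```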

xdevs23

If you want a generic solution, look for short-string compression algorithms. Here's a previously answered question about it: An efficient compression algorithm for short text strings. There's no way to guarantee that you get equal-length strings, because some of them will compress better than others.
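
As one hedged illustration of the general idea (not taken from the linked question): deflate the URL with zlib and wrap the result in URL-safe base64, which is also safe for most filesystems. Beware that zlib's fixed header overhead means very short URLs can come out *longer* than they went in:

```python
import base64
import zlib

def compress_name(url):
    """Deflate the URL, then encode with URL-safe base64 ('-' and '_'
    instead of '+' and '/').  The '=' padding is stripped and can be
    restored from the length when decoding."""
    packed = zlib.compress(url.encode("utf-8"), 9)
    return base64.urlsafe_b64encode(packed).decode("ascii").rstrip("=")

def decompress_name(name):
    """Reverse of compress_name (re-add the stripped base64 padding)."""
    padded = name + "=" * (-len(name) % 4)
    return zlib.decompress(base64.urlsafe_b64decode(padded)).decode("utf-8")
```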

Since you are dealing only with HTML, you can use the file itself to store some data. For example, you can simply put the original URL in front of the leading HTML tag or after the closing HTML tag, or add a special tag or attribute that stores this information. Then you can keep MD5 as the file name, but when you need the URL you open the file and look for it there. This lets you store the data without affecting any use of the file and without having to keep a large mapping table.
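
A minimal sketch of the trailing-comment variant, assuming a made-up marker string (`original-url:`, not any standard) and URLs that never contain `-->`:

```python
import hashlib
import os

MARKER = "<!-- original-url: "  # hypothetical marker; any unique string works

def save_with_embedded_url(url, html, directory="."):
    """Save the page under its MD5 name, appending the source URL
    as a trailing HTML comment."""
    name = hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
        f.write("\n%s%s -->\n" % (MARKER, url))
    return path

def url_from_file(path):
    """Recover the URL by scanning the file for the marker comment."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith(MARKER) and line.endswith("-->"):
                return line[len(MARKER):-len("-->")].strip()
    raise ValueError("no embedded URL found in %s" % path)
```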

Sorin