11

I am getting thousands of pictures uploaded by thousands of users on my Linux server, which is hosted by 1and1.com (I believe they use CentOS, but I am unsure of the version). This is a language-agnostic question; for reference, I am using PHP.

My first thought was to dump them all in the same directory. However, I remember that, a while ago, there was a limit on how many files or directories could be placed in a single directory.

My second thought was to partition the files into directories based on each user's email address (which is what I am using for the user name anyway), but I don't want to run into the limit on the number of directories in a directory.

Anyhow, for images from user@domain.com, I was going to do this:

/images/domain.com/user/images...

Is this a smart thing to do? What if thousands of users are on, say, 'gmail'? Perhaps I could go even deeper, like this:

/images/domain.com/[first letter of user name]/user/images...

so for mike@gmail.com it would be...

/images/domain.com/m/mike/images...

Is this a bad approach? What is everyone else doing? I also don't want to run into problems with too many directories.


Kladskull

6 Answers

28

I would do the following:

  1. Take an MD5 hash of each image as it comes in.
  2. Write that MD5 hash in the database where you are keeping track of these things.
  3. Store them in a directory structure where the first couple of characters of the MD5 hex string are the directory names. So if the hash is 'abcdef1234567890', you would store it as 'a/b/abcdef1234567890'.

Using a hash also lets you merge the same image uploaded multiple times.
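The three steps above can be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the function name `hashed_path` and the `/images` root are assumptions made for the example, and the database write from step 2 is left out.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(data: bytes, root: str = "/images") -> PurePosixPath:
    """Build a storage path from the MD5 hex digest of the image bytes.

    The first two hex characters become two directory levels, so each
    level holds at most 16 subdirectories.
    """
    digest = hashlib.md5(data).hexdigest()  # 32 hex characters
    return PurePosixPath(root, digest[0], digest[1], digest)


# Identical uploads map to the same path, which is what makes merging
# duplicate images possible.
print(hashed_path(b"example image bytes"))
```

Because the path depends only on the file contents, two users uploading the same bytes end up pointing at one stored file.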

Joe Beda
  • A couple of comments: 1) salt your hash with a known value. 2) "Tree-balance" the hash into the folder structure. So take the first (say) five chars of the hash and make it a folder, then the next five, etc. So you never have more than 100,000 folders in any given folder. Use the entire hash in the folder structure in this way. – Steve Midgley Nov 10 '14 at 18:18
  • @SteveMidgley Can you tell me more about "Tree Balancing"? How did you calculate that there won't be more than 100,000 folders in any given folder? – Istiak Tridip Jul 08 '18 at 03:18
  • A tree balanced structure (https://en.wikipedia.org/wiki/B-tree) means that there are equal parts at every level of the structure. The size of the balance should be based on the number of elements your system can handle at any given level (how many folders+files are allowed in a folder?) vs the level of depth your system can handle (how many levels deep of sub-folders is reasonable for your design?). – Steve Midgley Jul 09 '18 at 17:33
  • @IstiakTridip - So your algorithm would take a hash and split it up into equal parts, and the first part (of say 4 digits) would be a folder, and then the next part would be a subfolder, repeating until you're out of hash string. The size of the hash you use would be driven by the above width/depth questions for your application/system/os. – Steve Midgley Jul 09 '18 at 17:44
  • @SteveMidgley 16 ^ 5 is one million ... probably rather branch by 3 or 4 **maximum**. – Antti Haapala -- Слава Україні Nov 13 '18 at 18:39
  • @AnttiHaapala - yes good point. Either convert the hex hash to decimal and truncate at 5, or as you say truncate to 3 or 4 chars of hex values.. – Steve Midgley Nov 14 '18 at 00:16
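
The width/depth trade-off discussed in these comments can be checked with a little arithmetic: a folder name made of `width` hex characters allows at most 16^width siblings. A rough sketch, where the helper names and the choice to keep the full hash as the file name are assumptions for illustration:

```python
def max_subfolders(width: int) -> int:
    """Upper bound on sibling folders when names are `width` hex characters."""
    return 16 ** width


def split_hash_path(hex_digest: str, width: int = 3, levels: int = 2) -> str:
    """Use the first `levels` chunks of `width` hex characters as nested folders."""
    if width * levels > len(hex_digest):
        raise ValueError("hash too short for the requested width and levels")
    parts = [hex_digest[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(parts + [hex_digest])


print(max_subfolders(5))  # 1048576, roughly one million
print(max_subfolders(3))  # 4096
print(split_hash_path("abcdef1234567890"))  # abc/def/abcdef1234567890
```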
4

to expand upon Joe Beda's approach:

  • database
  • database
  • database

If you care about grouping or finding files by user, original filename, upload date, photo-taken-on date (EXIF), etc., store this metadata in a database and use the appropriate queries to pick out the appropriate files.

Use the database primary key, whether a file hash or an autoincrementing number, to locate files among a fixed set of directories. Alternatively, use a fixed maximum number of files N per directory and move on to the next directory when one fills up: the kth photo is stored at {somepath}/aaaaaa/bbbb.jpg, where aaaaaa = floor(k/N) and bbbb = mod(k,N), each formatted as decimal or hex. If that's too flat a hierarchy for you, use something like {somepath}/aa/bb/cc/dd/ee.jpg.
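
As a sketch of the floor(k/N) and mod(k,N) scheme, assuming N = 10000 files per directory, decimal formatting, and a made-up `/srv/photos` root (none of these values come from the answer):

```python
def path_for_index(k: int, per_dir: int = 10000,
                   root: str = "/srv/photos", ext: str = ".jpg") -> str:
    """Map the k-th photo to {root}/aaaaaa/bbbb{ext}.

    aaaaaa = floor(k / N) selects the directory, bbbb = k mod N selects
    the file inside it, so no directory ever holds more than N files.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    directory, slot = divmod(k, per_dir)
    return f"{root}/{directory:06d}/{slot:04d}{ext}"


print(path_for_index(1234567))  # /srv/photos/000123/4567.jpg
```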

Don't expose the directory structure directly to your users. If they are using web browsers to access your server via HTTP, give them a url like www.myserver.com/images/{primary key} and encode the proper filetype in the Content-Type header.
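
One way to picture the decoupling: the public URL carries only the primary key, and a lookup translates it into whatever storage path is currently in use. In this sketch the dictionary `PHOTO_INDEX` is a hypothetical stand-in for the database table, and `resolve_photo` is an illustrative name, not part of any framework.

```python
import mimetypes

# Hypothetical stand-in for a database table: primary key -> stored path.
PHOTO_INDEX = {42: "d/4/d41d8cd98f00b204e9800998ecf8427e.jpg"}


def resolve_photo(key: int, index: dict = PHOTO_INDEX) -> tuple:
    """Return (stored path, Content-Type) for a request to /images/{key}."""
    stored_path = index[key]  # raises KeyError for unknown keys
    content_type = mimetypes.guess_type(stored_path)[0] or "application/octet-stream"
    return stored_path, content_type


print(resolve_photo(42))
```

If the storage layout changes later, only the stored paths in the index change; the URLs handed to users stay the same.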

Jason S
  • all the images will be below the root of the web folder, so that they can not access them without using our function to retrieve them. – Kladskull May 23 '09 at 16:42
  • Still, if the structure they access them through is coupled to the structure you store them in, you can't change one without changing the URL. If you decouple them, you can change the storage structure later if necessary. – Jason S May 23 '09 at 21:27
3

What I used for another requirement but which can fit your needs is to use a simple convention.

Increment the last ID by 1, take the number of digits in the new ID, and prefix the ID with that digit count.

For example:

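Assume $a is a variable holding the last ID (PHP shown here):

$a = 564;
++$a;                          // 565
$prefix = strlen((string) $a); // 3, the number of digits
$id = $prefix . $a;            // "3565" (string concatenation, not addition)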

Then you can use a date stamp for the directory, using this convention:

20090523 (yyyymmdd)

Then you can split it into a path like this:

2009/05/23/3565.jpg

(or more)

This is interesting because you keep the files sorted by date and by number at the same time (sometimes useful), and you can still break the path down into more directories.
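
A small sketch of this convention, assuming year/month/day directory ordering and a `.jpg` extension; the function names are made up for the example. The digit-count prefix is what keeps plain string sorting in line with numeric order (for IDs of up to nine digits), since "299" sorts before "3100" whereas "99" would sort after "100".

```python
from datetime import date


def prefixed_id(last_id: int) -> str:
    """Increment the last ID and prefix it with its own digit count."""
    new_id = last_id + 1
    return f"{len(str(new_id))}{new_id}"


def dated_path(last_id: int, day: date, ext: str = ".jpg") -> str:
    """Combine a year/month/day directory with the prefixed ID."""
    return f"{day:%Y/%m/%d}/{prefixed_id(last_id)}{ext}"


print(prefixed_id(564))                      # 3565
print(dated_path(564, date(2009, 5, 23)))    # 2009/05/23/3565.jpg
```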

rybo111
Boris Guéry
3

Here are two functions I wrote a while back for exactly this situation. They've been in use for over a year on a site with thousands of members, each of whom has lots of files.

In essence, the idea is to use the last digits of each member's unique database ID to calculate a directory structure, with a unique directory for everyone. Using the last digits, rather than the first, ensures a more even spread of directories. A separate directory for each member means maintenance tasks are a lot simpler, plus you can see where people's stuff is (as in visually).

// checks for member-directories & creates them if required
function member_dirs($user_id) {

    $user_id = sanitize_var($user_id);

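    // substr()'s third argument is a length, not an end position, so each
    // call below returns everything from its start offset to the end of the ID.
    $last_pos = strlen($user_id);
    $dir_1_pos = $last_pos - 1;
    $dir_2_pos = $last_pos - 2;
    $dir_3_pos = $last_pos - 3;

    $dir_1 = substr($user_id, $dir_1_pos, $last_pos); // last digit
    $dir_2 = substr($user_id, $dir_2_pos, $last_pos); // last two digits
    $dir_3 = substr($user_id, $dir_3_pos, $last_pos); // last three digits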

    $user_dir[0] = $GLOBALS['site_path'] . "files/members/" . $dir_1 . "/";
    $user_dir[1] = $user_dir[0] . $dir_2 . "/";
    $user_dir[2] = $user_dir[1] . $dir_3 . "/";
    $user_dir[3] = $user_dir[2] . $user_id . "/";
    $user_dir[4] = $user_dir[3] . "sml/";
    $user_dir[5] = $user_dir[3] . "lrg/";

    foreach ($user_dir as $this_dir) {
        if (!is_dir($this_dir)) { // directory doesn't exist
            if (!mkdir($this_dir, 0777)) { // attempt to make it with read, write, execute permissions
                return false; // bug out if it can't be created
            }
        }
    }

    // if we've got to here all directories exist or have been created so all good
    return true;

}

// accompanying function to above
function make_path_from_id($user_id) {

    $user_id = sanitize_var($user_id);

    $last_pos = strlen($user_id);
    $dir_1_pos = $last_pos - 1;
    $dir_2_pos = $last_pos - 2;
    $dir_3_pos = $last_pos - 3;

    $dir_1 = substr($user_id, $dir_1_pos, $last_pos);
    $dir_2 = substr($user_id, $dir_2_pos, $last_pos);
    $dir_3 = substr($user_id, $dir_3_pos, $last_pos);

    $user_path = "files/members/" . $dir_1 . "/" . $dir_2 . "/" . $dir_3 . "/" . $user_id . "/";
    return $user_path;

}

sanitize_var() is a supporting function for scrubbing input & ensuring it's numeric, $GLOBALS['site_path'] is the absolute path for the server. Hopefully, they'll be self-explanatory otherwise.
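
For readers not using PHP, the same last-digits layout can be sketched in Python. This is an illustration under assumptions, not a port that has been run against the original site: the function names are invented, and the handling of IDs shorter than three digits (which simply reuse the whole ID) may differ from the PHP version.

```python
import os


def member_path(user_id: int) -> str:
    """Relative directory for a member, keyed on the last digits of the ID."""
    uid = int(user_id)
    if uid < 0:
        raise ValueError("user_id must be non-negative")
    s = str(uid)
    # Last one, two, and three digits give the three nesting levels.
    return f"files/members/{s[-1:]}/{s[-2:]}/{s[-3:]}/{s}/"


def ensure_member_dirs(site_path: str, user_id: int) -> str:
    """Create the member directory plus sml/ and lrg/ under site_path."""
    base = os.path.join(site_path, member_path(user_id))
    for sub in ("sml", "lrg"):
        os.makedirs(os.path.join(base, sub), exist_ok=True)
    return base


print(member_path(12345))  # files/members/5/45/345/12345/
```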

da5id
2

Joe Beda's answer is almost perfect, but note that MD5 collisions have been demonstrated in, IIRC, about 2 hours on a laptop.

That said, if you use the file's MD5 hash in the way described, your service becomes vulnerable to attack. What would the attack look like?

  1. A hacker doesn't like a particular photo.
  2. He confirms that you are using plain MD5 (an MD5 of image+secret_string might scare him off).
  3. He uses some collision technique to craft a picture of (use your imagination here) whose hash matches that of the photo he doesn't like.
  4. He uploads the crafted picture as he normally would.
  5. Your service overwrites the old photo with the new one and serves the new one in both places.

Someone will say: let's not overwrite it, then. But if it's possible to predict that someone will upload a given file (e.g. a popular picture from the web), an attacker can take its "hash place" first. A user who uploads a picture of a kitty would then find that it actually appears as (use your imagination here). I say: use SHA1, which IIRC has been estimated to take 127 years to break on a cluster of 10,000 computers.
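
The "image+secret_string" idea from step 2 can be sketched with a keyed hash. Note the substitution: this example uses HMAC-SHA-256 from Python's standard library rather than the MD5-plus-secret or SHA1 mentioned above, and the function name and secrets are placeholders.

```python
import hashlib
import hmac


def keyed_name(data: bytes, secret: bytes) -> str:
    """Derive a storage name an outsider cannot precompute without the secret."""
    return hmac.new(secret, data, hashlib.sha256).hexdigest()


name = keyed_name(b"kitty picture bytes", b"server-side secret")
print(name)  # 64 hex characters; depends on both the data and the secret
```

To avoid the overwrite in step 5 as well, the file can be opened with mode `"xb"`, which raises `FileExistsError` instead of replacing an existing file.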

Paweł Polewicz
  • you're talking about a preimage attack, which hasn't been successful yet against MD5, only collision attacks http://www.vpnc.org/hash.html – Jason S May 23 '09 at 21:29
  • http://en.wikipedia.org/wiki/MD5 : "On 1 March 2005, Arjen Lenstra, Xiaoyun Wang, and Benne de Weger demonstrated construction of two X.509 certificates with different public keys and the same MD5 hash, a demonstrably practical collision." (...) – Paweł Polewicz May 23 '09 at 22:28
0

I might be late to the game on this, but one solution (if it fits your use case) could be file name hashing. It is a way to create an easily reproducible file path from the name of the file while also creating a well-distributed directory structure. For example, you can use the bytes of the filename's hash code as its path:

String fileName = "cat.gif";
int hash = fileName.hashCode();
int mask = 255;
int firstDir = hash & mask;
int secondDir = (hash >> 8) & mask;

This would result in the path being:

/172/029/cat.gif

You can then find cat.gif in the directory structure by reproducing the algorithm.

Using HEX as the directory names would be as easy as converting the int values:

String path = new StringBuilder(File.separator)
        .append(String.format("%02X", firstDir))   // uppercase hex, e.g. AC
        .append(File.separator)
        .append(String.format("%02X", secondDir))  // uppercase hex, e.g. 1D
        .append(File.separator)
        .append(fileName)
        .toString();

Resulting in:

/AC/1D/cat.gif

I wrote an article about this a few years ago and recently moved it to Medium. It has a few more details and some sample code: File Name Hashing: Creating a Hashed Directory Structure. Hope this helps!
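
To check the numbers in this answer outside the JVM, Java's `String.hashCode()` can be reproduced in a few lines of Python. This sketch assumes file names made of characters in the Basic Multilingual Plane (Java hashes UTF-16 code units), and the helper names are invented for the example.

```python
def java_string_hash(s: str) -> int:
    """Java's String.hashCode() as an unsigned 32-bit value."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # wrap like a 32-bit int
    return h


def hashed_dirs(file_name: str) -> tuple:
    """Lowest byte and second-lowest byte of the hash, as in the Java code."""
    h = java_string_hash(file_name)
    return h & 0xFF, (h >> 8) & 0xFF


def decimal_path(file_name: str) -> str:
    first, second = hashed_dirs(file_name)
    return f"/{first:03d}/{second:03d}/{file_name}"


def hex_path(file_name: str) -> str:
    first, second = hashed_dirs(file_name)
    return f"/{first:02X}/{second:02X}/{file_name}"


print(decimal_path("cat.gif"))  # /172/029/cat.gif
print(hex_path("cat.gif"))      # /AC/1D/cat.gif
```

Each level has at most 256 entries, so two levels spread files over up to 65,536 directories.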

Michael Andrews