3

I'm in need of storing a large number, millions, of files on disk. I want to use a sharding directory structure so there is no more than a thousand files in a directory. If I use 3 directories deep, I can get a billion files (1000^3).

My math is a little rusty and I'm not sure the correct formula to figure out what directory a file would land in given an integer key for the file.

For example, file '0010.pdf' would land in the directory '0000\0000\0000\0010.pdf'. File '2010.pdf' would go in '0000\0000\0002\0010.pdf'. So the structure is '{level 1}{level 2}{level 3}{file}'.

How do I mathematically figure out the various levels? I'm looking for a formula. C# code would be even better, but I can derive that from the formula if need be.

Edit

I converted the answer below to a c# function.

public static string Shard(long key, string extension, int maxFiles = 1000, int depth = 3)
{
    var parts = new List<string>();
    long current = key;

    for (int i = depth; i > 0; i--)
    {
        long q = Convert.ToInt64(Math.Pow(maxFiles, i));
        long level = current / q;

        parts.Add(string.Format("{0:0000}", level));

        current = current % q;
    }

    parts.Add(string.Format("{0:0000}{1}", current, extension));

    string separator = Path.DirectorySeparatorChar.ToString(CultureInfo.InvariantCulture);
    string path = string.Join(separator, parts);

    return path;
}
Paul Welter
  • 122
  • 1
  • 7
  • Are you planning to store too many files on the disk? NTFS performance will reduce drastically. Sharding will only help on ReFS disk formate. – Akash Kava Nov 22 '13 at 19:00
  • My understanding is that you can avoid performance issues by sharding into sub folders. [link](http://stackoverflow.com/questions/197162/ntfs-performance-and-large-volumes-of-files-and-directories) – Paul Welter Nov 22 '13 at 19:45
  • 1
    http://technet.microsoft.com/en-us/library/cc781134.aspx, NTFS stores all file attributes in one MFT structure irrespective of directory structure. This is the whole point why MS invested in ReFS which has hierarchical MFT where each directory has its own children table. http://blogs.msdn.com/b/b8/archive/2012/01/16/building-the-next-generation-file-system-for-windows-refs.aspx , however till 100,000 files you will not face any issue. – Akash Kava Nov 22 '13 at 20:19
  • 2
    There are more reasons to shard then if your filesystem will support it. You may need it for backup software limits, preventing filesystem events from getting overloaded, issues with file preview generation, mirroring to a more limited filesystem, ability to browse to files reasonably in Explorer or Finder, ability to share the folder via SMB, support for libraries that have trouble iterating over an extremely large number of files, porting to another operating system, etc. I would always shard large collection of files to no more then 5000 files each folder to avoid some of these issues. – Eric Nov 30 '17 at 02:55

3 Answers3

2

Divide by 1000^3 = 1000000000 (mod by 1000 - does nothing) to get the first-level directory.

Divide by 1000^2 = 1000000, mod by 1000, to get the second-level directory.

Divide by 1000, mod by 1000, to get the third-level directory.

Mod by 1000 to get the file.

Note how this can essentially just be done with a for-loop from 1000^3, dividing by 1000 at every step.

Example:

Input: 123456789012

123456789012 / 1000000000     = 123
123456789012 / 1000000 % 1000 = 456
123456789012 / 1000 % 1000    = 789
123456789012 % 1000           = 012

Directory / file: 0123/0456/0789/0012

Or, doing it iteratively:
(removing the % 1000 and modifying the number and modding on the previous step instead)

Input: 123456789012

123456789012 / 1000000000 = 123
123456789012 % 1000000000 = 456789012

456789012    / 1000000    = 456
456789012    % 1000000    = 789012

789012       / 1000       = 789
789012       % 1000       = 012

Taking the result of every division, and the final mod result:

Directory / file: 0123/0456/0789/0012

Additional note:

You can probably get rid of one of the digits in each level of your structure - since you only have 0-999, there's no point to having 4 digits.

Bernhard Barker
  • 54,589
  • 14
  • 104
  • 138
2

Since you want a string, treat it as a string:

private string MakePath(Int32 key)
{
    // make 9-digit string, pad left with 0
    string s = n.ToString().PadLeft(9, '0');

    // insert backslashes
    return s.Substring(0, 3) + "\\" + 
           s.Substring(3, 3) + "\\" + 
           s.Substring(6, 3);
}

There are more elegant ways of coding this, of course.

egrunin
  • 24,650
  • 8
  • 50
  • 93
0

You are describing a 3 level deep hash. The most obvious way to implement this is to construct 3 different hashing algorithms each of which takes a string and returns a unique number from 0 to 999 at each level.

Depending on how large and how evenly distributed the integer values are for each file, you could simply use a trivial hash if the integer values for each file are unique and less than a billion.

http://en.wikipedia.org/wiki/Hash_function

If you are asking how to get 0123 from 0123,993,456 simply do an integer divide by 1,000,000.

You get 993 by taking the mod 1,0000,000 and then integer divide by 1000, etc.

pry
[1] pry(main)> foo = 123993456
 => 123993456
[2] pry(main)> foo / 1000000
 => 123
[3] pry(main)> foo % 1000000
 => 993456
[4] pry(main)> foo % 1000000 / 1000
 => 993
[5] pry(main)> foo % 1000
=> 456