0

My app generates thousands of files, and I'd like to evenly distribute them into some dirs.

How the files are divided should be predictable, as I have a client which requests them to the server. I.e. if I have file "100002.xml", I need to know under which dir it is located.

The files have names like ids in a database, like 1.xml, 2.xml, 1000000.xml and so on. There can be big holes between numbers, so I can have files 1-1000 and then 100000-199999

Last time I had many files beginning with 1, so creating dirs like 0-9 doesn't work, because almost all file would go into the "1" dir.

I cannot think to a method to evenly distribute the files, how could I do?

I can also accept to have dirs with no more than n files.

I'm able to make a script to split the files into multiple dirs but not in a way that is predictable. I'd like to create as fewer dirs as possible.

Edit: my client cannot search nor I cannot have a script on the server to handle the requests: I have a javascript method which get a file from an Apache server and I cannot have any script to handle the requests

Edit 2: I think that my question really is: what hash function can I use to map integers to a uniform distribution of integers even if the source integers are not uniformly distributed?

cdarwin
  • 4,141
  • 9
  • 42
  • 66
  • `As fewer dirs as possible` maybe you can do something similar to merge sort? – nice_dev Nov 06 '18 at 18:29
  • I don't uderstand your suggestion: I can split recursively the files until I've done, but how can then know at the client side under which dir a file has gone? Can you make an example? – cdarwin Nov 06 '18 at 18:38
  • Should files be stored in some dir[s] while your app is running? Or you know in advance how many files and their names? – Ripi2 Nov 06 '18 at 18:52
  • @Ripi2 The app should in theory store the files while generating them. I can know how many files I'll have, but not their names in advance. However, now that you ask me, I can split the files even after having generated all of them, so I can also know their names. – cdarwin Nov 06 '18 at 18:59
  • Then put all names in a sorted array and slice it into the number of dirs you wish. – Ripi2 Nov 06 '18 at 19:01
  • @cdarwin Yes, like [this](https://imgur.com/a/P1koI4b) .You can have dir names as `lowest_id_value-highest_id_value`. Now, if client wants to get the dir name of a file, he can just move left and right of the tree to get those.The moment he realizes that any dir name on either side of the tree doesn't come under that range, he can stop search in that direction.Technically, every directory will have only 2 sub dirs (at the next highest level).You don't need to create a tree,just the idea.The tree will always be balanced giving you a performance of log(n) on queries & minimum directories to make – nice_dev Nov 06 '18 at 19:03
  • @Ripi2 But, given the name of a file, how can I then know under which dir it is? – cdarwin Nov 06 '18 at 19:05
  • You need another app which allow this file-where query. Keep the sorted array in memory. Look for the file name in this array, get the index. A simple formula gives you the number of the slice. You may have another array with slices-dirs relations. – Ripi2 Nov 06 '18 at 19:11
  • @cdarwin Just a correction. Total directories made in this case will always be `N - 1` where `N` is total no. of files. However, you could put 10-100 different files in a single directory and set the range of the higher folder as it's name and follow the same logic as mentioned above to reduce the directories to make. – nice_dev Nov 06 '18 at 19:11
  • @vivek_23 and Ripi2 the problem is that I need something like an hash function which given the name gives me the dir. I have a javascript method which requests a file to Apache knowing only its name, and on the apache side I can have no script to handle the request – cdarwin Nov 06 '18 at 19:13
  • @cdarwin I explained you how to query for the directory given the file name in my previous comments. – nice_dev Nov 06 '18 at 19:14
  • @vivek_23 you said the client should search, so, in my case, I imagine the client should issue more than one request to locate the file, correct? This should be avoided – cdarwin Nov 06 '18 at 19:17
  • @cdarwin Client will make only 1 request. It's the server's job to get the directory and return it to the client. – nice_dev Nov 06 '18 at 19:18

1 Answers1

1

How about using the modulo operation? That's a very primitive way to do a hash function. Imagine you have n files and directories with at most m files (and n>m). Assume the file id i, i % (n/m) will give you which of the n/m directories you will store.

  • You're right, I think now that what I really need is a hash function. But how uniform is your function? I'm looking at [this answer](https://stackoverflow.com/questions/664014/what-integer-hash-function-are-good-that-accepts-an-integer-hash-key) – cdarwin Nov 06 '18 at 21:41
  • It really depends on the ids of your data. Let's say you do something simple. You create dirs 0-9 and you use a `id % 10` to assign the file with id `id` to the dir. If you use many continuous ids they will all get uniformity distributed. Assume ids 121-190.121 will go to 1, 122 will go to 2, 123->3... 130->0, 131->1... If you start then another "block" like 209-300, 209->9, 210->0.... As you can see they are pretty uniform. – Jorge Lopez Nov 06 '18 at 22:17