
This question relates to Simulating file system access.

I need to choose files and directories at random to act as arguments for file operations like rename, write, read, etc. My plan was to build a list of all files and directories with their paths and select randomly from that list. But as files and directories are created and deleted in the actual file system, the list also has to be kept up to date. I am finding it inefficient to maintain and update the list in this manner, and the updates would also have to be atomic so that a later operation does not access a file that was deleted by a previous operation.

Can you suggest a different way of selecting the files? Maybe some way to do it directly from the file system... but then how would we know the paths to the files?

Thanks

I found something interesting here: Randomly selecting a file from a tree of directories in a completely fair manner, especially in Michael J. Barber's answer, but I am not able to follow it completely due to my ignorance of Python.

Lipika Deka
  • My answer pretty much describes his interim solution. Randomizing the max depth should help. I think the biggest difference with that question is that it involves a subtree of the filesystem. I wouldn't want to constantly walk from the root directory just to be sure that every file had an equal chance of being picked. – Keith Layne Sep 02 '11 at 06:01
  • The operation doesn't have "to be atomic" -- after selecting a file (from the cached list), it can be tested for existence. The other bit (without using various notification systems) can get rather expensive. –  Sep 02 '11 at 06:02
  • @pst ...ahh, did not think of a simple thing like testing for existence. Thanks. Did not get what you meant by "The other bit (without using various notification systems) can get rather expensive" – Lipika Deka Sep 02 '11 at 06:09
  • @Juggler Reading every file on the filesystem ;-) –  Sep 02 '11 at 06:28
  • @pst How come you get the upvote when I had alread said that? :) – Keith Layne Sep 02 '11 at 06:57
  • @keith.layne There, there ;-) –  Sep 02 '11 at 07:48

3 Answers


You don't want to try to maintain a list of files when the filesystem is right there. You should be able to do it right from C. Walk from the root directory, selecting a random entry from it. Pick a random maximum depth, and if you hit a regular file at or before that depth, use it. If it's a directory, repeat, up to the maximum depth. If it's a special file, maybe start over.

This should be pretty quick. The operation shouldn't have to be atomic: if the file's not there when you want to do your operation, try again. It shouldn't be too complicated. You can build the path up as you find your target file. This will be simpler than fussing with the fs directly (I assume you meant at a much lower level) and should be simple to implement.
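
In case it helps, a rough, untested sketch of that walk using plain POSIX `opendir`/`readdir`/`lstat` might look something like this (the function name, the depth handling, and the retry loop are only illustrative, not a finished implementation):

```c
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Write the chosen file's path into `out` and return 0, or return -1 if this
 * attempt dead-ends (empty dir, special file, too deep); the caller retries. */
static int pick_random_file(const char *dir, int depth_left,
                            char *out, size_t outlen)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    /* First pass: count entries other than "." and "..". */
    struct dirent *e;
    int n = 0;
    while ((e = readdir(d)) != NULL)
        if (strcmp(e->d_name, ".") && strcmp(e->d_name, ".."))
            n++;
    if (n == 0) {
        closedir(d);
        return -1;
    }

    /* Second pass: walk to a randomly chosen entry, building the path up. */
    int target = rand() % n, i = 0;
    char picked[PATH_MAX] = "";
    rewinddir(d);
    while ((e = readdir(d)) != NULL) {
        if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
            continue;
        if (i++ == target) {
            snprintf(picked, sizeof picked, "%s/%s", dir, e->d_name);
            break;
        }
    }
    closedir(d);
    if (picked[0] == '\0')
        return -1;                      /* directory changed under us: retry */

    struct stat st;
    if (lstat(picked, &st) != 0)
        return -1;                      /* entry vanished: retry */
    if (S_ISREG(st.st_mode)) {          /* regular file at or before max depth */
        snprintf(out, outlen, "%s", picked);
        return 0;
    }
    if (S_ISDIR(st.st_mode) && depth_left > 0)
        return pick_random_file(picked, depth_left - 1, out, outlen);
    return -1;                          /* special file or out of depth: start over */
}

/* Usage: retry until something is found, with a fresh random max depth each try:
 *
 *     char buf[PATH_MAX];
 *     while (pick_random_file("/some/root", 1 + rand() % 8, buf, sizeof buf) != 0)
 *         ;
 */
```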

Keith Layne
  • I think the distribution will vary greatly depending upon the structure. –  Sep 02 '11 at 06:04
  • @keith.layne How will it be random? I mean when I walk from the root I will always walk the same path going down the same subtree, and hence selecting files from certain subtrees will have a much higher probability than the rest. In fact some files may never be chosen. I may not have understood your approach. – Lipika Deka Sep 02 '11 at 06:14
  • Why would you always go along the same path if you choose from the root at random? Granted, random usually isn't random on a computer. Random means each file has exactly the same chance of getting picked. I'm saying that is too costly a goal. If you're simulating file accesses, then you just need to be random enough to meet your simulation requirements. – Keith Layne Sep 02 '11 at 06:51
  • @pst You're absolutely right...you could maybe modify my algorithm to have a randomized target depth instead of a max depth. That might help. The question is a reasonable upper bound if you don't walk the whole tree. – Keith Layne Sep 02 '11 at 06:52
  • This is not fully random: if `/` contains two directories, `/foo` and `/bar`, containing respectively one file and one million files, then `/foo/alone` will have a probability far greater than `/bar/oneGrainOfSandOnTheBeach`. +1 anyway for directly using the filesystem. – mouviciel Sep 02 '11 at 07:22
  • @mouviciel You're right of course, maybe I could have worded my acknowledgement of that fact better. I was going for small memory use, quick running time, and ease of implementation, with (hopefully) acceptable distribution of hits. Of course, a *fully* random solution would be tough because a mounted fs in use is a moving target. – Keith Layne Sep 02 '11 at 07:29

Here is my proposed solution. It is not the fastest, but it should be quick (after preparation), use only modest memory, and be "fairly well-distributed". This is, of course, 100% untested and somewhat complex (as complex as maintaining an RB-tree or similar, anyway) -- I pity you for having to use C ;-)

  1. For each directory in the target domain, build a directory tree using a depth-first walk of the filesystem and record the "before" file count (files found to date, in the tree) and the "after" file count (the "before" count plus the number of files in the directory). It should not store the files themselves. Fast way to find the number of files gives some example C code. It still requires iterating over the directory contents, but does not need to store the files themselves.

  2. Count up the total number of files in the tree. (This should really just be the "after" count of the last node in the tree.)

  3. Pick a random number in the range [0, max files).

  4. Navigate to the node in the tree such that the "before" file count <= random number < "after" file count. This is just walking down the (RB-)tree structure and is itself O(lg n) or similar.

  5. Pick a random file in the directory associated with the selected node -- make sure to count the directory again and use this as the [0, limit) range for the selection (with a fallback in case of running off the end due to concurrency issues). If the number of files has changed, make sure to update the tree with that information. Also update/fix the tree if the directory has been deleted, etc. (The extra full count here shouldn't be as bad as it sounds, as readdir must already iterate through half the entries in the directory on average. However, the benefit of the re-count, if any, should be explored.)

  6. Repeat steps 2-5 as needed.

Periodically rebuild the entire tree (step #1) to account for filesystem changes. Deleting/adding files will slowly skew the randomness -- step #5 can help to update the tree in certain situations. The frequency of rebuilds should be determined through experimentation. It might also be possible to reduce the introduced error by rebuilding the parent/grandparent nodes or [random] child nodes on each pass, etc. Using the modified time as a fast way to detect changes may also be worth looking into.
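
A very rough, untested sketch of steps 1-5 in C might look like the following, using a flat DFS-ordered array as a stand-in for the (RB-)tree and only standard POSIX calls; all the names in it (`struct node`, `build`, `pick`) are made up for illustration:

```c
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

struct node {
    char path[PATH_MAX];
    long before, after;     /* this directory's own files own indices [before, after) */
};

static struct node *nodes;  /* DFS-ordered array standing in for the (RB-)tree */
static size_t nnodes, cap;
static long total;          /* files found to date during the walk */

/* Step 1: depth-first walk recording per-directory "before"/"after" counts.
 * Files are only counted, never stored. */
static void build(const char *dir)
{
    DIR *d = opendir(dir);
    if (d == NULL) return;

    if (nnodes == cap)
        nodes = realloc(nodes, (cap = cap ? cap * 2 : 64) * sizeof *nodes);
    size_t self = nnodes++;             /* keep an index; realloc may move nodes */
    snprintf(nodes[self].path, sizeof nodes[self].path, "%s", dir);
    nodes[self].before = total;

    struct dirent *e;
    struct stat st;
    char p[PATH_MAX];

    /* First pass: count the regular files directly in this directory. */
    while ((e = readdir(d)) != NULL) {
        if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, "..")) continue;
        snprintf(p, sizeof p, "%s/%s", dir, e->d_name);
        if (lstat(p, &st) == 0 && S_ISREG(st.st_mode))
            total++;
    }
    nodes[self].after = total;

    /* Second pass: recurse into subdirectories (symlinks are skipped by lstat). */
    rewinddir(d);
    while ((e = readdir(d)) != NULL) {
        if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, "..")) continue;
        snprintf(p, sizeof p, "%s/%s", dir, e->d_name);
        if (lstat(p, &st) == 0 && S_ISDIR(st.st_mode))
            build(p);
    }
    closedir(d);
}

/* Steps 3-5: draw a random index, find its directory, re-count it and pick a
 * file; returns -1 so the caller can retry if the directory changed/vanished. */
static int pick(char *out, size_t outlen)
{
    if (total == 0) return -1;
    long r = rand() % total;

    /* Find the node with before <= r < after (binary search over the array). */
    size_t lo = 0, hi = nnodes;
    while (lo < hi) {
        size_t mid = (lo + hi) / 2;
        if (nodes[mid].after <= r) lo = mid + 1; else hi = mid;
    }

    DIR *d = opendir(nodes[lo].path);
    if (d == NULL) return -1;
    struct dirent *e;
    struct stat st;
    char p[PATH_MAX];
    long k = 0;
    while ((e = readdir(d)) != NULL) {   /* step 5: count the directory again */
        snprintf(p, sizeof p, "%s/%s", nodes[lo].path, e->d_name);
        if (lstat(p, &st) == 0 && S_ISREG(st.st_mode))
            k++;
    }
    if (k == 0) { closedir(d); return -1; }

    long want = rand() % k;
    rewinddir(d);
    while ((e = readdir(d)) != NULL) {
        snprintf(p, sizeof p, "%s/%s", nodes[lo].path, e->d_name);
        if (lstat(p, &st) == 0 && S_ISREG(st.st_mode) && want-- == 0) {
            snprintf(out, outlen, "%s", p);
            closedir(d);
            return 0;
        }
    }
    closedir(d);
    return -1;
}
```

Rebuilding (step #1 again) would just amount to resetting `nnodes` and `total` to zero and calling `build()` on the root again.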

Happy coding.

  • On my pretty small VM installation of linux, fsck tells me `/dev/sdb1: 340669/8536064 files`. I can't remember which number is significant, but I'd bet it's the second one. How long to do a DFS on that? Why simulate FS access if it's not on the whole FS? Did I misunderstand, and it's just a subtree? I was just going for *decent* distribution with tiny time and space requirements. – Keith Layne Sep 02 '11 at 07:01
  • @keith.layne The entire tree structure (for whatever root is desired, it might only be required on `/opt` for all I know and it's no different than using that as a root than `/`, just a start of a tree ;-) is read -- but only the structure itself and file counts are kept. It only requires standard POSIX API so it might be relatively slow vs. a more "hands on" approach with the raw filesystem or specific [kernel-level] features. As far as "how long", it depends upon environmental factors (that step is *not cheap* - changed post wording) which is why amortized correction may be beneficial. –  Sep 02 '11 at 07:09
  • @keith.layne For what it is worth, took 30 seconds for WinDirStat to scan 197k files in 38k directories in Windows 7 on an SSD. –  Sep 02 '11 at 07:16
  • @pst Yeah, I was trying to get OP to clarify. I'm tracking the POSIX fs calls (after all he tagged it `linux`), and as I mentioned too I wouldn't mess with the raw filesystem either. To be clear, I agree your algorithm is way better than mine *if randomness is the most important factor* or *if the DFS doesn't take so long to prepare* that you fall asleep waiting for it. Not convinced that amortization (like you said, benefit depends on many env factors) discounts enough to make it practical...if @Juggler gave us a rough file count, it would help. – Keith Layne Sep 02 '11 at 07:22
  • @pst That's pretty darn good...curious how it would do with 10x files on a spinning disk...any chance you can provide that? With load around 1.0? – Keith Layne Sep 02 '11 at 07:24
  • @keith.layne I'm platterless. :-) –  Sep 02 '11 at 07:31
  • @pst Good for you! I would be too if I weren't at the house. – Keith Layne Sep 02 '11 at 07:32

All you need to know is how many files are in each directory in order to pick the directory into which you should traverse. Avoid traversing symbolic links, and do not count files behind symbolic links.

You can use a similar solution to the one pst described.

For example, say you have 3 directories containing 20, 40 and 1000 files respectively. You make the running totals [20, 60, 1060] and pick a random number in [0, 1060). If this number is greater than or equal to 60, you go into the 3rd folder.

You stop traversing once you reach a folder without subfolders.

To find a random file along this path, you can apply the same trick as before.

This way you will pick any file with equal probability.
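
A tiny, untested illustration of that running-total pick, assuming the count kept for each directory covers its whole subtree (which is what makes the final choice uniform over files); `pick_weighted` and the hard-coded numbers are just the example above:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Given running totals, return the index whose slice contains a random draw,
 * so each directory is chosen with probability proportional to its count. */
static int pick_weighted(const long cum[], int n)
{
    long r = rand() % cum[n - 1];   /* random number in [0, total) */
    for (int i = 0; i < n; i++)
        if (r < cum[i])
            return i;               /* r < 20 -> 1st, r < 60 -> 2nd, else 3rd */
    return n - 1;                   /* not reached if cum[] is consistent */
}

int main(void)
{
    srand((unsigned)time(NULL));
    long cum[] = { 20, 60, 1060 }; /* running totals of 20, 40 and 1000 files */
    printf("descend into directory #%d\n", pick_weighted(cum, 3));
    return 0;
}
```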

Luka Rahne