
Suppose I have a very large number of directories (say 100,000) in my filesystem, and inside each directory there is a similar number of directories. Each directory can contain any number of files, but typically not more than a few. This structure goes down to a constant depth of 10.

My question is: is there a difference in time complexity (for the read operation) if I read a file from this directory structure, e.g. /dir-34/dir-215/dir-345/file1 using Paths.get(), compared to reading a file from a simple file system like this:

/dir1
  /dir2
  /dir3
    file1
  /dir4
    file2

Note: This is just a theoretical question; I just want to know whether the number of directories/files in the directory I'm trying to open a file from has any effect on the speed of the read operation.
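
For concreteness, the kind of read I'm comparing would look something like this in Java (the paths are just placeholders matching the two structures above):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DeepVsShallowRead {
    public static void main(String[] args) throws Exception {
        // A file deep inside the huge structure (placeholder names).
        Path deep = Paths.get("/dir-34/dir-215/dir-345/file1");
        // The same kind of read against the small structure.
        Path shallow = Paths.get("/dir1/dir3/file1");

        // Both reads resolve one directory entry per path component;
        // the question is whether the size of each directory matters.
        byte[] a = Files.readAllBytes(deep);
        byte[] b = Files.readAllBytes(shallow);
        System.out.println(a.length + " " + b.length);
    }
}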

Adam Arold
  • It's not clear what you're comparing here. In both cases it sounds like you have a nested directory structure... – Oliver Charlesworth Dec 18 '14 at 17:52
  • Also, by "time complexity" do you mean big-O or something, or are you just talking about "run time"? – Oliver Charlesworth Dec 18 '14 at 17:52
  • 100.000 dirs and each (!) contains 100.000 - and this to a level of 10? googol? – laune Dec 18 '14 at 17:55
  • I'm comparing the loading time of a file from a nested directory structure where there are `10 * 100.000` directories to a directory structure where there are only `10`. I'm interested in the time complexity of the read operation in terms of the number of directories. – Adam Arold Dec 18 '14 at 17:56
  • Then look at the data structures involved. For linux [inode](http://en.wikipedia.org/wiki/Inode)s are a good start. – keyser Dec 18 '14 at 17:58
  • It is just a linear progression. Each extra file adds X amount of overhead. Especially if you are using discovery and not just loading from a list. – markbernard Dec 18 '14 at 17:58
  • This is just a theoretical question I just want to know whether the number of directories/files in the directory I'm trying to open a file from has any effect on the speed of the read operation. – Adam Arold Dec 18 '14 at 17:58
  • It depends on the order of access. You need to access the inode of a directory, the directory's data blocks, and the inode of a file (if small, this points to all data blocks). So it depends on the way you traverse the directory tree. - And your "difference" isn't any difference I can see. – laune Dec 18 '14 at 18:00
  • Reading a directory in *nix systems is cheap unless the number of entries is big - which it isn't in your case. And it most certainly doesn't affect the read time of the file. – laune Dec 18 '14 at 18:02
  • I would wager there would be a dependency on which filesystem you are currently using (some choke with a lot of tiny files/directories, others blaze ahead) – SnakeDoc Dec 18 '14 at 18:05
  • It depends on the filesystem and its options. e.g. Ext[234] has a `dir_index` option, which means that it will use H-trees for storing (filename, inode) tuples in directories. – ninjalj Dec 18 '14 at 19:16
  • the problem is analogous to extracting a value from a map data structure, where the keys are the filenames, and the values the inodes. The maps might be nested (each directory holds a map to its entries), or flat (the root directory maps full pathnames to entries), but you still need to build that correspondence somehow. – didierc Dec 18 '14 at 22:35
  • The question you are therefore asking is: is there a mapping data structure which maps keys to values in constant time, and if so, is that structure usable in the field of file systems? – didierc Dec 18 '14 at 22:36
  • Or, what is the fastest structure which may be implemented there? – didierc Dec 18 '14 at 22:37

3 Answers


Some popular filesystems use more efficient directory data structures than older filesystems. ext4 has directory hashing turned on by default (as pointed out by @ninjalj), as does XFS. This means lookups in a single directory are expected to take O(1) time on average (and therefore a path with a fixed maximum depth resolves in constant time), following the performance of the hash function itself.

Even if you have zillions of files per directory, accessing a single file is very fast -- but only if you have its full path. If you do NOT have the full path and instead have to search a directory for entries matching a pattern, you are faced with O(n) in the number of entries in the directory. That's further exacerbated by the small read size (32k) used by the default system-level directory-reading calls.
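
To make the distinction concrete, here is a minimal Java sketch (the path is the one from the question; the glob pattern is a made-up example). The first read resolves a known full path; the second has to enumerate the directory looking for matches.

import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LookupVsScan {
    public static void main(String[] args) throws Exception {
        // Full path known: one indexed lookup per component, independent
        // of how many siblings each directory contains.
        byte[] data = Files.readAllBytes(Paths.get("/dir-34/dir-215/dir-345/file1"));
        System.out.println(data.length);

        // Full path NOT known: the directory must be enumerated entry by
        // entry, which is O(n) in the number of entries it contains.
        try (DirectoryStream<Path> matches =
                 Files.newDirectoryStream(Paths.get("/dir-34/dir-215/dir-345"), "file*")) {
            for (Path match : matches) {
                System.out.println(match);
            }
        }
    }
}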

(While ext4 directories can have huge numbers of files, they are limited to 64000 subdirectory entries.)

Allen Luce

This assumes the full /path/to/file is available. (Note: the performance and time complexity will still largely depend on the on-disk structures and the underlying file system implementation; e.g. in btrfs everything is a B-tree, while ext4 and XFS use H-trees.)

Therefore, for traversing the directory structure down to the leaf node (the directory which contains the file), the average-case time complexity should be O(log N), while the worst case would be O(N), where N = number of directories in the tree. The worst case is when the Nth directory is created under the (N-1)th, the (N-1)th under the (N-2)th, and so on up to the root directory, forming a single branch in the tree. Ideally you don't have to traverse all the directories in the tree from the root if you have the full path.

Then, if your underlying FS supports directory indexes and hashing, each lookup would require another O(1) to find the file within a directory. Therefore, O(log N) + O(1); ignoring the lower-order terms, it should be just O(log N), where N is the level.
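
As a sketch of the "full path" case mentioned above: resolution touches one directory entry per path component, so the number of lookups is bounded by the depth of the path rather than by the total number of directories in the tree. A small Java illustration (the path is hypothetical, the same one used in the comments):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ComponentWalk {
    public static void main(String[] args) {
        // Resolving this path touches one directory entry per
        // component (A, B, C, D, E, file).
        Path path = Paths.get("/A/B/C/D/E/file");

        Path current = path.getRoot();
        for (Path component : path) {
            current = current.resolve(component);
            // Each step corresponds to one directory (or index) lookup.
            System.out.println(current + " exists: " + Files.exists(current));
        }
    }
}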

askb
  • Given a constant directory depth of 10, the complexity is O(1), not O(log n). The core of this question appears to be the italicized portion at the end concerning the performance of large directories. That case is entirely a function of the underlying directory index, not the structure of the subdirectory tree. – Allen Luce Dec 21 '14 at 02:09
  • @DougLuce Hi, when computing the time complexity it's always good to take the average and worst case, therefore I am considering N = number of directories in the tree, which is a fairly large number. For any tree structure, lookups are always O(logN) and can't be O(1): say we have a path `/A/B/C/D/E/file` for a file; to get to the dirent under `E`, you have to do at least four lookup() calls starting from the root. The issue in the question is largely vague and dependent on the data structures used to implement the FS itself. – askb Dec 21 '14 at 05:34
  • The number of levels is not O(log N) in either average or worst case. It's O(10) maximum, as was made explicit in the question. Also, there are any number of tree structures for which lookups are not O(log N) (linked lists, exponential trees, etc.). – Allen Luce Dec 21 '14 at 09:42
  • Hi Doug, you may be referring to a different data structure; in the above case of the FS implementation it's a B-tree, not a linked list or an exponential tree. [Complexity of B-tree operations](https://secweb.cs.odu.edu/~zeil/cs361/web/website/Lectures/avl/pages/btreecomplexity.html), [wiki](http://en.wikipedia.org/wiki/B%2B_tree#Characteristics). – askb Dec 21 '14 at 13:50

This has changed over time.

Ext4:

  • As of Linux 2.6.23.
  • Up to a certain sub-directory count, it uses hash tables and you get O(1) lookups.
  • Beyond a certain size, it uses B-trees and you get O(log N) performance. Due to the B-tree fan-out, the depth is very low; in real life it's only 2 levels deep at 12 million entries (a rough depth estimate is sketched below the source link).
  • Some places mention that ext4 has a subdirectory limit, although that limit was later removed. I tested the subdirectory count (Dec 2021) and was able to create over 200k entries.
  • Note on overall performance: I found that creating 1,000,000 directories with a small file per directory is quite slow (10 minutes on an AWS i3.24xlarge instance with 15 local NVMe drives) compared to writing one large file with all the data inside it. I'd only use this layout for large data sets where you want O(1)-ish access to a particular partition.

Source: https://en.wikipedia.org/wiki/Ext4
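
As a rough sanity check on the "very low depth" claim above, the index depth grows like ceil(log_fanout(entries)). The fan-out values below are assumptions for illustration only, not ext4's actual numbers:

public class IndexDepthEstimate {
    public static void main(String[] args) {
        long entries = 12_000_000L;   // entries in one directory
        // Assumed index fan-outs; the real value depends on the
        // filesystem, block size and entry size.
        int[] fanouts = {500, 1000, 4000};

        for (int fanout : fanouts) {
            int depth = 0;
            long capacity = 1;
            while (capacity < entries) {
                capacity *= fanout;
                depth++;
            }
            // depth = ceil(log_fanout(entries))
            System.out.println("fanout " + fanout + " -> depth " + depth);
        }
    }
}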

I ran a benchmark with two parent directories containing 100 and 500,000 sub-directories respectively, each sub-directory holding one small parquet file. The average read time was (more or less) the same, confirming that access to entries within a directory is O(1).

import os
import random

import pandas as pd

home = os.path.expanduser("~/")

# 100 sub-directories, each holding one small parquet file.
os.makedirs(home + "tmp2", exist_ok=True)
for i in range(0, 100):       # Hundred.
    df = pd.DataFrame({"customerId": [i, i], "Phone": [440000 + i, 330000 + i * 2]})
    dir_path = home + "tmp2/" + str(i)
    os.mkdir(dir_path)
    df.to_parquet(dir_path + "/data.parquet")

# 500,000 sub-directories, same layout, in a separate parent directory.
os.makedirs(home + "tmp", exist_ok=True)
for i in range(0, 500000):    # 1/2 Million.
    df = pd.DataFrame({"customerId": [i, i], "Phone": [440000 + i, 330000 + i * 2]})
    dir_path = home + "tmp/" + str(i)
    os.mkdir(dir_path)
    df.to_parquet(dir_path + "/data.parquet")

# 602007
# 620180

# Read a random entry from each tree (run in IPython/Jupyter for %timeit):
# %timeit -n 10000 dfl = pd.read_parquet(home + "tmp2/" + str(random.randint(0, 99)))
# 1.82 ms ± 5.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# %timeit -n 10000 dfl = pd.read_parquet(home + "tmp/" + str(random.randint(0, 500000 - 1)))
# 1.89 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Leo Ufimtsev