29

Some time ago I learned that rsync deletes files much faster than many other tools.

A few days ago I came across this wonderful answer on Serverfault which explains why rsync is so good at deleting files.

Quotation from that answer:

I revisited this today, because most filesystems store their directory structures in a btree format, the order of which you delete files is also important. One needs to avoid rebalancing the btree when you perform the unlink. As such I added a sort before deletes occur.

Could you explain how removing files in-order prevents or reduces the number of btree rebalancings?


I expect the answer to show how deleting in order increases deletion speed, with details of what happens at the btree level. The people who wrote rsync and other programs (see links in the question) used this knowledge to create better programs. I think it's important for other programmers to have this understanding so they can write better software.

ovgolovin
  • Your rsync link no longer works. [This archive link](https://web.archive.org/web/20130929001850/http://linuxnote.net/jianingy/en/linux/a-fast-way-to-remove-huge-number-of-files.html) works; I found it on [this related question](https://unix.stackexchange.com/questions/37329/efficiently-delete-large-directory-containing-thousands-of-files). – nh2 Dec 31 '17 at 20:00
  • I have now benchmarked both `rm -r` and `rsync`. They both show O(n²) performance for deleting dirs containing n files, on ext4 and xfs. A [coreutils patch from 2008](http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=24412edeaf556a) claimed that this was fixed and should have O(n) performance. I filed a new bug about that [here](http://lists.gnu.org/archive/html/bug-coreutils/2017-12/msg00054.html). – nh2 Jan 01 '18 at 02:48

5 Answers

12

It is not important, nor is it a b-tree issue. It is just a coincidence.

First of all, this is very much implementation dependent and very much ext3 specific. That's why I said it's not important (for general use). Otherwise, put the ext3 tag or edit the summary line.

Second of all, ext3 does not use a b-tree for the directory entry index. It uses an HTree. The HTree is similar to a b-tree, but different, and it does not require balancing. Search for "htree" in fs/ext3/dir.c.

Because of the htree-based index, a) ext3 has faster lookups compared to ext2, but b) readdir() returns entries in hash value order. The hash value order is random relative to file creation time or the physical layout of the data. As we all know, random access is much slower than sequential access on rotating media.

A paper on ext3 published for OLS 2005 by Mingming Cao, et al. suggests (emphasis mine):

to sort the directory entries returned by readdir() by inode number.
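
A minimal sketch of that suggestion (my own illustration, not rsync's or the kernel's code; the helper name `unlink_by_inode` is made up, and it assumes the directory contains only regular files): collect the entries in whatever order the kernel hands them back, then process them sorted by inode number.

```python
# Illustration of the OLS 2005 suggestion: process directory entries sorted
# by inode number instead of raw readdir() (hash) order.
import os
import sys

def unlink_by_inode(directory):
    # os.scandir() yields entries in readdir() order; DirEntry.inode()
    # exposes d_ino without an extra stat() on Linux.
    entries = list(os.scandir(directory))
    entries.sort(key=lambda e: e.inode())   # sort by inode number
    for entry in entries:
        os.unlink(entry.path)

if __name__ == "__main__":
    unlink_by_inode(sys.argv[1])
```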

Now, onto rsync. Rsync sorts files by file name. See flist.c::fsort(), flist.c::file_compare(), and flist.c::f_name_cmp().

I did not test the following hypothesis because I do not have the data sets from which @MIfe got 43 seconds, but I assume that sorted-by-name order was much closer to the optimal order compared to the random order returned by readdir(). That is why you saw a much faster result with rsync on ext3. What if you generate 1000000 files with random file names and then delete them with rsync? Do you see the same result?
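
If you want to try exactly that experiment, here is a rough, hedged benchmark sketch (the file count, temp paths, and timing method are arbitrary choices of mine, and results will vary widely with filesystem, kernel, and cache state):

```python
# Create many files with random names, then time deletion in readdir()
# order vs. sorted-by-name order (roughly what rsync does).
import os
import tempfile
import time
import uuid

FILE_COUNT = 100_000   # arbitrary; large enough to stress the directory index

def make_files(directory):
    for _ in range(FILE_COUNT):
        open(os.path.join(directory, uuid.uuid4().hex), "w").close()

def delete_all(directory, sort_by_name):
    names = os.listdir(directory)       # readdir() (hash) order on ext3
    if sort_by_name:
        names.sort()                    # rsync-like name order
    start = time.monotonic()
    for name in names:
        os.unlink(os.path.join(directory, name))
    return time.monotonic() - start

for sort_by_name in (False, True):
    with tempfile.TemporaryDirectory() as d:
        make_files(d)
        label = "sorted by name" if sort_by_name else "readdir order"
        print(label, round(delete_all(d, sort_by_name), 2), "s")
```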

Yasushi Shoji
  • Thank you! I can't grasp one thing. After sorting we still delete the files one by one, so this is still random access (just guessing). How can the filesystem take advantage of getting files for deletion in order, if those files are sent to it for deletion separately? – ovgolovin Aug 19 '13 at 05:51
  • I'd say it's sequential, because the Linux block layer merges close-by requests from the filesystem layer before hitting the rotating media. As you know, unlink(2) does not wipe the whole file but marks it "deleted" in the file system. Those marks, including block pointers, are very close to each other. The point is that because we are talking about a very, very large directory, there is a huge number of those marks, which takes a lot of space. Accessing that space randomly gives the block layer no chance to merge the requests and causes major overhead compared to sequential, merged access. – Yasushi Shoji Aug 19 '13 at 08:22
  • If I understand things right, the OS should first remove the file from the directory, and then mark its inode as empty. This should lead to moving the HDD head between the directory and inode structures for a single file. Will Linux be able to batch deletions of several files in such a way as to take advantage of sequential inode ordering (it seems that it should first get done with the directories and then update the inodes in a batch for all those files)? – ovgolovin Aug 20 '13 at 05:27
  • Not that I know of. The Unix/POSIX/Linux system call is, I suppose, designed to be simple. Deleting a dir does not always delete the files in it; there is a thing called a "hard link". So, even if you had such a syscall, it would have to check inodes one by one anyway. – Yasushi Shoji Aug 20 '13 at 10:38
  • You are right, this is a coincidence: when testing, I created files in lexical order (file1, file2, file3), so the inodes will naturally be in the same order. – Matthew Ife Feb 05 '14 at 10:13
9

Let's assume that the answer you posted is correct, and that the given file system does indeed store things in a balanced tree. Balancing a tree is a very expensive operation. Keeping a tree "partially" balanced is pretty simple: when you allow a tree to be slightly imbalanced, you only worry about moving things around the point of insertion/deletion. However, when talking about completely balanced trees, when you remove a given node you may find that suddenly the children of this node belong on the complete opposite side of the tree, or that a child node on the opposite side has become the root node and all of its children need to be rotated up the tree. This requires you to do either a long series of rotations, or to place all the items into an array and re-create the tree.

            5
    3               7
2       4       6       8

Now remove the 7, easy right?

            5
    3               8
2       4       6       

Now remove the 6, still easy, yes...?

            5
    3               8
2       4       

Now remove the 8, uh oh

            5
    3               
2       4

Getting this tree to be the proper balanced form like:

        4
    3       5
2

Is quite expensive, compared at least to the other removals we have done, and gets exponentially worse as the depth of our tree increases. We could make this go much (exponentially) faster by removing the 2 and the 4 before removing the 8, particularly if our tree was more than 3 levels deep.

Without sorting, removal is on average O(K * log_I(N)^2), with N representing the total number of elements, K the number to be removed, and I the number of children a given node is permitted; log_I(N) is then the depth, and for each level of depth we increase the number of operations quadratically.

Removal with some ordering help is on average O(K * log_I(N)), though sometimes ordering cannot help you and you are stuck removing something that will require a re-balance. Still, minimizing this is optimal.
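
To make the rebalancing cost measurable, here is a hedged sketch that uses an AVL tree as a stand-in for "a balanced tree" (real filesystems do not use AVL trees, and every name below is my own): it counts rotations while deleting the same key set in sorted versus random order, so you can see how much the order matters for this particular structure.

```python
# Stand-in experiment: an AVL tree (not a filesystem structure) that counts
# rotations, so deletion order can be compared empirically.
import random

class Node:
    __slots__ = ("key", "left", "right", "height")
    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1

rotations = 0                     # global rotation counter

def height(n): return n.height if n else 0
def update(n): n.height = 1 + max(height(n.left), height(n.right))
def balance(n): return height(n.left) - height(n.right)

def rotate_right(y):
    global rotations; rotations += 1
    x = y.left
    y.left, x.right = x.right, y
    update(y); update(x)
    return x

def rotate_left(x):
    global rotations; rotations += 1
    y = x.right
    x.right, y.left = y.left, x
    update(x); update(y)
    return y

def rebalance(n):
    update(n)
    if balance(n) > 1:
        if balance(n.left) < 0:
            n.left = rotate_left(n.left)
        return rotate_right(n)
    if balance(n) < -1:
        if balance(n.right) > 0:
            n.right = rotate_right(n.right)
        return rotate_left(n)
    return n

def insert(n, key):
    if n is None: return Node(key)
    if key < n.key: n.left = insert(n.left, key)
    else:           n.right = insert(n.right, key)
    return rebalance(n)

def delete(n, key):
    if n is None: return None
    if key < n.key:   n.left = delete(n.left, key)
    elif key > n.key: n.right = delete(n.right, key)
    else:
        if n.left is None:  return n.right
        if n.right is None: return n.left
        s = n.right                    # in-order successor
        while s.left: s = s.left
        n.key = s.key
        n.right = delete(n.right, s.key)
    return rebalance(n)

def rotations_for(delete_order, n_keys=10_000):
    global rotations
    root = None
    for k in range(n_keys):
        root = insert(root, k)
    rotations = 0                      # count rotations during deletion only
    for k in delete_order:
        root = delete(root, k)
    return rotations

keys = list(range(10_000))
print("sorted deletes:", rotations_for(keys))
print("random deletes:", rotations_for(random.sample(keys, len(keys))))
```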

EDIT:

Another possible tree ordering scheme:

            8
    6               7   
1       2       3       4

Accomplishing optimal removal under this circumstance would be easier, because we can take advantage of our knowledge of how things are sorted. Under either situation it is possible, and in fact both are identical; under this one the logic is just a little simpler to understand, because the ordering is more human-friendly for the given scenario. In either case, in-order is defined as "remove the farthest leaf first"; in this case it just so happens that the farthest leaves are also the smallest numbers, a fact we could take advantage of to make it even a little more optimal, but this fact is not necessarily true for the file system example presented (though it may be).

MobA11y
  • From what I see, removing in-order leads to imbalance (for example, if we remove 2, 3, 4, then the tree would need to be rebalanced). I don't see how removing in-order helps. – ovgolovin Jul 30 '13 at 19:58
  • The ordering I chose was to demonstrate how balanced binary trees work, not to demonstrate the best ordering for removal or the ordering that file systems would use. The logic is the same independent of the number of children a given node is allowed to have; binary (2-children) trees are just the easiest to demonstrate in text. – MobA11y Jul 30 '13 at 20:01
  • But in that post the author seems to be using plain sorting to decrease the number of rebalancings. – ovgolovin Jul 30 '13 at 20:02
  • I don't care what ordering the author was using; the point of this post is to demonstrate the pain rebalancing causes, why it is good to avoid it, and how rsync can achieve faster performance by doing so! AKA: answering the question: "Could you explain how removing files in-order prevents or reduces the number of btree rebalancings?" "In-order" does not always mean the obvious thing. In this case in-order means the farthest leaf down the tree first, not 1 < 2. – MobA11y Jul 30 '13 at 20:03
  • Conversely, we could change our ordering criterion on the tree itself to make both mean the same thing. :) – MobA11y Jul 30 '13 at 20:09
  • I think it's clear to everybody that rebalancing causes additional operations. The question was about the more practical aspect of how removing **in-order** helps reduce the number of rebalancings. If by "in-order" the author means an order different from sorted order, I would like to know what order it is. The question is about this. – ovgolovin Jul 30 '13 at 20:16
  • About changing the ordering criterion on the tree: the tree is determined by the filesystem, and reasoning about changing the filesystem's btree order may be interesting, but it is kind of impractical. – ovgolovin Jul 30 '13 at 20:18
  • The post answers both aspects. A: rebalancing is hard, obviously. B: removing 2 and 4 before removing 8 is optimal, as it keeps the tree balanced. Your confusion is confusing me. I'm going to add another example of an ordering a tree can have. Basically, all parents at a given level are greater than all children at the level below. It may be easier to see how, if your ordering criterion matches well, accomplishing optimal removal is simple. – MobA11y Jul 30 '13 at 20:21
  • Sorry, the btree is ordered like in your first picture. If you are saying that nodes should be removed in the order from your last picture, then we would have to derive that order in the program (there is no access to the real btree structure on the disc, so in the program we can only guess how the actual tree is ordered to benefit from it). As I understood it, they just removed nodes in plain ascending order and it reduced the number of rebalancings. – ovgolovin Jul 30 '13 at 20:28
  • Yes it is :). You are correct, guaranteeing the ideal ordering is impossible, and a matter of chance without information about where things are stored on the disk. But we know how trees work, so when we see things like groups of files getting added at similar times, or to similar clusters, we can guess that these files were added at a similar level of the tree, and attempt to remove these items at the same time. – MobA11y Jul 30 '13 at 20:34
2

I am not convinced that the number of B-tree rebalancings changes significantly if you delete the files in order. However, I do believe that the number of seeks to different parts of external storage will be significantly smaller if you do this. At any time, the only nodes in the B-tree that need to be visited will then be the far-right boundary of the tree, whereas, with a random order, each leaf block in the B-tree is visited with equal probability for each file.
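
A quick way to see this locality argument is a toy model (entirely illustrative: `KEYS_PER_LEAF` and the key-to-leaf mapping are assumptions of mine, not how a real B-tree packs entries) that counts how often consecutive deletions land in a different leaf block, a rough proxy for head movement:

```python
# Toy locality model: keys packed 100 per "leaf block"; count how often two
# consecutive deletions touch different leaf blocks.
import random

KEYS_PER_LEAF = 100      # assumed leaf capacity
N = 1_000_000            # number of files/keys

def leaf_switches(delete_order):
    switches, last_leaf = 0, None
    for key in delete_order:
        leaf = key // KEYS_PER_LEAF     # which leaf block holds this key
        if leaf != last_leaf:
            switches += 1
            last_leaf = leaf
    return switches

keys = list(range(N))
print("sorted deletes :", leaf_switches(keys))                    # ~N / 100
print("random deletes :", leaf_switches(random.sample(keys, N)))  # ~N
```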

mcdowella
  • Yeah. You are probably right. Btrees are used in hard-drive filesystems for a reason: much faster sequential access. So if we traverse the leaves in order, the result should be much faster. On the other hand, when we delete nodes from those leaves, we end up rebalancing, which means access to other btree leaves, which means random access, which is slow. Then again, rebalancing will occur anyway, but at least we traverse the leaves in order, thus reducing overhead. – ovgolovin Jul 31 '13 at 07:25
1

Rebalancing for B-trees is cheaper than for B+-tree implementations; that's why most filesystem and database index implementations use them.

There are many approaches to deletion; depending on the approach, it can be more efficient in terms of time and the need to rebalance the tree. You'll also have to consider the size of the node, since the number of keys the node can store will affect the need for rebalancing the tree. A large node size will just reorder keys inside the node, but a small one will probably make the tree rebalance many times.

A great resource for understanding this is the famous CLR (Thomas Cormen) book "Introduction to Algorithms".

rudygodoy
  • Thanks! I started this question because I'm really interested in the topic. I got retweets from friends about `rsync`'s fast deleting, which continued in their guessing about why it is so fast (they guessed it uses several threads, etc., etc.). So the topic is really interesting to many programmers. Here on StackOverflow I received a lot of great answers which went deep into details, providing links to source code (example http://stackoverflow.com/a/17845560/862380), and explaining things much better than many books (http://stackoverflow.com/a/9513423/862380)... – ovgolovin Aug 08 '13 at 18:11
  • So I started this question as it may be interesting to many people. And an answer which shows an example of how `rsync` deletes things, with details of what happens at the file system level, will give understanding to all those people and may even make them better specialists in some sense. As for me, learning things from real-life examples is much more interesting and motivating than pursuing them through canonical theoretical books. – ovgolovin Aug 08 '13 at 18:15
0

On storage systems hosting huge directories, the buffer cache will be under stress and buffers may get recycled. So, if you have deletes spaced apart in time, then the number of disk reads needed to bring the btree back into core, into the buffer cache, between deletes, may be high.

If you sort the files to be deleted, you are effectively delaying the deletes and bunching them. This may have the side effect of more deletes per block of btree paged in. If there are stats to say what the buffer cache hits are between the two experiments, they may tell whether this hypothesis is wrong or not.

But if there is no stress on the buffer cache during the deletes, then the btree blocks could stay in core, and my hypothesis is not a valid one.
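
One way to probe this hypothesis without real hardware is a small LRU simulation (all parameters below are guesses of mine, and the key-to-block mapping is a toy model, not a real btree layout): count buffer-cache misses for sorted versus random delete order, with a cache smaller than the directory index.

```python
# LRU buffer-cache sketch: count block misses for sorted vs. random deletes
# when the cache holds fewer blocks than the directory index needs.
from collections import OrderedDict
import random

KEYS_PER_LEAF = 100      # assumed keys per btree block
CACHE_BLOCKS = 500       # assumed buffer-cache capacity, in blocks
N = 1_000_000

def cache_misses(delete_order):
    cache, misses = OrderedDict(), 0
    for key in delete_order:
        block = key // KEYS_PER_LEAF
        if block in cache:
            cache.move_to_end(block)        # refresh LRU position
        else:
            misses += 1                     # block must be read from disk
            cache[block] = True
            if len(cache) > CACHE_BLOCKS:
                cache.popitem(last=False)   # evict least recently used
    return misses

keys = list(range(N))
print("sorted order misses:", cache_misses(keys))
print("random order misses:", cache_misses(random.sample(keys, N)))
```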

lsk