
I would like to benefit from the full potential of the parallel command on macOS (there seem to be two versions, GNU Parallel and Ole Tange's version, but I am not sure).

With the following command:

parallel -j8  find {} ::: *

I will get good performance if I am in a directory containing 8 subdirectories. But if all of these subdirectories have little content except for a single one, only one thread will still be working, on that one "big" directory.

  1. Is there a way to keep the parallelization going for this "big" directory? I mean, can the one remaining thread be helped by the other threads (the ones that previously worked on the small subdirectories)?

    The ideal case would be for the parallel command to "switch over" automatically once all the small subdirectories have been processed by the find command above. Maybe I am asking too much?

  2. Another potential optimization, if it exists: considering a common directory tree structure, is there a way, similar to for example the command make -j8, to assign each running thread to a sub-(sub-(sub-...)) directory, and once that directory has been explored (don't forget, I would mostly like to use this optimization with the Linux find command), to have the thread move on to another unexplored sub-(sub-(sub-...)) directory?

    Of course, the total number of running threads would never exceed the number specified on the parallel command line (parallel -j8 in my example above): if the number of tree elements (1 node = 1 directory) is greater than the number of threads, we simply cannot go over that number.

    I know that parallelizing in a recursive context is tricky, but maybe I can gain a significant factor when searching for a file in a big tree structure?

    That's why I took the example of make -j8: I don't know how it is coded, but it makes me think the same could be done with the parallel/find command line at the beginning of my post.

Finally, I would like your advice on these 2 questions and, more generally, on what is and is not currently possible with these suggested optimizations, in order to find a file more quickly with the classic find command.

UPDATE 1: As @OleTange said, I don't know a priori the structure of the directories that I want gupdatedb to index, so it is difficult to know the right maxdepth in advance. Your solution is interesting, but the first execution of find is not multithreaded; it does not use the parallel command. I am a little surprised that a multithreaded version of gupdatedb does not exist: on paper it is feasible, but once you try to code it into the GNU gupdatedb script on macOS 10.15, it is more difficult.

If someone has other suggestions, I will gladly take them!

1 Answer


If you are going to parallelize find, you need to be sure that your disk can deliver the data fast enough.

For magnetic drives you will rarely see a speedup. For RAID, network drives, and SSDs you sometimes will, and for NVMe you often will.

The simplest way to parallelize find is to use */*:

parallel find ::: */*

Or */*/*:

parallel find ::: */*/*

These will search in the sub-sub dirs and the sub-sub-sub dirs, respectively.

They will not search the top dirs, but those can be covered by running a single additional find with the appropriate -maxdepth.
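For example, a minimal sketch of that combination, assuming the search is rooted in the current directory (note that the */* glob also misses hidden top-level entries):

# Everything at depth 2 and below, searched in parallel:
parallel find ::: */*
# One sequential pass for the root and its direct children, which the
# parallel jobs above do not print:
find . -maxdepth 1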

The above solution assumes you know something about the directory structure, so it is not a general solution.

I have never heard of a general solution. It would involve a breadth-first search that starts some workers in parallel. I can see how it could be programmed, but I have never seen it done.

If I were to implement it, it would be something like this (lightly tested):

#!/bin/bash
# Breadth-first parallel find: each round prints one level of the tree,
# and every path found is handed back to a worker that lists its
# children in the next round.

# mktemp replaces tempfile, which is not available on macOS.
tmp=$(mktemp)

# Print the direct children (depth exactly 1) of "$1".
# For a plain file this prints nothing, so files end the recursion.
myfind() {
  find "$1" -mindepth 1 -maxdepth 1
}
export -f myfind

# Level 1: children of the starting directory.
myfind . | tee "$tmp"

# Keep going while the previous level produced any paths.
while [ -s "$tmp" ] ; do
    tmp2=$(mktemp)
    parallel --lb myfind < "$tmp" | tee "$tmp2"
    mv "$tmp2" "$tmp"
done
rm "$tmp"
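If the script above is saved as, say, bfs-find.sh (a file name I am inventing here), it prints every path under the current directory breadth-first on stdout, so it can be piped like an ordinary find:

chmod +x bfs-find.sh
# Hypothetical usage: list everything once, then search the listing.
./bfs-find.sh > filelist.txt
grep 'myfile' filelist.txt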

(PS: I have reason to believe the parallel written by Ole Tange and GNU Parallel are one and the same).

Ole Tange
  • Perhaps once you have results from a single `find`, use them to partition the search space optimally on subsequent runs. In each run, build a secondary index which collects a database of number of files in each subtree, and then split that search space across eight instances when you schedule the next run (a rough sketch of this idea appears after these comments). – tripleee Aug 14 '20 at 03:58
  • @OleTange Thanks for your quick answer. As you said, I don't know a priori the directory structure of what I want `gupdatedb` to index, so it is difficult to know the `maxdepth` in advance. Your solution is interesting, but the first execution of `find` is not multithreaded; it does not use the `parallel` command. I am a little surprised that a multithreaded version of `gupdatedb` does not exist: on paper it is feasible, but once you try to code it into the GNU `gupdatedb` script on macOS 10.15, it is more difficult for me, as you can see. –  Aug 15 '20 at 21:06
  • @youpilat13 The first call to `find` only searches the current dir. Unless all your files are stored in the current dir, it will not account for the majority of the runtime. – Ole Tange Aug 15 '20 at 21:42
  • @OleTange OK, so how could I adapt your script to include it in the `gupdatedb` macOS command? From the beginning, I have just wanted to index all files from the root `/`, except specific directories that are of no interest and that I can exclude with the `--prunepaths` option of `gupdatedb`. Finally, my command to index all the wanted content is: `sudo time gupdatedb --prunepaths='/private/tmp /private/var/folders /private/var/tmp */Backups.backupdb /System /Volumes' --localpaths='/' --output=$HOME/locatedb_gupdatedb_PARALLEL`. –  Aug 15 '20 at 21:59
  • @OleTange With the sequential version, the whole indexing takes roughly 30 minutes. I hope to gain a factor of 2 or 3 with a parallel version, but maybe I am too optimistic. Do you understand the problem better now? Regards. PS: by the way, thanks for writing the GNU `parallel` command; I have just found out that you are the author. –  Aug 15 '20 at 21:59
  • @youpilat13 If the disk is the limiting factor you will not see any speedup - no matter how much you parallelize. On the contrary you may see some slowdown. – Ole Tange Aug 15 '20 at 22:22
  • @OleTange So we can't do the same thing with the `parallel` command as with `make -j8` (just an example with 8 threads)? –  Aug 17 '20 at 20:14
  • @OleTange Do you agree that `make -j8` is a recursive parallelization and not a classical parallelization like running `parallel -j8 command`? –  Aug 17 '20 at 20:16
  • @youpilat13 GNU Parallel used to be implemented using `make -j` (see https://www.gnu.org/software/parallel/history.html). It is the same kind of parallelization, and you are unlikely to see a major speedup from `make -j8` if you cannot get it with GNU Parallel. – Ole Tange Aug 17 '20 at 20:56
  • @OleTange I have awarded you the bounty since you have suggested a lot of things to me. Concerning `make -j8`: when I have big source directories, it is very fast. I thought I could reproduce this parallelization with the `gupdatedb` tool, but you seem to say it is not possible. Maybe later, if someone has the same problem, they will be able to help me mix `gfind`/`parallel` correctly into the command in the `gupdatedb` script. I am still open to other suggestions. Regards –  Aug 20 '20 at 18:42
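Following up on tripleee's comment above, here is a rough, untested sketch of that idea (the depth-2 granularity, the file name subtree_counts.txt, and -j8 are my own assumptions, not anything from the thread): one full scan builds a per-subtree file count, and the next run schedules the biggest subtrees first so parallel's job queue stays balanced.

#!/bin/bash
# Sketch of tripleee's suggestion (assumed details, untested):
# 1. Count the files below each depth-2 directory. Paths look like
#    ./top/sub/..., so fields 2 and 3 name the subtree.
find . -mindepth 3 |
    awk -F/ '{ count[$2 "/" $3]++ }
             END { for (d in count) print count[d], d }' |
    sort -rn > subtree_counts.txt   # hypothetical index file

# 2. On the next run, hand the subtrees to parallel largest-first, so
#    long jobs start early and the queue stays balanced. -mindepth 1
#    avoids reprinting the subtree roots, which the shallow find below
#    already covers. (Names containing newlines are not handled.)
cut -d ' ' -f 2- subtree_counts.txt | parallel -j8 find {} -mindepth 1

# 3. Cover the root and the first two levels sequentially.
find . -maxdepth 2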