4

I often use the find command on Linux and macOS. I just discovered the parallel command, and I would like to combine it with find if possible, because find takes a long time when searching for a specific file in large directories.

I have searched for information on this, but the results are not precise enough: there appear to be a lot of possible syntaxes, and I can't tell which one is relevant.

How do I combine the parallel command with the find command (or any other command) in order to benefit from all 16 cores that I have on my MacBook?

Update

From @OleTange's answer, I think I have found the kind of command that interests me.

So, to learn more about these commands, I would like to know the purpose of the characters {} and ::: in the following command:

parallel -j8 find {} ::: *

1) Are these characters mandatory?

2) How can I insert the classic find options like -type f or -name '*.txt'?

3) For the moment, I have defined the following function in my .zshrc:

ff () {
    find $1 -type f -iname $2 2> /dev/null
}

How could I do the equivalent with a fixed number of jobs (I could also set it as a shell argument)?

  • Your question is too vague to answer. Please add details of the sort of commands you need to use... are you looking for files with specific names or with specific content? It's possible your search is completely I/O-bound and throwing more cores at it will not help. – Mark Setchell Jul 26 '20 at 15:25
  • GNU `parallel` is hardly new any longer. Note though that the syntax changed and so many enthusiastic examples from many years ago no longer work. – tripleee Jul 29 '20 at 05:11
  • @tripleee What syntax changed? – Ole Tange Jul 29 '20 at 07:23
  • @OleTange The original `parallel` by Tollef Fog Heen used dashes where later versions would use triple colons, or some such. There was a bug in Ubuntu where the documentation used the new syntax but the default configuration had the old syntax enabled instead. But this was many years ago. Looks like https://stackoverflow.com/questions/16448887/gnu-parallel-not-working-at-all (though I guess you probably know the back story better than I do, or can guess). – tripleee Jul 29 '20 at 07:25
  • As you are offering a 50 point bounty, I guess you are fairly keen to get a good answer. You can help yourself by replying to my comment above as the type of search you are doing could make a massive difference. – Mark Setchell Jul 29 '20 at 07:49
  • By the way, `make -j 8` runs 8 parallel *processes,* not *threads.* These are different granularities, with different characteristics. Threads within the same process can manipulate shared memory and have less overhead (sometimes threads are called "light processes") whereas processes have only limited facilities for communicating with other processes. – tripleee Jul 29 '20 at 09:20
  • Your latest edit is problematic because it throws the answers you already received out of sync. You are also piling on new stuff which makes this approach closable as too broad, and of course, if you are asking about things which are documented in the manual, you should really demonstrate that you have tried reading it, and are still having trouble understanding some specific part of it. Please consider rolling back at least partially, to keep this answerable and focused. – tripleee Jul 30 '20 at 04:36
  • @tripleee It seems you are aware that GNU Parallel and Tollef's Parallel are two completely different products with no historical overlap (GNU Parallel has roots back to around 2001. I am not sure how old Tollef's parallel is - thus it is unclear which one qualifies as the "original"). So I am puzzled when you write: "Note though that the syntax changed". To me that sounds misleading. Would it not be more correct to say: "Make sure you use `--arg-sep --` if you are using examples written for Tollef's parallel". – Ole Tange Jul 30 '20 at 11:34
  • No, I wasn't aware. I guess I stand corrected; thanks for clarifying. – tripleee Jul 30 '20 at 11:35
  • **GNU Parallel** is unlikely to help you if you are just looking for files with a specific name because your performance is limited by I/O speed, not CPU performance. If you actually want a solution, you would be better off indexing your files with a tool like `locate`. – Mark Setchell Jul 30 '20 at 21:58
  • @MarkSetchell You seem to be repeating suggestions which are already in my answer. – tripleee Aug 03 '20 at 10:50
  • @triplee Sorry, I didn't mean to *"tread on your toes"* at all. In fact, your answer already has my upvote:-) I wrote it because OP has sadly not replied to my very first comment, or any others above, and all discussion in the comments seems centred on **GNU Parallel** which IMHO seems inappropriate for the given problem. So in a way, I was *"signing off"* from any expectation of my providing an answer and saying *"I think you are on the wrong track and as you aren't answering I wash my hands of the question"*. I guess you are suggesting the same approach in a more constructive way than me:-) – Mark Setchell Aug 03 '20 at 11:15

4 Answers

2

Parallel processing makes sense when your work is CPU bound (the CPU does the work, and the peripherals are mostly idle) but here, you are trying to improve the performance of a task which is I/O bound (the CPU is mostly idle, waiting for a busy peripheral). In this situation, adding parallelism will only add congestion, as multiple tasks will be fighting over the already-starved I/O bandwidth between them.

On macOS, the system already indexes all your data anyway (including the contents of word-processing documents, PDFs, email messages, etc); there's a friendly magnifying glass on the menu bar at the upper right where you can access a much faster and more versatile search, called Spotlight. (Though I agree that some of the more sophisticated controls of find are missing; and the "user friendly" design gets in the way for me when it guesses what I want, and guesses wrong.)

Some Linux distros offer a similar facility; I would expect that to be the norm for anything with a GUI these days, though the details will differ between systems.

A more traditional solution on any Unix-like system is the locate command, which performs a similar but more limited task; it will create a (very snappy) index on file names, so you can say

locate fnord

to very quickly obtain every file whose name matches fnord. The index is simply a copy of the results of a find run from last night (or however you schedule the backend to run). The command is already installed on macOS, though you have to enable the back end if you want to use it. (Just run locate locate to get further instructions.)
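For reference, on recent macOS versions the instructions it prints boil down to loading a launchd job along these lines (the exact plist path is from memory and may differ between releases, so treat this as a sketch):

sudo launchctl load -w /System/Library/LaunchDaemons/com.apple.locate.plist

Once the background job has built its database, locate fnord behaves much as it does on Linux.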

You could build something similar yourself if you find yourself often looking for files with a particular set of permissions and a particular owner, for example (these are not features which locate records); just run a nightly (or hourly etc) find which collects these features into a database -- or even just a text file -- which you can then search nearly instantly.
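As a minimal sketch of such a home-grown index (the ~/.findindex file name and the nightly schedule are just illustrations; -ls records permissions, owner, size etc. and works with both GNU and BSD find):

# run nightly from cron or launchd
find "$HOME" -ls > "$HOME/.findindex" 2>/dev/null

# searching the index is then nearly instant, e.g. entries owned by root
grep ' root ' "$HOME/.findindex"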

For running jobs in parallel, you don't really need GNU parallel, though it does offer a number of conveniences and enhancements for many use cases; you already have xargs -P. (The xargs on macOS which originates from BSD is more limited than GNU xargs which is what you'll find on many Linuxes; but it does have the -P option.)

For example, here's how to run eight parallel find instances with xargs -P:

printf '%s\n' */ | xargs -I {} -P 8 find {} -name '*.ogg'

(This assumes the wildcard doesn't match directories which contain single quotes or newlines or other shenanigans; GNU xargs has the -0 option to fix a large number of corner cases like that; then you'd use '%s\0' as the format string for printf.)
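In other words, with GNU xargs the more robust variant would look something like

printf '%s\0' */ | xargs -0 -I {} -P 8 find {} -name '*.ogg'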


As the parallel documentation readily explains, its general syntax is

parallel -options command ...

where {} will be replaced with the current input line (if it is missing, it will be implicitly added at the end of command ...) and the (obviously optional) ::: special token allows you to specify an input source on the command line instead of as standard input.
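As a trivial illustration (not specific to find), these two invocations do the same thing, running one echo per input item:

parallel echo {} ::: a b c
printf '%s\n' a b c | parallel echo {}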

Anything outside of those special tokens is passed on verbatim, so you can add find options to your heart's content just by specifying them literally.

parallel -j8 find {} -type f -name '*.ogg' ::: */

I don't speak zsh, but refactored for regular POSIX sh, your function could be something like

ff () {
    parallel -j8 find {} -type f -iname "$2" ::: "$1"
}

though I would perhaps switch the arguments so you can specify a name pattern and a list of files to search, à la grep.

ff () {
    # "local" is not POSIX but works in many sh versions
    local pat=$1
    shift
    parallel -j8 find {} -type f -iname "$pat" ::: "$@"
}
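
Usage would then be, for example (the paths here are just placeholders):

ff '*.txt' ~/Documents ~/Downloads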

But again, spinning your disk to find things which are already indexed is probably something you should stop doing, rather than facilitate.

tripleee
  • 175,061
  • 34
  • 275
  • 318
2

You appear to want to be able to locate files quickly in large directories under macOS. I think the correct tool for that job is mdfind.

I made a hierarchy with 10,000,000 files under my home directory, all with unique names that resemble UUIDs, e.g. 80104d18-74c9-4803-af51-9162856bf90d. I then tried to find one with:

mdfind -onlyin ~ -name 80104d18-74c9-4803-af51-9162856bf90d

The result was instantaneous and too fast to measure, so I did 100 lookups; they took under 20 s in total, so on average a lookup takes under 0.2 s.
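
One way to reproduce such a measurement, assuming a file names.txt that holds 100 of the generated names (that file name is just a placeholder), would be:

time while read -r n; do mdfind -onlyin ~ -name "$n" > /dev/null; done < names.txt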


If you actually want to locate 100 files, you can group them into a single search like this:

mdfind -onlyin ~ 'kMDItemDisplayName==ffff4bbd-897d-4768-99c9-d8434d873bd8 || kMDItemDisplayName==800e8b37-1f22-4c7b-ba5c-f1d1040ac736 || kMDItemDisplayName==800e8b37-1f22-4c7b-ba5c-f1d1040ac736'

and it executes even faster.


If you only know a partial filename, you can use:

mdfind -onlyin ~ "kMDItemDisplayName = '*cdd90b5ef351*'"
/Users/mark/StackOverflow/MassiveDirectory/800f0058-4021-4f2d-8f5c-cdd90b5ef351

You can also use creation dates, file types, author, video duration, or tags in your search. For example, you can find all PNG images whose name contains "25DD954D73AF" like this:

mdfind -onlyin ~ "kMDItemKind = 'PNG image' && kMDItemDisplayName = '*25DD954D73AF*'"
/Users/mark/StackOverflow/MassiveDirectory/9A91A1C4-C8BF-467E-954E-25DD954D73AF.png

If you want to know what fields you can search on, take a file of the type you want to be able to look for, and run mdls on it and you will see all the fields that macOS knows about:

mdls SomeMusic.m4a
mdls SomeVideo.avi
mdls SomeMS-WordDocument.doc

More examples here.

Also, unlike with locate, there is no need to update a database frequently.

Mark Setchell
  • 191,897
  • 31
  • 273
  • 432
  • I have already tried `mdfind` and it was catastrophic, if I may say so. The construction of the index is a total black box: all we can see with `top` or `htop` is `mdworker_shared` running all the time, we don't know precisely what they do, the database itself is also completely opaque, and a search only finds its result about one time in two, so I am disappointed by this tool. I have also tried multiple times to reindex the database from scratch, to get a new, full and complete index of files (the database doesn't only contain the filenames). –  Aug 04 '20 at 21:57
  • But some problems remain, since some files are found and others are not, so I think I am going to give up on `mdfind` unless it is the only tool I have available. –  Aug 04 '20 at 21:57
1

Just run a background find for each first-level path separately.

The example below launches one find per subdirectory (12 of them in this case):

 $ for i in [A-Z]*/ ; do find "$i" -name "*.ogg" & >> logfile ; done 
[1] 16945
[2] 16946
[3] 16947
# many lines
[1]   Done                    find "$i" -name "*.ogg"
[2]   Done                    find "$i" -name "*.ogg"
#many lines
[11]   Done                    find "$i" -name "*.ogg"
[12]   Done                    find "$i" -name "*.ogg"
 $

Doing so creates many find processes, which the system will dispatch across the different cores like any other processes.

Note 1: it looks like a rather crude way to do it, but it just works.

Note 2: the find command itself is not hard on the CPUs/cores; in 99% of use cases parallelizing it is simply useless, because the find process spends its time waiting for I/O from the disks. Using parallel or similar commands therefore won't help.
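
A slightly tidier variant, following the redirection suggested in the comments below (the trailing wait makes the shell block until all the background finds have finished), could be:

for i in [A-Z]*/ ; do find "$i" -name "*.ogg" & done > logfile
wait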

tripleee
  • 175,061
  • 34
  • 275
  • 318
francois P
  • 306
  • 6
  • 20
  • 1
    I fixed the shell quoting, but you probably also want to change the redirection, though I'm not entirely sure what you hope for it to accomplish. Probably you want `done >logfile` here? – tripleee Jul 29 '20 at 05:05
  • 1
    If you want the `Done` output to show which `"$i"` it's done with, you have to add a fugly `eval`. That will always get you some shrieks from the security-conscious, but if you first check that you have no surprising matches from the wildcard, it should be acceptable. – tripleee Jul 29 '20 at 05:06
  • 1
    The wildcard might not match anything at all, and is certainly not guaranteed to produce exactly twelve matches. I guess you happened to have 12 directories whose name started with a capital letter? – tripleee Jul 29 '20 at 07:38
1

As others have written, find is I/O-heavy and most likely not limited by your CPUs.

But depending on your disks it can be better to run the jobs in parallel.

NVMe disks are known for performing best if there are 4-8 accesses running in parallel. Some network file systems also work faster with multiple processes.

So some level of parallelization can make sense, but you really have to measure to be sure.
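
A rough but simple check is to time a plain run against a parallelized one of the kind shown below (the *.ogg pattern and the -j8 job count are just placeholders; run each a couple of times, since the first run also warms the file system cache):

time find . -name '*.ogg' > /dev/null
time parallel -j8 find {} -name '*.ogg' ::: */ > /dev/null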

To parallelize find with 8 jobs running in parallel:

parallel -j8  find {} ::: *

This works best if you are in a dir that has many subdirs: Each subdir will then be searched in parallel. Otherwise this may work better:

parallel -j8  find {} ::: */*

Basically the same idea, but now using subdirs of dirs.

If you want the results printed as soon as they are found (and not after the find is finished) use --line-buffer (or --lb):

parallel --lb -j8  find {} ::: */*

To learn about GNU Parallel spend 20 minutes reading chapter 1+2 of https://doi.org/10.5281/zenodo.1146014 and print the cheat sheet: https://www.gnu.org/software/parallel/parallel_cheat.pdf

Your command line will thank you for it.

Ole Tange
  • 31,768
  • 5
  • 86
  • 104
  • Thanks for your fast reply. That's the kind of command I would like to use. I am going to put an **UPDATE 1** in my original post to better explore these commands, especially the functionality of the `:::` characters in the different commands you gave, and how to insert classic options of `find`. Keep me informed please, regards –  Jul 29 '20 at 23:18
  • 1
    @youpilat13 ::: and {} are explained in chapter 2: https://doi.org/10.5281/zenodo.1146014 – Ole Tange Jul 30 '20 at 11:41