
I want to find all git repositories located directly under some directory, say ~/repo, but not deeper in its subdirectories. Two simple approaches are

find ~/repo -depth 2 -type d -name '.git' | while read repo …

or

for repo in ~/repo/*/.git …

The version using find is orders of magnitude slower than the one with the globbing pattern. I am very surprised by this, because there is no real reason why one method would need more system calls than the other to gather its information. I tried a smarter version of the find invocation

find ~/repo -depth 3 -prune -o -depth 2 -type d -name '.git' -print | while read repo …

without any noticeable improvement. Unfortunately I was not able to trace system calls to figure out how find is working here.

What explains the huge speed difference between these two methods? (The shell is /bin/sh which I believe to be some obsolete version of bash.)

Michaël Le Barbier
  • `/bin/sh` is not *"some obsolete version of `bash`"*. [`sh`](https://en.wikipedia.org/wiki/Bourne_shell) is the original Unix shell interpreter, [`bash`](https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29) is a newer replacement for it (and compatible with it up to some point). Nowadays, `/bin/sh` is usually a symbolic link or a hard link to either [`bash`](https://en.wikipedia.org/wiki/Bash_(Unix_shell)) or [`dash`](https://en.wikipedia.org/wiki/Almquist_shell). Either way, the shell does not affect the performance of `find` in any way. – axiac Jun 25 '15 at 09:24
  • @axiac On Mac OS X Yosemite the executable `/bin/sh` is *actually* an obsolete version of **bash** `/bin/sh --version` answers `GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14) Copyright (C) 2007 Free Software Foundation, Inc.` while current version is 4.3. You misinterpreted my statement! :) – Michaël Le Barbier Jun 25 '15 at 09:27
  • Indeed, on OSX `/bin/sh` is not a link to `bash` but a different file (not even a copy, they have different sizes). But both of them report the same version to me (the one you listed). I never needed the features introduced by version 4, so I guess I misinterpreted the word "obsolete" :-) – axiac Jun 25 '15 at 09:42
  • @axiac If you want to uncover bugs in obsolete bash, doing mildly-complex job management is an excellent start. :D – Michaël Le Barbier Sep 14 '16 at 12:11

2 Answers


You can use:

find ~/repo -maxdepth 2 -mindepth 2 -type d -name '.git'

This reproduces the logic of the globbing more exactly. Also note that -depth with a numeric argument (-depth 2) isn't portable and will not work with GNU find.

Btw, instead of piping into a while loop, I would use the -exec option of find.
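
A minimal sketch of the -exec variant, assuming all that is needed per repository is to run one command on it. The function name list_repos and the throwaway directory layout are made up for illustration, and dirname merely stands in for whatever the loop body would do.

```shell
#!/bin/sh
# list_repos ROOT (hypothetical helper): print the working-tree directory of
# every repository sitting directly under ROOT, without a while-read loop.
list_repos() {
    # -exec runs "dirname" once per matching .git directory.
    find "$1" -mindepth 2 -maxdepth 2 -type d -name .git -exec dirname {} \;
}

# Throwaway tree: two fake repositories and one plain directory.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/alpha/.git" "$demo_root/beta/.git" "$demo_root/gamma/docs"

list_repos "$demo_root" | sort

rm -rf "$demo_root"
```

Unlike a pipe into while read, this form is not confused by whitespace or backslashes in directory names, since each pathname is passed to the command as a single argument.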

hek2mgl
  • There is a difference between `-depth n` and `-depth`. Actually there is no real reason why `-depth n` and `-maxdepth n -mindepth n` should behave differently. Nevertheless the latter approach gives better results. – Michaël Le Barbier Jun 25 '15 at 09:13
  • @MichaelGrünewald You are right. I have reworded the answer. Did you manage to get a system call trace? (I don't have an OSX machine at hand.) – hek2mgl Jun 25 '15 at 09:19
  • @hex2mgl Yes I did, with `sudo dtruss find ~/repo -depth 2 -type d -name '.git'`, and it shows that `find` makes many more calls than it really needs to: it actually calls `lstat` on each file in the subtree! – Michaël Le Barbier Jun 25 '15 at 09:22
  • I can't reproduce this behaviour using the GNU version of find. I've created 10.000 git repos. Both versions, glob and find took nearly the same amount of time (sys, user and real) – hek2mgl Jun 25 '15 at 09:26
  • Can you try to install the GNU version of find on your OSX and compare them again? – hek2mgl Jun 25 '15 at 09:28
  • If `find` unnecessarily checks the files in the 2nd level subdirectories, add `-prune` after `-depth 2` – axiac Jun 25 '15 at 09:50
  • Could you point out the part where it's *explicitly forbidden* by POSIX? – geirha Jul 14 '15 at 18:56
  • @geirha Good point! It looks like I was wrong: it is not explicitly forbidden, `-depth NUMBER` is simply not mentioned there. – hek2mgl Jul 14 '15 at 19:29
1

Update: the test -depth with arguments (-depth 2) is not specified in the documentation of GNU find. It is probably an OSX extension. Don't use it!

Use -mindepth 2 -maxdepth 2 instead, as suggested by @hek2mgl in their answer.


OSX specific

It seems the OSX version of find unnecessarily descends into directories deeper than 2 levels when -depth 2 is used (but this is the correct behaviour, see below).

You can tell it to not do that by adding -prune immediately after -depth 2 (it seems it doesn't have any effect if you put it somewhere else):

find ~/repo -depth 2 -prune -type d -name .git

Some benchmarks:

$ time (find . -depth 4 -prune -type d -name .git | wc -l)
      20

real 0m0.064s
user 0m0.009s
sys  0m0.046s

With -prune moved to the end, it suddenly needs a lot of time to run:

$ time (find . -depth 4 -type d -name .git -prune | wc -l)
      20

real 0m12.726s
user 0m0.325s
sys  0m9.298s

Remarks

On second thought (and after a closer reading of man find), -depth 2 does not require find to stop descending into directories deeper than two levels. It can be part of a more complex condition that requires -depth 2 or something else (e.g. find . -depth 2 -or -name .git).

To force it to stop descending more than two levels you must use either -maxdepth 2 or -depth 2 -prune.

  • -maxdepth tells it to not go deeper than two levels;
  • -depth 2 -prune tells it to stop descending into subdirectories if the directory under examination is two levels deep.

They have equivalent behaviour; choosing one or the other is a matter of preference. I would choose -maxdepth 2 because it is clearer.
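
As a rough illustration of the pruning idea on systems where -depth 2 is unavailable, the sketch below expresses "stop descending at depth two" with -path and -prune and checks that it selects the same directories as -mindepth 2 -maxdepth 2. The -path formulation is a swapped-in technique, not something taken from this answer; the function names and the temporary tree are hypothetical, and the pattern assumes the root path contains no glob characters.

```shell
#!/bin/sh
# Variant 1: let find bound the traversal itself.
repos_by_maxdepth() {
    find "$1" -mindepth 2 -maxdepth 2 -type d -name .git -print
}

# Variant 2: prune every entry whose path has two components below the root.
# "$1/*/*" does not match the root or its children, so those are still
# descended into; anything at depth two matches and is pruned before the
# remaining tests are applied to it.
repos_by_prune() {
    find "$1" -path "$1/*/*" -prune -type d -name .git -print
}

# Throwaway tree with a nested .git that neither variant should report.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/a/.git/objects" "$demo_root/a/src/deep/.git" \
         "$demo_root/b/.git" "$demo_root/c"

echo "maxdepth:"; repos_by_maxdepth "$demo_root" | sort
echo "prune:";    repos_by_prune "$demo_root" | sort

rm -rf "$demo_root"
```

Both variants avoid walking below the second level, which is where the extra lstat calls came from.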

Conclusion

Because -depth 2 is not portable, the final command should be like:

find ~/repo -mindepth 2 -maxdepth 2 -type d -name '.git' -print
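
To see that this command and the glob from the question select the same directories, here is a small sketch that runs both against a throwaway tree. The function names repos_glob and repos_find and the directory layout are invented for the example.

```shell
#!/bin/sh
# Glob variant from the question, wrapped in a function.
repos_glob() {
    for repo in "$1"/*/.git; do
        # The -d test also covers the case where the glob matched nothing
        # and was left unexpanded.
        if [ -d "$repo" ]; then
            printf '%s\n' "$repo"
        fi
    done
}

# find variant from the conclusion.
repos_find() {
    find "$1" -mindepth 2 -maxdepth 2 -type d -name '.git' -print
}

demo_root=$(mktemp -d)
mkdir -p "$demo_root/proj1/.git" "$demo_root/proj2/.git" \
         "$demo_root/proj2/vendor/lib/.git" "$demo_root/notes"

echo "glob:"; repos_glob "$demo_root" | sort
echo "find:"; repos_find "$demo_root" | sort

rm -rf "$demo_root"
```

One difference to keep in mind: the shell's * does not match names starting with a dot, so the glob skips a repository inside a hidden directory such as ~/repo/.hidden, while find reports it.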

Thanks @hek2mgl for mentioning the compatibility issue.

axiac
  • Good explanation. However I would simply use `-maxdepth 2 -mindepth 2` as I have suggested since the `-depth n` option is not portable anyway. – hek2mgl Jun 25 '15 at 10:36
  • @hek2mgl You are right. I updated the answer mentioning `-depth 2` is not portable. Your solution is better. – axiac Jun 25 '15 at 10:55
  • Just for the sake of completeness, the OS X userland is based on FreeBSD, so the **find** program there is the BSD variant of the **find** command. The two implementations are not fully compatible. – Michaël Le Barbier Jun 25 '15 at 11:09
  • @MichaelGrünewald Yeah, that's what I'm saying. – hek2mgl Jun 25 '15 at 11:21