
I am trying to use git grep to search all revisions of a very large repository. The command I am using is:

$ git rev-list --all | xargs git grep -I --threads 10 --line-number \
  --only-matching "SomeString"

I am using the latest official version of git on mac:

$ git --version
git version 2.19.1

It's taking a very long time. Looking at Activity Monitor, git is only using one thread, but the docs say it should use 8 by default. It only uses one thread with or without the --threads <num> option. I don't have any other config set that would override this setting either:

$ git config --list
credential.helper=osxkeychain
user.name=****
user.email=****

Any ideas what I'm missing? Can anybody else use git-grep and confirm that they see multiple threads?

Thanks for any help

msanford
Chris

2 Answers


I wonder if it's because you're using | xargs, which waits for input on stdin. Since the output from git rev-list is a single stream, xargs will, by default, use only one process at a time:

-P max-procs, --max-procs=max-procs
              Run up to max-procs processes at a time; **the default is 1**.  If
              max-procs is 0, xargs will run as many processes as possible
              at a time.

So try increasing it using the above flag:

git rev-list --all | xargs -P 10 git grep -I --threads 1 --line-number \
    --only-matching "SomeString"

This will spawn multiple git grep processes rather than enable a single git grep to use multiple threads, so it is a sort-of-functional answer.
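As a minimal sketch of the approach above, the pipeline can be wrapped in a small shell function. The function name `grep_all_revs`, the default of 4 processes, and the batch size of 100 revisions per invocation are illustrative assumptions, not values from the question. The `-n` option caps how many revisions each `git grep` receives so that `xargs` has several batches to distribute across processes. Lines from different batches may arrive in any order.

```shell
# Hypothetical helper: search every commit reachable from any ref for a
# pattern, using several single-threaded git grep processes in parallel.
# Usage: grep_all_revs PATTERN [PROCS] [BATCH]
# Note: -P is supported by GNU and BSD xargs but is not required by POSIX.
grep_all_revs() {
    pattern=$1
    procs=${2:-4}     # parallel git grep processes (assumed default)
    batch=${3:-100}   # revisions passed to each git grep call (assumed default)

    git rev-list --all |
        xargs -P "$procs" -n "$batch" \
            git grep -I --threads 1 --line-number --only-matching -e "$pattern"
}
```

For example, `grep_all_revs "SomeString" 10` mirrors the command above. `git grep` exits with status 1 when a batch contains no match, so `xargs` (and therefore this function) can return a non-zero status even when other batches did match.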

msanford
  • I suspect it should be `xargs -P 10 git grep -I --threads 1`. – phd Nov 13 '18 at 14:58
  • @phd Yes that would make more sense. I'll update; thanks! This only gives a functionally approximate solution, though, since git grep won't actually be multi-threaded, and that may make a difference for the architecture OP is running on. – msanford Nov 13 '18 at 15:06
  • This makes sense thanks. It does indeed spawn multiple git processes. I think that is OK though. There is presumably a chance the output could get interlaced? Need to use `xargs` as dealing with > 50k commits, if there is another way to invoke git with that many arguments without the shell bombing out I'd be interested as that might be better to leverage git's multithreading. – Chris Nov 13 '18 at 16:20
  • @Chris _There is presumably a chance the output could get interlaced?_ - I was thinking this, too. I'm unsure, to be honest. (In any case, give it a shot, but I wouldn't accept this just yet - leave some time for someone else to contribute something more clever). – msanford Nov 13 '18 at 16:38
  • @msanford Yes I'm running a test now. Previously was taking around 8 hours. Laptop sounds like it's about to take off so definitely maxing out resources now. Suspect your approach will be a lot faster, will reply back with results when this completes. Also will hold off accepting for a while in case anyone knows how to use git's internal threading when passing this many arguments. – Chris Nov 13 '18 at 16:43
  • Going to accept this answer. It reduced the time from 8 hours to just over an hour. – Chris Nov 15 '18 at 11:29
  • @Chris Excellent! – msanford Nov 15 '18 at 13:37

The number of processes to allocate to xargs will depend on the number of threads used by each git grep.

It used to be 8 by default for git grep.

But:

With Git 2.26 (Q1 2020), this is now the number of cores.

See commit f1928f0, commit 70a9fef, commit 1184a95, commit 6c30762, commit c441ea4, commit d799242, commit 1d1729c, commit 31877c9, commit b1fc9da, commit d5b0bac, commit faf123c, commit c3a5bb3 (16 Jan 2020) by Matheus Tavares (matheustavares).
(Merged by Junio C Hamano -- gitster -- in commit 56ceb64, 14 Feb 2020)

grep: use no. of cores as the default no. of threads

Signed-off-by: Matheus Tavares

When --threads is not specified, git grep will use 8 threads by default.

This fixed number may be too many for machines with fewer cores and too little for machines with more cores.
So, instead, use the number of logical cores available in the machine, which seems to result in the best overall performance.

The following measurements correspond to the mean elapsed times for 30 git grep executions in chromium's repository with a 95% confidence interval (each set of 30 were performed after 2 warmup runs).
Regex 1 is 'abcd[02]' and Regex 2 is '(static|extern) (int|double) \*'.

(chromium’s repo at commit 03ae96f (“Add filters testing at DSF=2”, 04-06-2019), after a 'git gc' execution.)

      |          Working tree         |           Object Store
------|-------------------------------|--------------------------------
 #ths |  Regex 1      |  Regex 2      |   Regex 1      |   Regex 2
------|---------------|---------------|----------------|---------------
  32  |  2.92s ± 0.01 |  3.72s ± 0.21 |   5.36s ± 0.01 |   6.07s ± 0.01
  16  |  2.84s ± 0.01 |  3.57s ± 0.21 |   5.05s ± 0.01 |   5.71s ± 0.01
   8  |  2.53s ± 0.00 |  3.24s ± 0.21 |   4.86s ± 0.01 |   5.48s ± 0.01
   4  |  2.43s ± 0.02 |  3.22s ± 0.20 |   5.22s ± 0.02 |   6.03s ± 0.02
   2  |  3.06s ± 0.20 |  4.52s ± 0.01 |   7.52s ± 0.01 |   9.06s ± 0.01
   1  |  6.16s ± 0.01 |  9.25s ± 0.02 |  14.10s ± 0.01 |  17.22s ± 0.01

The above tests were performed in a desktop running Debian 10.0 with Intel(R) Xeon(R) CPU E3-1230 V2 (4 cores w/ hyper-threading), 32GB of RAM and a 7200 rpm, SATA 3.1 HDD.

Below, the tests were repeated for a machine with SSD: a Manjaro laptop with Intel(R) i7-7700HQ (4 cores w/ hyper-threading) and 16GB of RAM:

      |          Working tree          |           Object Store
------|--------------------------------|--------------------------------
 #ths |  Regex 1      |  Regex 2       |   Regex 1      |   Regex 2
------|---------------|----------------|----------------|---------------
  32  |  3.29s ± 0.21 |   4.30s ± 0.01 |   6.30s ± 0.01 |   7.30s ± 0.02
  16  |  3.19s ± 0.20 |   4.14s ± 0.02 |   5.91s ± 0.01 |   6.83s ± 0.01
   8  |  2.90s ± 0.04 |   3.82s ± 0.20 |   5.70s ± 0.02 |   6.53s ± 0.01
   4  |  2.84s ± 0.02 |   3.77s ± 0.20 |   6.19s ± 0.02 |   7.18s ± 0.02
   2  |  3.73s ± 0.21 |   5.57s ± 0.02 |   9.28s ± 0.01 |  11.22s ± 0.01
   1  |  7.48s ± 0.02 |  11.36s ± 0.03 |  17.75s ± 0.01 |  21.87s ± 0.08
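
To illustrate the trade-off stated at the top of this answer, here is a rough sketch that derives an xargs process count from a chosen git grep thread count. The helper name `procs_for_threads` and the rule of thumb (processes × threads ≈ logical cores) are assumptions for illustration, not something the commit prescribes. The thread count is expected to be a positive integer.

```shell
# Hypothetical helper: given the thread count used by each git grep, print how
# many xargs processes fit without exceeding the number of logical cores.
procs_for_threads() {
    threads=$1
    # Logical core count; falls back to 1 if getconf cannot report it.
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    procs=$((cores / threads))
    [ "$procs" -ge 1 ] || procs=1   # always run at least one process
    echo "$procs"
}
```

It could be used as `xargs -P "$(procs_for_threads 1)" git grep --threads 1 ...`, so that the total number of worker threads stays close to the core count whichever Git version supplies the default.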
VonC