I've read a number of articles and answers here, but they weren't helpful.

I know that git uses mtime and ctime to detect that a file was changed without reading it. That makes sense, but:

  • Running lstat on each file in my repo takes 79 seconds, but git does that in less than a second
  • How does it detect added or removed files without scanning the whole directory tree?

I tried looking into sources of diff-index but they seem to be quite complicated.

Please note that this is not a duplicate of How does git detect that a file has been modified?. I get that git uses mtime and ctime. I wonder how git can get them so fast. Or maybe git doesn't compute them each time you run git diff? That's the point of this question.

LNK
  • Does [this](https://stackoverflow.com/questions/2869213/how-does-git-save-space-and-is-fast-at-the-same-time) help? – Asif Kamran Malick Mar 31 '21 at 15:31
  • @AsifKamranMalick, thanks for the link! But no, it doesn't have the info I seek. I'm interested in precisely how git manages to detect file creations / deletions in deeply nested directories without traversing the whole tree (or traversing it so fast?) – LNK Mar 31 '21 at 15:49
  • May [this](https://stackoverflow.com/questions/1778862/how-does-git-detect-that-a-file-has-been-modified) help you? – Pat. ANDRIA Mar 31 '21 at 17:12
  • @Pat.ANDRIA, of course I read that question before asking :) it just states that git uses mtime and ctime, but it's not clear whether it runs lstat on each file? If not, then how exactly does it work. – LNK Mar 31 '21 at 21:39
  • Git does depend somewhat on fast `lstat` system calls, so much so that there is `git update-index --assume-unchanged` for systems with slow `lstat` system calls. Modern Git, though, has various other tricks. The biggest one for *some* systems is file system monitoring. You should add Git version and OS details. – torek Mar 31 '21 at 21:44
  • `how exactly does it work` Start browsing from here: https://github.com/git/git/blob/142430338477d9d1bb25be66267225fb58498d92/wt-status.c#L754 [There is also](https://github.com/git/git/blob/9198c13e34f6d51c983b31a9397d4d62bc2147ac/diff-lib.c#L292) a whole [file cache built](https://github.com/git/git/blob/a65ce7f831aa5fcc596c6d23fcde543d98b39bd7/read-cache.c#L1413) that caches all files, as I understand. – KamilCuk Mar 31 '21 at 21:46
  • @torek, I am using git 2.24 and macOS. So, in order to detect changes git still has to traverse the whole directory tree and call `lstat` on each file? I have thousands of directories and it is still blazing fast – LNK Apr 01 '21 at 11:59
  • I'm pretty sure Git does not have any built-in fsmonitor for OS X, but you can check with `git config --get core.fsmonitor`. (My older Git on macOS does not have one.) Git has a few tricks up its sleeve, such as stat-ing an entire directory and not looking inside it if there cannot be new files in it that should be reported. But since you have macOS, you can use [dtrace or dtruss to watch Git in action](https://stackoverflow.com/q/39189347/1256452). – torek Apr 01 '21 at 22:11
  • Just as a datapoint, a naive C program that recursively traverses a directory tree, calling `lstat` on every single file, takes approximately 0.1 seconds to execute for the Linux kernel source (running on my `Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz` hardware), which contains close to 80,000 files. This is without any sort of optimization or caching or anything. I think you may be underestimating the speed of modern computers. – larsks Apr 03 '21 at 23:42
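
The datapoint in the last comment is easy to reproduce. Below is a minimal sketch in Python rather than C (so absolute timings will differ); the function names `lstat_tree` and `timed_lstat_tree` are made up for illustration. It walks a tree without any caching, calls `lstat` on every entry, and reports how many entries it touched and how long that took.

```python
import os
import time


def lstat_tree(root):
    """Walk `root` recursively, lstat every entry, return how many were stat'ed."""
    count = 0
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                os.lstat(entry.path)  # the system call being measured
                count += 1
                # Descend into real directories only, never through symlinks.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
    return count


def timed_lstat_tree(root):
    """Return (number of entries stat'ed, elapsed seconds) for `root`."""
    start = time.perf_counter()
    count = lstat_tree(root)
    return count, time.perf_counter() - start
```

Running `timed_lstat_tree(".")` twice in a row in a large checkout usually shows the effect of the OS cache: the second run tends to be much faster than the first.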

1 Answer

Long answer short:

strace -fostrace.log git diff-index --quiet @
vi strace.log

At least when there's a lot to do, it fires off a big-batch-o'-threads issuing `stat`s in parallel, so the filesystem has a lot of pending requests and gets the opportunity to prioritize for throughput.
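
To make the idea concrete, here is a rough sketch of issuing many `lstat` calls from a pool of threads. It is written in Python and is not how Git implements it; `stat_one`, `parallel_lstat` and the default of 8 workers are assumptions chosen for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def stat_one(path):
    """lstat a single path; a missing file yields None instead of raising."""
    try:
        return path, os.lstat(path)
    except (FileNotFoundError, NotADirectoryError):
        return path, None


def parallel_lstat(paths, workers=8):
    """lstat all `paths` from a thread pool and return {path: stat result or None}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker blocks in its own system call, so several requests
        # are pending at the filesystem at the same time.
        return dict(pool.map(stat_one, paths))
```

The benefit shows up mostly on a cold cache or a slow filesystem; when everything is already cached in memory, a single thread is often just as fast.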

Also:

> git still has to traverse the whole directory tree

No, it doesn't. To tell the truth, that's the reason the index is called "the cache". All the names it cares about (and the data from the last time it looked) it reads in one big fat read right up front. .git/index is 5MB for a full Linux checkout, so that's going to be like two seeks: very few ms even the first time on an HDD, when that means hunting it up and siphoning its wiggly bits off a platter.
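
The "read the list once, then stat each listed file" idea can be sketched as follows. This is a deliberately simplified stand-in for the index, not Git's on-disk format or comparison logic: the snapshot is just a dict of path to `(mtime, ctime, size)`, and `take_snapshot` and `check_snapshot` are hypothetical names.

```python
import os


def take_snapshot(paths):
    """Record (mtime, ctime, size) for each path, playing the role of the index."""
    snapshot = {}
    for path in paths:
        st = os.lstat(path)
        snapshot[path] = (st.st_mtime_ns, st.st_ctime_ns, st.st_size)
    return snapshot


def check_snapshot(snapshot):
    """Stat only the recorded paths; no directory is ever listed."""
    changed, deleted = [], []
    for path, recorded in snapshot.items():
        try:
            st = os.lstat(path)
        except FileNotFoundError:
            deleted.append(path)
            continue
        if (st.st_mtime_ns, st.st_ctime_ns, st.st_size) != recorded:
            changed.append(path)
    return changed, deleted
```

Note what this cannot see: a file created after the snapshot is never reported, because nothing ever asks the OS what is in a directory.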

jthill
  • I don't understand your answer. How can Git know if a file has been added, removed, or changed without actually traversing the whole directory tree and comparing what it finds against the cache? The only way that could be true is if git somehow has hooks into the OS so that whenever the OS changes a file, `.git/index` is somehow updated. – CryptoFool Apr 04 '21 at 01:28
  • Removed files show up because it does a stat for everything listed in the index. Files that are new in the filesystem but not git-added, i.e. untracked, it doesn't care about by definition. Tracked files are by definition in the index, and any change since checkout/add shows up because it does a stat for them. – jthill Apr 04 '21 at 01:51
  • What it doesn't have to do is ask the OS to walk the directories. It can read in the index in one swell foop, then do a stat per indexed file and focus on finding whatever approach gets that done fastest. That approach is the one strace shows it taking. – jthill Apr 04 '21 at 01:53
  • I don't know the macOS equivalent of `sudo tee /proc/sys/vm/drop_caches <<<3`. Writing `3` to that file gets Linux to flush all the things; diff-index after that is predictably slow. – jthill Apr 04 '21 at 01:56
  • @CryptoFool oops, sorry, didn't notice you weren't op – jthill Apr 04 '21 at 02:06
  • Thanks for your answer! But after reading the answer and the comments, it's still not clear to me how git detects untracked (newly added) files without traversing the directory tree – LNK Apr 05 '21 at 09:07
  • I didn't answer for that case, perhaps you weren't aware you'd specifically excluded it from consideration? diff-index, the command you were investigating, works off precanned file lists (written tree or current index) and does not detect new, untracked files at all. – jthill Apr 05 '21 at 10:17
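
For contrast with the last comment, finding untracked files does require listing directories, because an untracked file is by definition absent from any precanned list. The sketch below illustrates only that point; it ignores `.gitignore` rules and any caching Git may apply, and `find_untracked` and its `tracked` argument (a set of paths relative to the root, using `/` separators) are assumptions made for illustration.

```python
import os


def find_untracked(root, tracked):
    """Return files under `root` that are not in the `tracked` set."""
    untracked = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Never descend into the repository metadata directory.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if rel not in tracked:
                untracked.append(rel)
    return sorted(untracked)
```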