
I'm watching files for changes using inotify events (as it happens, from Python, calling into libc).

For some files during a git clone, I see something odd: I see an IN_CREATE event, and I can see via ls that the file has content; however, I never see IN_MODIFY or IN_CLOSE_WRITE. This causes me issues since I would like to respond to IN_CLOSE_WRITE on the files: specifically, to initiate an upload of the file contents.

The files that behave oddly are in the .git/objects/pack directory, and they end in .pack or .idx. Other files that git creates have a more regular IN_CREATE -> IN_MODIFY -> IN_CLOSE_WRITE chain (I'm not watching for IN_OPEN events).

This is inside docker on MacOS, but I have seen evidence of the same with docker on Linux on a remote system, so my suspicion is that the MacOS aspect is not relevant. I see this when the watcher and the git clone run in the same docker container.

My questions:

  • Why are these events missing on these files?

  • What can be done about it? Specifically, how can I respond to the completion of writes to these files? Note: ideally I would like to respond when writing is "finished" to avoid needlessly/(incorrectly) uploading "unfinished" writing.


Edit: Reading https://developer.ibm.com/tutorials/l-inotify/ it looks like what I'm seeing is consistent with

  • a separate temporary file, with a name like tmp_pack_hBV4Alz, being created, modified, and closed;
  • a hard link being created to this file with the final .pack name;
  • the original tmp_pack_hBV4Alz name being deleted.

I think my problem, which is trying to use inotify as a trigger to upload files, then reduces to noticing that the .pack file is a hard link to another file, and uploading in this case?

Michal Charemza
  • The answer might be somewhere [here](https://github.com/git/git/blob/master/packfile.c)... – choroba Jan 22 '20 at 17:17
  • @choroba You might be right... I see lots of references to mmap, and inotify does not report mmap access to files – Michal Charemza Jan 22 '20 at 17:36
  • BTW what is the original problem you're trying to solve (with inotify)? Maybe there exists some more robust solution than trying to second-guess what a Git process is doing/has done to a repository? – kostix Jan 22 '20 at 17:42
  • @kostix This is part of https://github.com/uktrade/mobius3, syncing users’ home folders from containers running JupyterLab or RStudio in AWS Fargate, to and from S3, and in those home folders there can be .git folders. I know the inotify solution won’t ever be “robust-robust”... but I am hoping it can be “robust enough”. – Michal Charemza Jan 22 '20 at 17:49
  • Is this relevant to your use-case? Particularly the accepted answer? https://stackoverflow.com/questions/17123108/notify-signal-when-memory-mapped-file-modified – tink Jan 22 '20 at 18:41
  • @tink It looks like the accepted answer is a patch on the Linux kernel? I suspect it would work in general, but in my case on Fargate I don't have that control. (And I admit I slightly fear the consequences of depending on a patched kernel in the long term, even if I had that power...) – Michal Charemza Jan 22 '20 at 19:08
  • Heh. Fair enough. Guess I would dread that, too. – tink Jan 22 '20 at 19:12
  • The folks working on Git at Microsoft have recently introduced a new tool, [Scalar](https://lore.kernel.org/git/b25ebb55-a533-4c5a-43cc-667bf88bc1d5@gmail.com/) which makes extensive using of filesystem monitoring; you might find it interesting. – kostix Feb 14 '20 at 12:39

5 Answers

5

To answer your question separately for git 2.24.1 on Linux 4.19.95:

  • Why are these events missing on these files?

You don't see IN_MODIFY/IN_CLOSE_WRITE events because a local git clone will use hard links for files under the .git/objects directory whenever it can. When cloning over the network or across file system boundaries, these events appear again.

  • What can be done about it? Specifically, how can I respond to the completion of writes to these files? Note: ideally I would like to respond when writing is "finished" to avoid needlessly/(incorrectly) uploading "unfinished" writing.

In order to catch modification of hard links, you have to set up a handler for the inotify CREATE event that follows and keeps track of those links. Note that a plain CREATE can also mean that a non-empty file was created. Then, on IN_MODIFY/IN_CLOSE_WRITE for any of the files, you have to trigger the same action on all linked files as well. Obviously you also have to remove that relationship on the DELETE event.
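A minimal sketch of the link-detection part, assuming the watcher already receives IN_CREATE paths (the function name and the upload policy are illustrative, not part of any library):

```python
import os

def is_hardlink_to_existing_file(path):
    """Called on IN_CREATE: a link count above 1 means the new name was
    created with link(2) to data that was already fully written, so no
    IN_MODIFY or IN_CLOSE_WRITE will ever arrive for this name and it
    can be treated as write-complete (e.g. safe to upload)."""
    try:
        return os.stat(path).st_nlink > 1
    except FileNotFoundError:
        # git may already have renamed or deleted the name again
        return False
```

A CREATE handler would call this and, on True, schedule the same action it normally runs on IN_CLOSE_WRITE.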

A simpler and more robust approach would probably be to just periodically hash all the files and check if the content of a file has changed.
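A sketch of that polling approach (function names are illustrative; a real watcher would also want to handle deleted paths):

```python
import hashlib
import os

def snapshot(root):
    """Map each file path under root to a digest of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    digests[path] = hashlib.sha256(f.read()).hexdigest()
            except OSError:
                # file vanished between walk and open; skip it
                continue
    return digests

def changed_paths(old, new):
    """Paths that are new or whose contents differ since the last poll."""
    return [p for p, digest in new.items() if old.get(p) != digest]
```

Comparing two snapshots taken some interval apart catches every write path, including mmap and hard links, at the cost of re-reading the tree.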


Correction

After checking the git source code closely and running git under strace, I found that git does use memory-mapped files, but mostly for reading content; see the usage of xmmap, which is always called with PROT_READ only. Therefore my previous answer below is NOT correct. Nevertheless, for informational purposes, I would still like to keep it here:

  • You don't see IN_MODIFY events because packfile.c uses mmap for file access and inotify does not report modifications for mmaped files.

    From the inotify manpage:

    The inotify API does not report file accesses and modifications that may occur because of mmap(2), msync(2), and munmap(2).
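To illustrate the write path that the manpage is talking about, here is a small self-contained demonstration of storing data through a memory mapping rather than via write(2); this only shows that the bytes reach the file without any write(2) call, it does not itself watch anything:

```python
import mmap
import os
import tempfile

# Write through a memory mapping: the data reaches the file without a
# write(2) call, which is why inotify reports no IN_MODIFY for it.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 5)            # mmap needs a non-zero-length file
with mmap.mmap(fd, 5) as m:    # default: shared, read-write mapping
    m[:5] = b"hello"           # store via the mapping, not write(2)
    m.flush()                  # push the dirty page to the file
os.close(fd)

with open(path, "rb") as f:
    content = f.read()
```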

Ente
  • My changes detection mechanism depends on `IN_CLOSE_WRITE`, which I think would still be triggered when closing a file that was written to using `mmap`, because the file would have had to have been opened in a write mode? – Michal Charemza Jan 27 '20 at 09:26
  • I have to investigate this, but I would suspect that a memory-mapped file does not trigger any inotify events at all. Most of the inotify events are linked to the state of the file descriptor, but when you `mmap` a file things can get a bit out of order. For example, you can still write to a closed file descriptor when you have the file mapped into memory. – Ente Jan 27 '20 at 18:20
  • Scratch that, I just tested [this example implementation](https://gist.github.com/marcetcheverry/991042#file-mapwrite-c) and I do get a `CLOSE_WRITE,CLOSE` even if I remove the `close` and `munmap` at the end. I'll have to dig deeper into the actual git implementation then... – Ente Jan 27 '20 at 18:35
  • Hmm, I am struggling a bit to reproduce your issue. In my tests with `inotifywait` and `git clone` (2.24.1) I do get an `OPEN` -> `CLOSE_NOWRITE,CLOSE` for the `*.idx` files. Maybe you forgot to set up a handler for `CLOSE_NOWRITE,CLOSE`? Note: you'll get a `*NOWRITE*` because all the writes happened through the memory-mapped area. – Ente Jan 27 '20 at 18:50
  • Yes, there are `CLOSE_NOWRITE`: the issue is I don't see `IN_CLOSE_WRITE`, and I would like to respond to file "changes" to trigger an upload, but ignore file "reads". Note, I actually think right now the mmap+inotify limitation is a bit of a red-herring. I think issue is that the `.pack`/`.idx` files are initially created as hard links to another file, and so only trigger `IN_CREATE` (and the `OPEN` -> `CLOSE_NOWRITE` happens later when git is actually reading the files). – Michal Charemza Jan 28 '20 at 11:05
  • Agree, when you do a local clone on the same partition without `--no-hardlinks` git just creates hardlinks and you will never see a `CLOSE_NOWRITE`. I think for your use case you will have to schedule an upload of the file at least on `CLOSE_WRITE` and `CREATE`. The latter is there to catch the case when a user just creates a hardlink. – Ente Jan 29 '20 at 09:04
2

I may speculate that Git most of the time uses atomic file updates, which are done like this:

  1. A file's contents are read into memory (and modified).
  2. The modified contents are written into a separate file (usually located in the same directory as the original one, and having a randomized (mktemp-style) name).
  3. The new file is then rename(2)-ed over the original one; this operation guarantees that every observer trying to open the file using its name will get either the old contents or the new.

Such updates are seen by inotify(7) as IN_MOVED_TO events, since a file "reappears" in a directory.
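The steps above can be sketched as follows (the helper name is illustrative):

```python
import os
import tempfile

def atomic_write(path, data):
    """Write data to a temporary file in the same directory, then
    rename(2) it over the final name. Watchers on the directory see
    this as IN_MOVED_TO rather than IN_MODIFY/IN_CLOSE_WRITE on path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix="tmp_")
    try:
        os.write(fd, data)
        os.fsync(fd)          # make sure the contents hit disk first
    finally:
        os.close(fd)
    os.rename(tmp, path)      # atomic: readers see old or new contents
```

The temporary file must live on the same filesystem as the target, otherwise rename(2) fails with EXDEV and the operation is no longer atomic.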

kostix
  • Ah, for some files I think it does this: I see the various `IN_MOVED_FROM` and `IN_MOVED_TO` events. However, I don't see this happening for the `.pack` and `.idx` files – Michal Charemza Jan 22 '20 at 17:37
  • Pack files may be huge (several gigabytes; up to 2 GiB at least, I believe); wielding them using atomic updates might be prohibitive on storage space, so they might be updated using some other strategy. – kostix Jan 22 '20 at 17:40
2

Based on this accepted answer, I'd assume there might be some difference in the events depending on the protocol being used (e.g. ssh or https).

Do you observe the same behavior when monitoring a clone from the local filesystem with the --no-hardlinks option?

$ git clone git@github.com:user/repo.git
# set up watcher for new dir
$ git clone --no-hardlinks repo new-repo

Your observed behavior when running the experiment on both a Linux and a Mac host probably eliminates this open issue as the cause (https://github.com/docker/for-mac/issues/896), but I'm adding it just in case.

deric4
2

There is another possibility (from man inotify):

Note that the event queue can overflow. In this case, events are lost. Robust applications should handle the possibility of lost events gracefully. For example, it may be necessary to rebuild part or all of the application cache. (One simple, but possibly expensive, approach is to close the inotify file descriptor, empty the cache, create a new inotify file descriptor, and then re-create watches and cache entries for the objects to be monitored.)

And since git clone can generate a heavy event flow, this can happen.

How to avoid this:

  1. Increase the kernel's event queue limit via the fs.inotify.max_queued_events sysctl (an inotify descriptor is not a pipe, so fcntl(F_SETPIPE_SZ) will not help here).
  2. Read events into a big buffer in a dedicated thread, and process them in another thread.
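Point 2 can be sketched with a dedicated reader thread handing chunks to an unbounded queue; to keep the sketch self-contained, a pipe stands in here for the inotify file descriptor:

```python
import os
import queue
import threading

def drain(fd, out_queue, chunk_size=64 * 1024):
    """Dedicated reader: pull data off the fd in large chunks as fast as
    possible so the kernel-side event queue stays short. A slow consumer
    then works from out_queue instead of backing up the fd."""
    while True:
        data = os.read(fd, chunk_size)
        if not data:          # EOF (a real inotify fd would block instead)
            break
        out_queue.put(data)

# Demo: a pipe stands in for the inotify file descriptor.
r, w = os.pipe()
events = queue.Queue()
reader = threading.Thread(target=drain, args=(r, events), daemon=True)
reader.start()
os.write(w, b"event-bytes")
os.close(w)
reader.join()
received = events.get_nowait()
```

The consumer thread then parses event structures out of each chunk at its own pace.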
2

Maybe you made the same mistake I made years ago. I've only used inotify twice. The first time, my code simply worked. Later, I no longer had that source and started again, but this time I was missing events and did not know why.

It turned out that when I was reading an event, I was really reading a small batch of events. I parsed the one I expected and assumed that was all. Eventually I discovered there was more in the received data, and once I added a little code to parse all the events returned by a single read, no more events were lost.
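The fix amounts to walking the whole buffer returned by one read. A sketch of that parsing loop, assuming the standard inotify_event layout (int wd, then uint32 mask, cookie, and len, followed by a NUL-padded name of len bytes):

```python
import struct

_EVENT_HEADER = struct.Struct("iIII")  # wd, mask, cookie, len

def parse_events(buf):
    """A single read(2) on an inotify fd can return many events back to
    back; walk the buffer until it is exhausted."""
    events = []
    offset = 0
    while offset < len(buf):
        wd, mask, cookie, name_len = _EVENT_HEADER.unpack_from(buf, offset)
        offset += _EVENT_HEADER.size
        # the name field is NUL-padded to name_len bytes (may be empty)
        name = buf[offset:offset + name_len].split(b"\0", 1)[0].decode()
        offset += name_len
        events.append((wd, mask, name))
    return events
```

Stopping after the first header is exactly the bug described above: every event after the first in that read is silently dropped.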

donjuedo