
I am using Git for Windows (version 2.15, but the same issue occurs in 2.14 and, I think, in older versions as well) and I noticed a rather annoying behavior: when I perform some basic git operations, the modification date of the .git/objects/pack/pack-*.pack file changes. The file's contents remain unchanged, but the last-modified timestamp gets updated, which makes my backup software think the file was changed and needs to be added to my differential backup. Because my .pack files are rather large, this increases the size of my daily backups significantly. Is there a way to prevent this behavior? That is, can I keep the pack file completely unchanged, including its metadata, until I perform a git gc or git repack?

Unfortunately, I wasn't able to pinpoint which operation causes this behavior. When it happened today, I had only used git status, git log, git add, git mv and git commit, and the date/time still changed. But when I tried to replicate the behavior on a copy restored from yesterday's backup, the date change didn't occur. I guess next time I will run Process Monitor and watch accesses to the file, but in the meantime, does anyone have an idea of what might be causing this problem? Thanks.

pepak
  • Possible duplicate of [Git 2.2.x updates timestamps of old pack files for no good reason](https://stackoverflow.com/questions/27454259/git-2-2-x-updates-timestamps-of-old-pack-files-for-no-good-reason) – Jukka Suomela May 31 '18 at 23:11

3 Answers


Instead of referencing your Git repo itself for your backup program to process (with the date issue), you could have:

  • a task which does a git bundle of your repo (that generates only one file)
  • your backup program would back up only that one file.

That way, you bypass entirely the modification date issue for those pack files.

You can either keep a single full bundle of the repo, or make incremental bundles.
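
To make the full vs. incremental distinction concrete, here is a minimal sketch of how such a task could drive `git bundle`. The function names, the `last-bundled` marker tag and the `master` branch name are assumptions made for illustration; they are not taken from the save_bundles script mentioned in the comments.

```python
import subprocess


def bundle_commands(bundle_path, branch="master", marker_tag="last-bundled",
                    incremental=False):
    """Return the git command lines (argv lists) for one bundle run.

    `marker_tag` is a hypothetical tag name used to remember where the
    previous bundle stopped.
    """
    if incremental:
        # Only commits reachable from `branch` but not from the marker tag.
        create = ["git", "bundle", "create", bundle_path,
                  f"{marker_tag}..{branch}"]
    else:
        # Everything reachable from any ref.
        create = ["git", "bundle", "create", bundle_path, "--all"]
    # Move the marker so that the next incremental run starts from here.
    move_marker = ["git", "tag", "-f", marker_tag, branch]
    return [create, move_marker]


def run_bundle(repo_dir, bundle_path, **options):
    """Run the commands inside `repo_dir`, stopping at the first failure."""
    for argv in bundle_commands(bundle_path, **options):
        subprocess.run(argv, cwd=repo_dir, check=True)
```

A first run would be `run_bundle(repo, "full.bundle")`, and later runs `run_bundle(repo, "inc-1.bundle", incremental=True)`. Note that git refuses to create an empty bundle, so an incremental run with no new commits fails and leaves the marker where it was.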

VonC
  • Thanks, but unfortunately that is not feasible. Using a git bundle, I would back up the whole repository every single time, completely removing the point of differential backups. At least now, only the changed files are saved rather than all of them. Incremental backups are difficult to use properly. Plus, they wouldn't really help at all - every time the date-modification issue occurred, I would have a huge backup again. – pepak Dec 21 '17 at 06:06
  • @pepak no: you can do incremental bundles: and they won't be affected by the pack files date change. – VonC Dec 21 '17 at 06:19
  • @pepak For incremental bundles, see my answers https://stackoverflow.com/a/24287531/6309 and https://stackoverflow.com/a/23712022/6309, pointing to my old save_bundles script: https://github.com/VonC/compileEverything/blob/1b01af253eb938efe8f04eb44f9e8af0d9633baa/sbin/save_bundles – VonC Dec 21 '17 at 06:59
  • I'll have to think a bit about this. Seems that for my use case (multiple repositories in many directories) it's a more complicated solution than writing a simple tool which would rewrite the timestamp back to the original value, but I will definitely give it a try. Thanks for the suggestion. – pepak Dec 21 '17 at 09:41
  • @pepak Yes, bundles are best: backing up one file is easier than backing up a folder. And a bundle can be incremental. – VonC Dec 21 '17 at 09:42

In the end it turns out that Edward Thomson's answer explains why no "real" solution is possible. However, to meet my needs, I wrote a simple Windows command-line application which scans through a tree of directories, locates possible Git repositories, locates their packfiles and changes the date/time of each .pack file to that of the respective .idx file. So far it seems to run OK; I have not encountered any garbage collection issues yet, anyway. I have not released the tool yet, because I rather suspect no one else cares, but if someone is interested, I can upload it somewhere.

Apparently, someone is interested. So the program is released as of now. Not on GitHub, but still as open source, under the 3-clause BSD license. Download the binaries here: https://www.pepak.net/files/git/gitpacksync-0.01.zip and the source code here: https://www.pepak.net/files/git/gitpacksync-0.01-source.zip
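
For readers who only want to see the idea, the approach described above can be sketched in a few lines of Python. This is not the gitpacksync source, just an illustration under the assumption that every `pack-*.pack` file sits next to a `pack-*.idx` file in an `objects/pack` directory; the function name is made up.

```python
import os


def sync_pack_mtimes(root):
    """Set each .pack file's mtime to that of its sibling .idx file.

    Walks `root` recursively and returns the list of pack files whose
    timestamp was changed.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Only look inside Git pack directories: .../objects/pack
        parent, leaf = os.path.split(dirpath)
        if leaf != "pack" or os.path.basename(parent) != "objects":
            continue
        for name in filenames:
            if not name.endswith(".pack"):
                continue
            pack = os.path.join(dirpath, name)
            idx = pack[:-len(".pack")] + ".idx"
            if not os.path.isfile(idx):
                continue  # no index to copy the timestamp from
            idx_stat = os.stat(idx)
            pack_stat = os.stat(pack)
            if pack_stat.st_mtime_ns != idx_stat.st_mtime_ns:
                # Keep the access time, replace only the modification time.
                os.utime(pack, ns=(pack_stat.st_atime_ns, idx_stat.st_mtime_ns))
                changed.append(pack)
    return changed
```

Running `sync_pack_mtimes(r"C:\projects")` before the backup job would reset the timestamps that Git refreshed during the day. As the other answer explains, this undoes a safety mechanism, so it is best not run while a git gc or git prune may be in progress.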

pepak

If you disable this, you risk subtle bugs where objects that are still in use disappear from your repository.

You had trouble pinpointing the exact operation because every operation that adds files will do it.

This is very much intentional - Git refreshes the timestamps of objects in the database (updating the timestamp on either loose objects or packfiles) to know when an object was last written. Whenever you create a new commit, it will update the timestamp on all the files that contain objects that were referenced.

This is important as it helps the tools that remove data (like prune) avoid race conditions: an object may be dereferenced and then re-referenced. Prune also looks at the timestamp, so touching the file keeps it from being eligible for garbage collection.
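
As a rough mental model of that interplay (a simplified sketch, not Git's actual implementation), refreshing a file and checking whether it is old enough to be pruned could look like this. The function names are invented for illustration; the two-week grace period mirrors Git's default `gc.pruneExpire` setting.

```python
import os
import time

TWO_WEEKS = 14 * 24 * 60 * 60  # seconds; mirrors the default gc.pruneExpire


def freshen(path):
    """Mark a file as recently used by setting its mtime to now."""
    os.utime(path, None)


def is_old_enough_to_prune(path, grace_seconds=TWO_WEEKS, now=None):
    """True if the file's mtime is older than the grace period.

    Age is only one precondition: a real prune also requires the object
    to be unreachable.
    """
    now = time.time() if now is None else now
    return os.stat(path).st_mtime < now - grace_seconds
```

In this model, a commit that reuses an existing object calls `freshen` on the file that contains it, so a prune running at the same moment sees a recent timestamp and leaves the file alone.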

Edward Thomson
  • Sounds reasonable with standalone files. I am not so sure about the pack files - why would a garbage collector ever delete a pack file, except when used with git gc --prune-all? But that's beside the point - now I at least know what's causing the issue. The question is, however, can I do anything at all to stop it? Even if it means possible race condition errors (the risk of these is minimal, I think, while the annoyance of having daily multi-megabyte transfers of backups is common). – pepak Dec 21 '17 at 09:39
  • For packfiles, all objects would be repacked, into a new packfile, omitting unused objects and the original packfiles would be removed. This is done for performance (one packfile is much faster than multiple packfiles) as well as garbage collection. – Edward Thomson Dec 21 '17 at 09:51