
I'm trying to improve the performance of a git repository that I use almost exclusively myself to version a scientific computing project. The project's simulation software writes tiny (under 100 KB) plaintext files into fairly deep directories, each representing a separate, relatively economical simulation result. I call them economical because I can create many thousands of them in a short time, which means the problem is only going to get worse. The simulations run in batches, so an individual commit can include several hundred MB of data, all in the form of these deep subtrees populated with tiny text files. The institutional computing cluster I run this on stores all my group's data on a 33 TB RAID6 array of platter drives (if it matters, the array doesn't have much headroom at the moment: about 1.6 TB).

I'm reasonably sure this is bad performance on the RAID6 array's part, because a top-level `git add .` can take tens of minutes there, even if only a few files have changed. Committing is just as bad. Pushing, once things are committed, usually still takes minutes but is a bit faster (and the slow part of the push is not the part where it sends the data over the network). Doing all of this in an interactive session where I've requested extra cores also speeds things up, but it can still take minutes to finish adding new simulation results. When I do the same on my laptop, which has a modern NVMe PCIe SSD, these operations take seconds.

So, any advice? I looked at `git lfs`, but I'm not convinced it would help much, because the pointers it creates would not be a million times smaller than the files they point to, which is the normal use case. If people still think it would help, I can give it a try. Also, if it matters, the cluster's Linux is old (of course), so: `git version 1.8.3.1`...

Happy to add more context if needed.

EDIT: `git count-objects -vH` returns:

count: 1
size: 4.00 KiB
in-pack: 229216
packs: 1
size-pack: 1.25 GiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

P.S. I did add the large-data tag even though my data can comfortably fit on one device's storage medium. I added it because the data has become large/complicated enough to become unwieldy, as the post explains. If people think that's really inappropriate I can remove it.

LGS
  • That's not just old, 5 years old is ancient in Git terms. I've read about improvements made specifically for repositories with many files in releases more recent than that. Consider upgrading the cluster's Git installation. See https://stackoverflow.com/questions/46920246/git-fetch-for-many-files-is-slow-against-a-high-latency-disk – CodeCaster Mar 29 '21 at 17:01
  • That wound is beyond my power to heal. I can request it, but I think I need to move forward under my own steam. I don't think that it is too old to use `git lfs`. Do you think that performance in my situation would be materially better if I had a more recent git? – LGS Mar 29 '21 at 17:03
  • Ah our responses were out of sync. I'll ask my IT if it's possible to upgrade. – LGS Mar 29 '21 at 17:20
  • Yeah, I edited after you responded, and you edited after that. AFAIK, Microsoft made various performance-improving commits to Git over the past few years, because they have repos with tens if not hundreds of thousands of source files (see https://devblogs.microsoft.com/devops/microsofts-performance-contributions-to-git-in-2017/). Your 2015 version is older than those improvements which I believe were among other significant improvements. Granted, the access time of your NVMe disk will help, but which version does that laptop run? Try comparing with 1.8.3.1 on your laptop. – CodeCaster Mar 29 '21 at 17:26
  • Ha, that's not a bad idea. The native git on my laptop is much newer: `git version 2.25.1`. I believe it is the native version on an up-to-date Ubuntu 20.04, which is what my laptop runs. I wonder if I can roll back to an old git without pain... – LGS Mar 29 '21 at 17:29
  • Apart from that, I'm not entirely sure that git repos are backwards compatible, i.e. that a repo created with that 2.25 version is still readable with a 1.8 version. It should be, but I'm not sure that everything keeps working, so make sure to keep a copy (and/or upstream) around. – CodeCaster Mar 29 '21 at 17:38
  • I used conda-forge to locally install git 2.30.2 and tried to run `gc`, which I know has historically taken a long time for this repo. With the new version, I don't see a big improvement. I timed the `gc` with the Unix utility `time`, and the result was: `real 62m11.552s user 231m51.624s sys 0m10.071s` – LGS Mar 29 '21 at 19:04
  • The repository formatting is backwards-compatible. The first GC will be expensive, but subsequent ones (without `--aggressive`) should be faster. – torek Mar 30 '21 at 00:33
  • Yes, as an update, at least pulls and merges seem much faster now. I'm going to 'answer this myself' unless @CodeCaster wants to write what was said in the comments in the answer field instead so that I can close the question. – LGS Mar 30 '21 at 15:33
  • Feel free to self-answer, it was just a hunch! – CodeCaster Mar 30 '21 at 16:04

1 Answer


As @CodeCaster pointed out, the git on my cluster was indeed ancient, and this was at least part of the problem. I'm not totally convinced that the RAID array on my school's cluster isn't just slow somehow, but after updating to a more recent git my pulls, pushes, adds and commits have all become far less painful. They've gone from taking tens of minutes to a handful of seconds (which is more the speed I'm used to).
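For anyone who wants to separate the effect of the git version from the effect of the disk, one option is to time `git add .` on a throwaway tree of small files, once per git version and once per storage location. The following is only a rough sketch: `N_DIRS`, `FILES_PER_DIR` and the `batch/run_N/results/raw` layout are made-up stand-ins for a real simulation tree, and the timing has one-second resolution.

```shell
#!/usr/bin/env bash
# Sketch: build a throwaway tree of many small text files and time `git add .`.
# N_DIRS, FILES_PER_DIR and the directory layout are hypothetical stand-ins.

N_DIRS=20
FILES_PER_DIR=25
workdir="$(mktemp -d)"   # set TMPDIR to place this on the disk you want to test

# Populate deep directories with tiny plaintext files.
i=1
while [ "$i" -le "$N_DIRS" ]; do
    dir="$workdir/batch/run_$i/results/raw"
    mkdir -p "$dir"
    j=1
    while [ "$j" -le "$FILES_PER_DIR" ]; do
        printf 'run %s sample %s\n' "$i" "$j" > "$dir/out_$j.txt"
        j=$((j + 1))
    done
    i=$((i + 1))
done

n_files="$(find "$workdir" -type f -name 'out_*.txt' | wc -l | tr -d ' ')"
echo "created $n_files files under $workdir"

# Time the staging step with whichever git is first on PATH.
if command -v git >/dev/null 2>&1; then
    git --version
    start="$(date +%s)"
    (cd "$workdir" && git init -q && git add .)
    end="$(date +%s)"
    echo "git add . took about $((end - start)) s"
else
    echo "git not found on PATH; skipping the timing step"
fi

# Clean up the throwaway tree.
rm -rf "$workdir"
```

Raising `N_DIRS` and `FILES_PER_DIR` toward the real file counts, and pointing `TMPDIR` at the cluster array versus local disk, should give a fairer comparison than the small defaults here.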

For what it's worth, this SO answer is what convinced me to try to upgrade git (again, thanks @CodeCaster). As @torek has pointed out, the repos are backwards compatible, so there have been no issues handling my repo that was being handled by a git from 2015 with a git from this year.

If anyone reading this concludes that it would be annoying to pursue this solution because they don't have root on their shared infrastructure, my approach was to use conda to install a different git in the conda environment I was working in anyway. As of this post, `conda install -c conda-forge git` in a clean miniconda3 env will get you git 2.30.2, which is plenty current. The most recent performance update mentioned in the other SO post is in version 2.24. I suppose there are other routes to a local git installation, but in a scientific computing environment, where a user can usually get a local conda without too much trouble, this seemed like the easiest path to a newer version.
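To confirm which git is actually being picked up after the install, a small version check along these lines may help. This is a sketch, not part of the original setup: `MIN_GIT_VERSION` and `version_ge` are names invented here, and the comparison assumes a `sort` that supports `-V` (GNU coreutils does).

```shell
#!/usr/bin/env bash
# Sketch: check whether the git on PATH is at least a given version.
# MIN_GIT_VERSION and version_ge are hypothetical names for illustration.

MIN_GIT_VERSION="2.24"   # version cited in the answer for the latest perf update

# Succeeds (exit 0) when version $1 >= version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# To install a newer git into the active conda env (command from the answer):
#   conda install -c conda-forge git

if command -v git >/dev/null 2>&1; then
    # "git version 2.30.2" -> "2.30.2"
    current="$(git --version | awk '{print $3}')"
    if version_ge "$current" "$MIN_GIT_VERSION"; then
        echo "git $current is new enough (>= $MIN_GIT_VERSION)"
    else
        echo "git $current is older than $MIN_GIT_VERSION; consider the conda install"
    fi
else
    echo "git not found on PATH"
fi
```

If the check still reports the old system version, the conda environment is probably not activated, or its `bin` directory comes after the system one on `PATH`.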

LGS