I want to do some work on a subset of the FreeBSD repository. Problem: this repository is very large; git clone pulls down close to 2 GB. I only need a tiny fraction of that for what I want to do; currently, about 140 KB.

I want to be able to pull changes from upstream (I'd really rather not have to apply patches), but I estimate the chance I'll need to push back at 0%.

It seems like every path I turn down is a dead end:

  • If I clone the upstream repo with --depth 1, I can't push it to GitHub ("shallow update not allowed"; Git 2.7.4 on Ubuntu 16.04).
  • Even if I git rm everything I don't want (leaving just that 140KB in the working directory) and then clone --single-branch, it pulls down 1.5 GB. I wondered if maybe just the packs are awful and there's a lot of "false sharing", but I tried to repack (-a -d -f --depth=250 --window=250, per some random command I saw) and it is still ~880 MB after. Same if I clone that again. (The commands are sketched after this list.)
  • I tried git gc, and that just made things far, far worse (6.6 GB).
  • I could filter-branch away the unneeded stuff, but it seems I won't be able to pull afterwards if I do that.
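
For reference, a sketch of those dead-end attempts as commands (the repo is the FreeBSD GitHub mirror; sizes are what I observed):

    # shallow clone: small, but GitHub rejects pushes from it ("shallow update not allowed")
    git clone --depth 1 https://github.com/freebsd/freebsd.git

    # single-branch clone: still pulls ~1.5 GB of history
    git clone --single-branch https://github.com/freebsd/freebsd.git

    # aggressive repack (flags per that random command); still ~880 MB afterwards
    git repack -a -d -f --depth=250 --window=250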

Is there some workflow that will work here, or should I just sever the connection to upstream, filter-branch everything away, and then just pull in patches as there are new commits to upstream? Should I forget about the FreeBSD Github mirror and use git-svn somehow to make the repo? (Eventually, everything I want won't be contained in a single directory; i.e., I'll want foo/bar and foo/baz but not foo/qux.)

(And what'd be the best way to get and apply those patches?)

EvanED

2 Answers

Even if I git rm everything I don't want (leaving just that 140KB in the working directory) and then clone --single-branch, it pulls down 1.5 GB

Yes, Git will download (fetch) the whole repo anyway, but only on the first fetch.
That should not prevent you from pushing back, though: if your commits are limited in scope (they modify only a few files), the push should proceed without issue.

What you can do to limit the working tree locally is a sparse checkout (it still requires a complete fetch at first, but it won't check out everything).
You can see an example of a sparse clone in "git clone is not cloning recent version of a certain repository?"
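
To make that concrete, here is a minimal sketch for a Git of that era (using core.sparseCheckout; the newer git sparse-checkout command does not exist yet), with the foo/bar and foo/baz paths from the question as placeholders:

    # clone without populating the working tree (the full history is still fetched)
    git clone --no-checkout https://github.com/freebsd/freebsd.git
    cd freebsd

    # enable sparse checkout and list the paths you actually want
    git config core.sparseCheckout true
    echo "foo/bar/" >> .git/info/sparse-checkout
    echo "foo/baz/" >> .git/info/sparse-checkout

    # populate the working tree according to those patterns
    git read-tree -mu HEAD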

VonC
  • Sure, pushing back works fine, and I can avoid the working copy size with sparse checkouts, as you say. But I'd like to avoid the multi-GB download in the first place. (To be more specific: a sparse *checkout* I think doesn't help me at all; I'm going to have either a branch or a whole repo fork that only has the subset I'm interested in at all. So as long as I'm on that branch, which I will be all the time I'm doing active work on it, I can have a non-sparse checkout.) – EvanED Jun 23 '17 at 15:28
  • @EvanED Yes, that's my point: if you can put up with the multi-GB download *once* (the first fetch), that's ideal. I don't know of a way to do a partial fetch (besides a shallow clone). I still use sparse checkout (for instance https://stackoverflow.com/a/2467629/6309) – VonC Jun 23 '17 at 17:18
  • @EvanED "I'm going to have either a branch or a whole repo fork that only has the subset I'm interested in at all": then you need to split that repo into two repos (one being a submodule of the other): https://stackoverflow.com/a/16728814/6309 – VonC Jun 23 '17 at 17:19

There are signs that future versions of Git will support this: the patches have already been accepted. Search for OPT_PARSE_LIST_OBJECTS_FILTER or "add object filtering for partial fetch".
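
Assuming those patches ship as accepted, the new --filter option should allow something like the following sketch (the server has to support it as well):

    # fetch commits and trees, but omit file contents (blobs) until they are needed
    git clone --filter=blob:none https://github.com/freebsd/freebsd.git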

basin