
I have a git project with 200+ individual remotes, all of which I fetched to my local machine. Yes, maybe this wasn't the right thing to do. I'm honestly not sure. It was the recommended way of getting the data, but it's not worked out as well as I'd hoped. What I want to do is to split the repository into separate local repositories, each a clone of one of the remotes in the "big" repo. But I don't want to fetch the remotes again - the initial fetch took about 12 hours and I'd rather avoid spending the time (and bandwidth) again.

The repository I currently have downloaded has about 230 remotes, named `pypi-mirror-N.git` for N from 1 to about 230. There's no working tree in the repository; all of the data is only in the pack files, and is accessed using commands like `git rev-list` or `git cat-file`. Each remote in the repo has two branches: `remotes/pypi-mirror-N.git/code` and `remotes/pypi-mirror-N.git/main`. The repo itself has no commits on the main branch (it was created via `git init`, a series of `git remote add` and then a `git fetch`).
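
For reference, the setup was roughly along these lines (the URL here is a placeholder, not the real location of the mirrors):

```
git init pypi-big-repo
cd pypi-big-repo
for N in $(seq 1 230); do
    # remote names include the ".git" suffix, which is why the ref names below do too
    git remote add "pypi-mirror-$N.git" "https://example.org/pypi-mirror-$N.git"
done
# this is the step that took around 12 hours
git fetch --all
```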

A repository this size is, however, unusably slow for me (on a relatively high-end Windows PC): `git rev-list --objects --all | wc -l` takes 30 minutes or more.

An alternative approach would be to `git clone` each repository individually. Doing so would mean I'd need to keep track manually of which repository each individual file was in, but that's an acceptable trade-off for good performance. Individual repos have the same structure (no working tree, the data just in the packs) with 4 entries in the `git branch -a` output: `main`, `remotes/origin/HEAD -> origin/main`, `remotes/origin/code`, and `remotes/origin/main`.
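
To make that concrete, the one repo I cloned manually looks something like this (the URL is a placeholder, and `--no-checkout` is just one way of getting the no-working-tree layout):

```
git clone --no-checkout https://example.org/pypi-mirror-42.git
cd pypi-mirror-42
git branch -a
#   main
#   remotes/origin/HEAD -> origin/main
#   remotes/origin/code
#   remotes/origin/main
```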

Rather than cloning all of the repos again, I'd like to save on bandwidth and time by copying the relevant data from my current "all in one" repo into individual repos created locally. I'm assuming I need to copy `remotes/pypi-mirror-N.git/code` in the big repo to `remotes/origin/code`, and `remotes/pypi-mirror-N.git/main` to all of `main`, `origin/main`, and `remotes/origin/main`. But it's not clear to me how to do that.
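
The closest I've come to an idea is creating an empty repo and fetching from the big repo's path on disk with explicit refspecs, something along these lines (mirror 42 is just an example and the paths are placeholders; I don't know whether this is correct or sensible):

```
git init pypi-mirror-42
cd pypi-mirror-42
git remote add origin /path/to/big-repo
# copy just this mirror's refs, renaming them into the usual origin/* layout
git fetch origin \
    "refs/remotes/pypi-mirror-42.git/code:refs/remotes/origin/code" \
    "refs/remotes/pypi-mirror-42.git/main:refs/remotes/origin/main"
# create the local main branch and point origin's HEAD at it
git branch main refs/remotes/origin/main
git remote set-head origin main
```

Even if that works, I don't know whether a fetch over a local path shares pack files with the source repo or copies the objects, which matters for the disk space concern below.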

Is this possible? Ideally, I'd prefer it if I could avoid having 2 copies of the data locally as that would use a significant chunk of my disk.

Paul Moore
  • You say you want to split them but also not have duplicate repositories lying around? There are two ways: you might get away with keeping a single repo and simply having [200+ worktrees](https://git-scm.com/docs/git-worktree) (then only the working copies are redundant, not the whole repository) **or** have 200+ repositories, but use [alternates](https://stackoverflow.com/questions/36123655/what-is-the-git-alternates-mechanism) to keep the duplication in check. Honestly worktrees are probably the more straightforward, easy to understand approach. – Joachim Sauer Sep 01 '23 at 15:25
  • You've got the histories locally, just fetch them from there. – jthill Sep 01 '23 at 15:46
  • @jthill that's what I don't know how to do. The repositories are bare - they have no working tree, just a .git directory, because the project is relying on git compression. The uncompressed data is 55TB in size, the compressed files under .git are only 350GB. I don't have disk space for an uncompressed tree, I just get what I want on demand using `git cat-file`. Sorry if I'm being unclear, I don't know the terms to describe what I want any better :-( – Paul Moore Sep 01 '23 at 16:23
  • Thing is, reading your question, that `git branch -a` result looks like it's for just one mirror, as if you did all the distinct clones already. What exact problem are you trying to solve with this reorg? "It's not working out as well as I hoped" isn't much good as a guide to people trying to help. – jthill Sep 01 '23 at 17:32
  • A single repo with 220 remotes and some hundreds of thousands of objects is too slow to be usable, basically. Splitting it into 220 individual repos makes operations (fetches, queries, etc) run in usable amounts of time, at the cost of having to know which individual repo to look in - a trade-off that is reasonable for my use case. The `git branch -a` *is* for one repo, which I cloned manually. What I want to know is how to replicate that structure using the data I've already downloaded, rather than re-downloading. – Paul Moore Sep 01 '23 at 20:56
  • I've updated the question to be clearer (I hope!). My original wording was way too terse, and unclear as a result. Sorry. – Paul Moore Sep 01 '23 at 21:10

0 Answers