
I have a Git repository containing 11 different, independent projects (don't ask me why the **** they are all in one repository). Because some of the projects contain many assets, GitLab reports the repo size as about 14.3 GB, which causes huge checkout times (up to 20 minutes on our CI/CD system).

Because we only build one of the projects at a time, I want to split the projects into separate repositories. And because Project A does not need the commits that touch Project B's files, I want to clean up the whole history as well.

I already tried different ways:

  1. Deleting the files. The files are gone, but still available via history.
  2. Using a simple git filter-branch --prune-empty, but I want to keep the file structure.
  3. Using git filter-branch --index-filter --prune-empty with git rm --cached --ignore-unmatch, but I can still recover old files.
  4. Deleting the files and using Git BFG with --delete-folders. Great result, but I can only provide a glob/regex, and some projects contain folders named after other projects (bad naming...), which get wiped out as well.

Ideally there would be a tool or command that works like BFG but lets me specify the paths to delete or, better, the paths to keep.

Example of the file structure:

./
+- Project A/
+- Project B/
+- UI Projects/
|  +- Foo/
|  +- Bar/
+- Project E/
|  +- Foo/
|     +- Bar/
+- Build
   +- build_a/
   +- build_b/
   +- build_foo/
   +- build_bar/
   +- build_e/

My requirements are:

  • the file structure is preserved
  • multiple paths can be kept (e.g. ./Project A/ and ./Build/build_a/ for Repo A)
  • the history of files that are no longer part of the new repo is wiped out

Any suggestions?

D. Weber
  • Does this answer your question? [Detach (move) subdirectory into separate Git repository](https://stackoverflow.com/questions/359424/detach-move-subdirectory-into-separate-git-repository) – krisz Apr 06 '20 at 15:18
  • @krisz thanks for your response, but unfortunately it was not that helpful. I added a list of my requirements to make the question clearer. – D. Weber Apr 06 '20 at 16:38

2 Answers


The following tree-filter satisfies your requirements:

find . ./Build -maxdepth 1 -path . -o -path ./Build -o -path "./Project A" -o -path ./Build/build_a -o -exec rm -rf {} +

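Replace Project A and build_a with the actual project names. To keep a path inside another nested folder, follow the example of the ./Build folder: add the folder as a start directory and add a -path test both for the folder and for the path to keep.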

Pass it to the --tree-filter option of filter-branch:

git filter-branch --tree-filter '...' --tag-name-filter cat --prune-empty -- --all
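
Put together, the two commands above might look like the following sketch. The function name split_out_project_a and the variable keep_filter are made up for illustration, and Project A and build_a are the example names from the question. It assumes a fresh clone, because every branch and tag is rewritten.

```shell
# Sketch only: run this in a fresh clone, since it rewrites every branch and tag.
# "Project A" and build_a are the example names from the question.
split_out_project_a() {
    # filter-branch evaluates this string once per commit, inside a temporary
    # checkout of that commit. Every entry directly under . and ./Build is
    # deleted unless it matches one of the -path tests.
    # Caveat: find exits non-zero (which aborts the rewrite) on a commit that
    # has no ./Build directory.
    keep_filter='find . ./Build -maxdepth 1 -path . -o -path ./Build -o -path "./Project A" -o -path ./Build/build_a -o -exec rm -rf {} +'

    git filter-branch --tree-filter "$keep_filter" \
        --tag-name-filter cat --prune-empty -- --all
}
```

Since a tree-filter checks out every commit, expect this to be slow on a repository of this size.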
krisz

Well... you're kind of missing a bigger piece of the problem here, but I'll come back to that. To address your question as asked:

Of the options you've tried, filter-branch is the one that should have worked. (Be advised that git has a new tool, filter-repo, that they recommend over filter-branch; but I haven't taken the time to switch over, and it sounds like you have a nearly-working filter-branch procedure anyway, so I'll address the answer using filter-branch...)
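
For reference, one common way to keep several paths with an index-filter is to empty the index and then restore only the wanted paths from the commit being rewritten. This is a sketch of that pattern, not necessarily the exact procedure you ran; the function name keep_paths_index_filter is made up, and the two paths are the example names from the question.

```shell
# Sketch only: keeps two example paths from the question and drops the rest.
keep_paths_index_filter() {
    # Per commit: empty the index, then restore only the wanted paths from the
    # commit being rewritten ($GIT_COMMIT is set by filter-branch). Nothing is
    # checked out to disk, so this avoids the per-commit checkout that a
    # tree-filter needs.
    index_filter='git rm -r -q --cached --ignore-unmatch -- . &&
        git reset -q "$GIT_COMMIT" -- "Project A" Build/build_a'

    git filter-branch --index-filter "$index_filter" \
        --tag-name-filter cat --prune-empty -- --all
}
```

The directory layout is preserved because the kept paths are restored at their original locations.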

So, you say you could still recover the deleted files after using filter-branch with index-filter. There are several possible reasons for that, but generally the point is that git tries to avoid losing data unless it's really sure you no longer want it. So:

  • filter-branch creates a set of "backup refs" whenever it rewrites a repo's refs. Those backup refs can still reach the old history.
  • the reflogs for your branches provide a way to go back to where those branches previously pointed; those historical reflog entries can still reach the old history

The easiest way to do away with all of that is to reclone from the repo where you did the clean-up. If you really want to clean it up in place, you need to (1) delete the backup refs under the refs/original/ namespace; (2) expire or delete the reflogs (I've always had trouble getting git to expire them, but if all else fails, rm -r .git/logs); (3) run gc. For this type of operation I use gc --force --aggressive --prune=now.
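
The three in-place steps could be scripted roughly like this. It is a sketch that assumes it runs inside the repository filter-branch just rewrote; the function name purge_old_history is made up for illustration.

```shell
# Sketch only: run inside the repository that filter-branch just rewrote.
purge_old_history() {
    # (1) Delete the backup refs that filter-branch left under refs/original/.
    git for-each-ref --format='delete %(refname)' refs/original |
        git update-ref --stdin

    # (2) Expire every reflog entry immediately.
    git reflog expire --expire=now --all

    # (3) Repack and drop objects that are no longer reachable.
    git gc --force --aggressive --prune=now
}
```

If you reclone instead, use a file:// URL for a repository on the same machine (for example `git clone file:///path/to/rewritten-repo clean-repo`, where both names are placeholders). A clone from a plain local path copies or hardlinks the object store as it is, unreachable objects included.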

Now... the bigger problem is, if the histories of 11 projects combined are 14.3 GB, then the history of each project is (on average) over 1 GB, and that's still ridiculous. You have a deeper problem. Splitting the repos is, IMO, a good idea (I'm not a fan of the "monorepo" trend); but you should also be trying to reduce the overall size of the repo.

Most likely you have large binary files under source control. Very rarely is that advisable. If you do need to do it, you should use a tool like git lfs to keep the core repo small and manageable. But if you're just storing build artifacts, or dependencies, or something like that, you would be better served by an artifact repository (Artifactory, Nexus, ...). This may require improved build tooling to manage dependency versions.
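
To check whether large binaries really are what inflates the history, one option is to list the biggest blobs reachable from any ref. A rough sketch; the function name largest_blobs is made up for this example.

```shell
# Sketch only: print the N largest blobs reachable from any ref, biggest first.
# Output columns: size in bytes, object id, path.
largest_blobs() {
    count=${1:-10}
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        awk '$1 == "blob" { sub(/^blob /, ""); print }' |
        sort -n -r |
        head -n "$count"
}
```

Running `largest_blobs 20` in the monorepo before splitting should show which paths are candidates for git lfs or an artifact repository.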

Mark Adelsberger
  • My problem with `filter-branch` is that I cannot choose multiple paths to keep. I updated my question with a list of my requirements to make it clearer. I'll try to clear the possibly unwanted references and also take a look at `filter-repo`, thanks. And yes, you're absolutely right that the next step has to be reducing the size of each repo. – D. Weber Apr 06 '20 at 16:48
  • @D.Weber I've read your updated question, as well as this comment, and I'm not understanding the problem. Using an `index-filter` you can keep exactly the files you want, regardless of paths. Remember you can give filter-branch multiple pathspecs, and each pathspec can be either a pattern to include or exclude – Mark Adelsberger Apr 06 '20 at 18:50
  • nvm, I forgot `index-filter`, sorry. I will try it out asap :) – D. Weber Apr 07 '20 at 05:04