We are migrating from Azure DevOps Git to GitHub. The repo is huge and old, unfortunately contains binaries, and has tons of branches and tags. We decided on a cut-off date and want to drop all history before that date (which will also remove the binaries and large files, as they were later deleted). We want to retain only specific branches from the selected date onward and, hopefully, keep the tags.

I got completely lost with filter-branch and haven't been able to find a good, fast way of doing this. The simplest thing I found was doing an orphan checkout from what we want as the new root commit, rebasing, and then pruning and running the garbage collector. But the new root commit is dated to now, all commit IDs change, we lose all the tags, and I couldn't do it for all the branches I want to retain.

What is the best way of achieving this?

Mickey Cohen
  • You can set the date for the root commit by using environment variables when you commit (and do not expect the commit IDs to remain the same). – eftshift0 Jun 20 '23 at 08:31
  • Also, upstream recommends to use `git filter-repo` instead of `filter-branch`. https://github.com/newren/git-filter-repo – eftshift0 Jun 20 '23 at 08:32
  • @eftshift0 Thanks, but how do I apply on multiple branches? – Mickey Cohen Jun 20 '23 at 08:37
  • "We decided on a cut-off date and want to drop all history before that date (which will also remove the binaries and large files as they were later deleted)". My gut feeling is you chose to have a cutoff date mainly because you wanted to strip out the large junk files. Note with `git-filter-repo` it's very easy to strip out the large files without having a cutoff date at all. Perhaps knowing that might change whether you truly wish to throw away old (potentially useful) history now that you know you can still accomplish your goal while retaining it. – TTT Jun 20 '23 at 20:22
  • Side Note: I used `git-filter-branch` in the past on a large repo and it took 2 days. Years later I used `git-filter-repo` on the same even larger repo, and it took 5 minutes. (I've never used `git-filter-branch` since...) – TTT Jun 20 '23 at 20:27
  • @TTT can you help with an example ? – Mickey Cohen Jun 21 '23 at 17:50
  • [This answer](https://stackoverflow.com/a/74309661/184546) may help you. – TTT Jun 21 '23 at 18:11
  • If it were me I'd first try to just delete the big files without truncating history. But if you decide to go with your original idea and can identify a commit you wish to start from, then [this answer](https://stackoverflow.com/a/74001898/184546) may also help. – TTT Jun 21 '23 at 18:14
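
The first comment's suggestion about environment variables can be sketched as follows. This is only an illustration in a throwaway repository: the branch name `new-root`, the file names, and the fixed 2020 date are made up for the demo. It assumes that `git commit -C` carries over the author date, so that only the committer date needs to be supplied via `GIT_COMMITTER_DATE`.

```shell
#!/bin/sh
set -eu

# Throwaway repository with three commits dated in 2020 (demo data only).
demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name demo
git config user.email demo@example.com
old="2020-01-01 12:00:00 +0000"
for f in a b c; do
    echo "$f" > "$f.txt"
    git add "$f.txt"
    GIT_AUTHOR_DATE="$old" GIT_COMMITTER_DATE="$old" git commit -q -m "add $f"
done

orig=$(git rev-parse HEAD~1)                              # commit to start from
orig_date=$(git show -s --format=%cd --date=raw "$orig")  # "<epoch> <tz>"

# Orphan branch holding the tree of $orig; -C reuses message, author and
# author date, and the environment variable pins the committer date.
git checkout -q --orphan new-root "$orig"
GIT_COMMITTER_DATE="$orig_date" git commit -q -C "$orig"

new_date=$(git show -s --format=%cd --date=raw HEAD)
echo "original: $orig_date, new root: $new_date"
```

The commit ID of the new root still differs from the original one, as the comment points out.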

1 Answer

The trick is to use grafts to fake new root commits, then burn them into the history using git filter-branch or git filter-repo.

Let's say you determined that commit 1234abcd is the new root commit and that it is the only one needed. Then

git replace --graft 1234abcd

installs a replacement commit that pretends that 1234abcd has no parents. Now run

git filter-branch --tag-name-filter cat -- master branch1 branch2 tag1 tag2 ...

(or an equivalent git filter-repo command). This rewrites commit 1234abcd to really have no parent (and results in a different commit name, of course) and rewrites the history up to the specified refs.
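
To see the whole sequence end to end, here is a minimal sketch that can be run in a scratch directory. Everything about the demo repository (the branch `main`, the tag `v1`, the three commits) is invented for illustration; only the `git replace --graft` and `git filter-branch` steps correspond to the description above. `FILTER_BRANCH_SQUELCH_WARNING` is set merely to skip the warning pause of recent Git versions.

```shell
#!/bin/sh
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1

# Throwaway repository with three commits (a, b, c) and a tag on the last one.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name demo
git config user.email demo@example.com
for f in a b c; do
    echo "$f" > "$f.txt"
    git add "$f.txt"
    git commit -q -m "add $f"
done
git tag v1

new_root=$(git rev-parse HEAD~1)   # "add b" is to become the new root
git replace --graft "$new_root"    # pretend it has no parents

# Burn the graft into real history; name every ref that should be rewritten.
git filter-branch --tag-name-filter cat -- main v1 >/dev/null 2>&1

git replace -d "$new_root" >/dev/null   # the replacement is no longer needed

commit_count=$(git rev-list --count main)
root_subject=$(git log --format=%s --max-parents=0 main)
echo "commits on main: $commit_count, root commit: $root_subject"
```

Note that `git filter-branch` keeps the pre-rewrite refs under `refs/original/`, so the old objects only disappear after those refs are deleted and the repository is garbage-collected (or after pushing the rewritten refs to a fresh repository).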

You should be able to repeat the command with different branch and tag names, should you forget some or if you want to do the job incrementally. Make sure to specify only refs whose history does not bypass the new root commit (this could happen accidentally if there are merges from history before the new root commit into history after it).
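
One way to check for refs that bypass the new root before starting the long rewrite is to list the root commits each ref can reach while the graft is in place. The sketch below assumes a single new root and uses a made-up demo repository in which the branch `old` forks off before the cut; the branch and file names are hypothetical.

```shell
#!/bin/sh
set -eu

demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name demo
git config user.email demo@example.com

echo a > a.txt; git add a.txt; git commit -q -m "add a"
git branch old                     # forks off before the intended cut
for f in b c; do
    echo "$f" > "$f.txt"
    git add "$f.txt"
    git commit -q -m "add $f"
done

new_root=$(git rev-parse HEAD~1)   # "add b"
git replace --graft "$new_root"

# With the graft installed, a ref is safe to rewrite only if the sole root
# commit reachable from it is the new root.
bypassing=""
for ref in main old; do
    roots=$(git rev-list --max-parents=0 "$ref")
    if [ "$roots" != "$new_root" ]; then
        bypassing="$bypassing $ref"
    fi
done
echo "refs that bypass the new root:$bypassing"
```

Any ref reported here would drag the old history along if it were passed to `git filter-branch`.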

j6t
  • It seems it's a bit more complicated. It is not the same commit in all the branches I want to retain. Is there a way to filter by date? I can iterate but I want to avoid running filter-branch more than once because it takes over 18 hours... – Mickey Cohen Jun 20 '23 at 10:05
  • If it's not the same commit on all branches, it means that you will have several disconnected histories, right? Then you can run one `git filter-branch` for every disconnected history. How to find the commits by date warrants a new question. – j6t Jun 20 '23 at 11:00
  • I can find the commits, just wanted to avoid iterating through the branches and running filter-branch one, hopefully by date. I understand this is not possible. My problem is the long time it takes but I guess it can't be avoided... – Mickey Cohen Jun 20 '23 at 11:06
  • It's absolutely possible to run `git filter-branch` only once, provided you know the complete list of branches and tags that must be rewritten. – j6t Jun 20 '23 at 11:15
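
A rough sketch of the single-run variant discussed in these comments: pick the oldest commit at or after the cut-off on each branch, graft every one of them, then rewrite all branches in one pass. It assumes each branch is linear around the cut-off (with merges there can be several boundary commits per branch, which this does not handle). The branch names, dates, and the `commit_at` helper are invented for the demo.

```shell
#!/bin/sh
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1

demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name demo
git config user.email demo@example.com

commit_at() {   # commit_at <date> <file>: demo helper, one dated commit
    echo "$2" > "$2"
    git add "$2"
    GIT_AUTHOR_DATE="$1" GIT_COMMITTER_DATE="$1" git commit -q -m "add $2"
}

commit_at "2020-01-01 12:00:00 +0000" m1
git branch feature
commit_at "2021-06-01 12:00:00 +0000" m2
commit_at "2022-01-01 12:00:00 +0000" m3
git checkout -q feature
commit_at "2021-03-01 12:00:00 +0000" f1
commit_at "2022-02-01 12:00:00 +0000" f2
git checkout -q main

cutoff="2021-01-01 00:00:00 +0000"
branches="main feature"

# Oldest commit at or after the cut-off on each branch, de-duplicated.
roots=$(for b in $branches; do
    git rev-list --since="$cutoff" "$b" | tail -n 1
done | sort -u)

for r in $roots; do git replace --graft "$r"; done

# One rewrite for all branches (add the tags to keep to this list as well).
git filter-branch --tag-name-filter cat -- $branches >/dev/null 2>&1

for r in $roots; do git replace -d "$r" >/dev/null; done

main_count=$(git rev-list --count main)
feature_count=$(git rev-list --count feature)
echo "main: $main_count commits, feature: $feature_count commits"
```

In this demo the two branches end up as disconnected histories, because their only shared commit lies before the cut-off.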