Clean git branch tree

Question

My commit tree has become disorganized and hard to read, so I feel I need to clean it up. It'd be awesome if you could help me fix it.

I was working on project A which at some point was extended with two branches B and C. After a few commits in each branch I see the following when I run git log on each of the branches. I have summarized the logs; let me know if they're not readable.

master:
commit: B2 (HEAD -> master, origin/master, origin/HEAD)
commit: B1
commit: A3
...

branch_B:
commit: B2 (HEAD -> branch_B)
commit: C3 
commit: C2 (branch_C)
commit: C1
commit: B1
commit: A3
...

branch_C:
commit: C2 (HEAD -> branch_C)
commit: C1
commit: B1
commit: A3
...

EDIT: Specifically, I want to remove B1 and B2 from master, remove B1 from branch_C, remove C1, C2, and C3 from branch_B and move C3 back to branch_C.

So what exactly is the question here? Do you want to delete branches (`git branch -d branch-name`), squash commit history, or something else? "Clean it" is a bit vague. — Striezel, Feb 02 '19 at 21:51
Thanks Striezel. I added specifics of my question at the end. — soroosh.strife, Feb 02 '19 at 21:58

score 0 · Answer 1 · answered Feb 02 '19 at 23:38

Git is built to, and hence "wants to" in some sense, add new commits but never remove any old ones. It is possible to remove commits, but:

Be very sure you really want to do this!
Be aware that commits are like certain diseases: if you've had commit X in your repository, and ~~exchanged fluids~~ had interactions with another Git repository that's a clone of the same source, or a clone of yours or you're a clone of theirs, they probably have commit X too now. The next time you connect your Git to their Git, you're likely to get commit X back again. To make some commit really go away, you must ~~cure~~ remove the problem from all affected / infected Git repositories. Since, in general, you only control your own Git repository, that means you must get everyone else to fix their repositories too.

With that out of the way, here's how you do it, using git cherry-pick and git reset. There is more than one way to do it but let's go with these two commands here.

Git's thing is the commits; the names are secondary

As you've already seen, every commit has a unique hash ID—some big ugly string such as b5101f929789889c2e536d915698f58d5c5c6b7a. These IDs are the same across every Git that shares this repository. (The one I've listed here is a commit in the Git repository for Git itself.)

Each commit retains, for as long as the commit itself exists, a full snapshot of all the files. Well, it has all the files that are in the snapshot, but that's like saying that all blue crayons are blue: it's kind of silly. The point is that it's a snapshot of the files. It doesn't say "change README this way", which would require going back and finding how README looked before. It just says we have README and it looks like this. If the snapshot doesn't have a file, Git should perhaps remove the file (though this part gets a little trickier because Git allows you to have "untracked files"). In any case the files in the snapshot are frozen forever, or at least, for as long as the commit exists.

But each snapshot also has some metadata, such as your name (if you made the commit), when you made it, why you made it—your log message—and, crucially for our purposes, the hash ID of the previous commit. That metadata, like the files, is frozen forever, or for as long as the commit exists. Note that when you have Git show you a commit, Git shows (some of) the metadata, and then shows you the difference between this commit's files and this commit's parent's files. It can do that because of the parent, or previous, commit's hash ID, saved as part of this commit.
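To make the metadata concrete, here is a minimal sketch that dumps a raw commit object in a throwaway repository. The repository path, the identity and the file contents are all made up for illustration; only the git commands themselves are real.

```shell
set -eu
# Scratch repository; path, identity and file contents are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make the branch name predictable
git config user.name "Example User"
git config user.email "user@example.com"

echo "first" > README
git add README
git commit -q -m "first commit"
echo "second" > README
git commit -q -am "second commit"

# Raw commit object: a tree line (the snapshot), a parent line (the previous
# commit's hash ID), author and committer lines, then the log message.
raw=$(git cat-file -p HEAD)
printf '%s\n' "$raw"

# The stored parent is exactly the commit that HEAD~1 resolves to.
parent=$(printf '%s\n' "$raw" | sed -n 's/^parent //p')
```

The very first commit in a repository would print no parent line at all, since there is nothing before it.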

What this means for us is that we can draw out strings of backwards-pointing commits, with each commit naming its parent:

A <-B <-C

If the hash IDs were simple uppercase letters like this, we could just scan them all and find the last one, but they're not: they seem random (though actually they're strictly determined by all the bits saved inside the commit, which is why we can't change any of the bits inside the commit!). So Git needs a way to save the hash ID of the last commit, from which it can work backwards.

Saving the hash ID of the last commit in the branch is the function of the branch names, like master:

A--B--C--D--E   <-- master

We, and Git, start at the end by using the name master to get the hash ID (here E). Then we work backwards, following those unchangeable internal arrows.

The branch name arrows—the hash IDs stored under the names—can change, as we'll see.

Adding commits to a branch

To add a new commit to the current branch, we have Git save a snapshot of the files, add our name and email and our log message, and save the hash ID of the current commit. Git writes all of that into the new commit, which thereby acquires a new hash ID:

A--B--C--D--E   <-- master
             \
              F

Now Git just updates the name to record the new latest commit:

A--B--C--D--E
             \
              F   <-- master

which we can then straighten out:

A--B--C--D--E--F   <-- master

Note that it's the commits, and their relationships to each other—the internal, backwards-pointing arrows—that are crucial here. The names do matter, but only because that's how we find the commits. The commits themselves form a Directed Acyclic Graph or DAG. The names let us get into the DAG. Nothing in the DAG itself can ever change, but the names can move, and we can add new commits.
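As a rough sketch of the name moving, the following scratch-repository script records what master holds before and after one new commit. The file name and commit messages are placeholders chosen to match the drawing.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo E > file.txt
git add file.txt
git commit -q -m "E"
before=$(git rev-parse master)   # hash ID currently stored under the name

echo F > file.txt
git commit -q -am "F"
after=$(git rev-parse master)    # same name, now holding the new commit's ID
parent_of_new=$(git rev-parse master~1)

echo "master moved from $before to $after"
```

Commit E itself is untouched: the new commit F simply records E as its parent, and the name master now holds F's hash ID.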

(We're free to draw the DAG however we want, bending the connecting arrows, as long as they still connect. I use lines rather than arrows in the text here because it's hard to find good text characters to do diagonal arrows.)

Adding more branches to the graph

Suppose we have our six commits now:

A--B--C--D--E--F   <-- master

and want to make a new branch. We use either git branch or git checkout -b to make the branch, so now we have:

A--B--C--D--E--F   <-- BranchA, master

The two names both point to the same commit, F. All six commits are now on both branches.

If we add a new commit, obviously we'll get:

A--B--C--D--E--F
                \
                 G

the same way we got F earlier. But which name should change? To answer that question, Git attaches the name HEAD to one of the branches:

A--B--C--D--E--F   <-- BranchA (HEAD), master

This tells Git which name to change:

A--B--C--D--E--F   <-- master
                \
                 G   <-- BranchA (HEAD)

The HEAD attachment remains when the name moves. We need to know about the attachment when we want to know: Which branch are we on? Which branch will our command affect if it affects the current branch? If we're just looking at what's in the repository, we can leave it off.
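Here is a small sketch of the HEAD attachment in a scratch repository. The branch name BranchA follows the drawing above; the file and identity are placeholders.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo F > file.txt
git add file.txt
git commit -q -m "F"

git branch BranchA          # new name, same commit as master
git checkout -q BranchA     # attach HEAD to BranchA
attached=$(git symbolic-ref --short HEAD)

echo G > file.txt
git commit -q -am "G"

# Only the name that HEAD is attached to has moved.
master_tip=$(git rev-parse master)
branch_tip=$(git rev-parse BranchA)
branch_parent=$(git rev-parse BranchA~1)
echo "HEAD is attached to $attached"
```

After this runs, master still names F, while BranchA names G, whose parent is F.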

So, with that out of the way, let's draw your existing graph more completely

You have a series of commits ending in the one you're calling A3 above, after which things get a little hairier. I like one letter names but I'll use yours here:

...--A3

Now, you say your master reaches B2 which is preceded by B1 which is preceded by A3, so there must be two more commits after:

...--A3--B1--B2   <-- master

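Meanwhile your Branch_B starts out at B2, which is preceded by C3, but that's literally impossible, because a single commit B2 cannot have B1 as its parent on master and C3 as its parent on Branch_B: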

...--A3--B1--B2   <-- master
           \
            C3--B2   <-- Branch_B

so you must have made some mistake in transcribing your commit hashes (not surprising since they're big and ugly and basically require careful cut-and-paste to avoid errors). I'm going to assume that the B2 on master is really some other ID, and replace it here with B2a:

...--A3--B1--B2a   <-- master
           \
            C3--B2   <-- Branch_B

Your Branch_C starts—well, ends?—with C2, which is preceded by C1, then B1, then A3:

            C1--C2   <-- Branch_C
           /
...--A3--B1--B2a   <-- master
           \
            C3--B2   <-- Branch_B

You can confirm this by using git log --decorate --oneline --graph master Branch_B Branch_C (or git log --all --decorate --oneline --graph; the mnemonic for these four options is "A DOG", as in Get Help From A Dog). That draws vertically-oriented graphs, which aren't as pretty or obvious to me, but are still very useful.

How to get what you want: it requires changing what you want, slightly

Now, here's what you say you would like:

        C1--C2--C3   <-- Branch_C
       /
...--A3   <-- master
       \
        B1--B2   <-- Branch_B

You can't get this. We already said that there is no power anywhere to change anything in any existing commit, and looking at what we have now, the parent of commit B2 is commit C3, for instance.

But you can get something that's probably just as good, which is: you can make a copy of B2. In fact, you probably already have—B2a and B2 are likely copies of each other.

Without worrying about the exact copying mechanism yet, let's see what happens if we make a B2b that's a copy of B2 but that has B1 as its parent:

            C1--C2   <-- Branch_C
           /
...--A3--B1--B2a   <-- master
         | \
         |  C3--B2   <-- Branch_B
          \
           B2b   <-- new-branch-b

Next, let's copy C1 to a new C1a that springs from A3:

          C1a   <-- new-branch-C
         /
        /   C1--C2   <-- Branch_C
       /   /
...--A3--B1--B2a   <-- master
         | \
         |  C3--B2   <-- Branch_B
          \
           B2b   <-- new-branch-b

Then we just need to copy C2 and C3, one by one:

          C1a--C2a--C3a   <-- new-branch-C
         /
        /   C1--C2   <-- Branch_C
       /   /
...--A3--B1--B2a   <-- master
         | \
         |  C3--B2   <-- Branch_B
          \
           B2b   <-- new-branch-b

Almost-last, we need to move the old names, Branch_B and Branch_C, so that they point to commits B2b and C3a respectively:

          C1a--C2a--C3a   <-- new-branch-C, Branch_C
         /
        /   C1--C2   [abandoned]
       /   /
...--A3--B1--B2a   <-- master
         | \
         |  C3--B2   [abandoned]
          \
           B2b   <-- new-branch-b, Branch_B

Then we need to move the name master back two steps so that it points to A3 instead of B2a, abandoning B2a entirely. That's hard to draw until we stop drawing the abandoned commits. They will still be in your repository for a while (at least 30 days by default), but hidden away so that you can't see them any more, which gives us:

          C1a--C2a--C3a   <-- new-branch-C, Branch_C
         /
        /__________
       /           \
...--A3--B1         -- master
         |
         |
          \
           B2b   <-- new-branch-b, Branch_B
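The claim that abandoned commits stay in the repository for a while can be checked with a sketch like the one below. It assumes a fresh scratch repository with default settings (so the reflog is enabled); the file name is a placeholder.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo A3 > file.txt
git add file.txt
git commit -q -m "A3"
echo B1 > file.txt
git commit -q -am "B1"
old_tip=$(git rev-parse master)

git reset -q --hard master~1      # move the name master back one commit

# The abandoned commit no longer shows up when walking back from master...
visible=$(git log --format=%s master)

# ...but the object still exists, and the reflog remembers where master was.
git cat-file -e "$old_tip"
from_reflog=$(git rev-parse 'master@{1}')
echo "master now shows: $visible; previous tip was $from_reflog"
```

As long as you still have the hash ID (or the reflog entry), a command such as git branch rescue with that hash ID would make the commit visible again; the name rescue is just an example.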

We can now drop the new-branch-[bc] names and clean up the arrangement of the drawing:

        C1a--C2a--C3a   <-- Branch_C
       /
...--A3   <-- master
       \
        B1--B2b   <-- Branch_B

Except for the suffixes here, which mean these are different hash IDs, this is just what you wanted!

Getting from here to there: adding new names

First, you just need to add the new names, pointing to the desired commits:

git branch new-branch-b <hash of B1>
git branch new-branch-c <hash of A3>

The hash IDs we choose here are the commits that will continue to be on the newly-built branches. For Branch_B, that's B1, which we can leave in place, but for Branch_C, that's commit A3, because we have to copy C1 to C1a.

Getting from here to there: copying commits

Now it's time to copy the commits. Let's copy B2 or B2a. You can use whichever you like, as long as they make the same changes and have the same commit messages, because the copying command is git cherry-pick and the way it works is very similar to what we said earlier about showing a commit:

[Git] shows you the difference between this commit's files and this commit's parent's files

Instead of showing the difference, git cherry-pick finds the difference, then applies that to whatever commit we've checked out, makes the same changes, and commits the result, using the same log message as the original commit too.
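A minimal sketch of that copying behavior, in a scratch repository: commit C1 sits on top of B1 on a branch called topic (a made-up name), and we copy only C1 onto A3. The file names are placeholders chosen so that nothing can conflict.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo base > base.txt
git add base.txt
git commit -q -m "A3"
a3=$(git rev-parse HEAD)

git checkout -q -b topic
echo extra > extra.txt
git add extra.txt
git commit -q -m "B1"             # a commit we do not want to carry along
echo feature > feature.txt
git add feature.txt
git commit -q -m "C1"
orig=$(git rev-parse HEAD)

git checkout -q master
git cherry-pick "$orig" >/dev/null    # apply C1's changes on top of A3
copy=$(git rev-parse HEAD)
copy_subject=$(git log -1 --format=%s)
echo "original $orig copied as $copy"
```

The copy carries the same log message and the same change (feature.txt appears), but it has a different parent and therefore a different hash ID, and extra.txt from B1 is not brought along.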

So we just need to:

git checkout new-branch-b
git cherry-pick <hash-of-B2a or whatever>

which gets us this far, when we draw the graph and leave out a lot:

...--A3
       \
        B1--B2b   <-- new-branch-b

Then we need to build up new branch C the same way:

git checkout new-branch-c
git cherry-pick <hash-of-C1>
git cherry-pick <hash-of-C2>
git cherry-pick <hash-of-C3>

The result, again leaving out lots of graph-drawing, is the desired:

        C1a--C2a--C3a   <-- new-branch-c
       /
...--A3

The last step is to make master identify commit A3, and for that we just need to git checkout master and then git reset --hard:

git checkout master
git reset --hard <hash-of-A3>

(Note: if you're doing this with hash IDs, it's a good idea to cut and paste them, and/or save them in files, as it's far too easy to get typos here. There are tricks to use relative names but I'm not going to include them in this answer.)

The git reset command affects whichever branch name HEAD is attached to, and the git cherry-pick command makes new commits on whichever branch name HEAD is attached to. That's why we had to git checkout each of those names.

At this point, we have the new branch names, and master points to A3, but we have not updated the two other branch names. As before, we can use git checkout and git reset --hard here:

git checkout Branch_B
git reset --hard new-branch-b
git checkout Branch_C
git reset --hard new-branch-c

We don't need hash IDs this time, because for commands like git cherry-pick and git reset, the name of a branch means the commit whose ID is stored in that branch name.

Once we've finished all of this we can just delete the names new-branch-b and new-branch-c:

git branch -D new-branch-b
git branch -D new-branch-c

The -D is the forcible delete, which makes Git do it even if Git thinks it's not safe. (Git's idea of when this is safe and when this isn't is, um, a good try, but not great.)
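To try the whole sequence safely before touching a real repository, something like the following sketch may help. It rebuilds roughly the history from the logs in the question inside a scratch repository and then runs the steps above. The helper function mk, the one-file-per-commit layout and the identity are assumptions made for illustration, and there is no remote involved.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one commit per call, each adding its own file, so the
# cherry-picks below cannot conflict.
mk() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

# Rebuild the logs from the question.
mk A3
mk B1
git branch Branch_C
git branch Branch_B
mk B2                                   # plays the role of B2a on master
git checkout -q Branch_C
mk C1
mk C2
git checkout -q Branch_B
git reset -q --hard Branch_C
mk C3
mk B2

# Save the hash IDs we need before moving any names.
a3=$(git rev-parse master~2)
b1=$(git rev-parse master~1)
b2=$(git rev-parse Branch_B)
c3=$(git rev-parse Branch_B~1)
c2=$(git rev-parse Branch_B~2)
c1=$(git rev-parse Branch_B~3)

# New names pointing at the commits that stay.
git branch new-branch-b "$b1"
git branch new-branch-c "$a3"

# Copy the commits.
git checkout -q new-branch-b
git cherry-pick "$b2" >/dev/null
git checkout -q new-branch-c
git cherry-pick "$c1" >/dev/null
git cherry-pick "$c2" >/dev/null
git cherry-pick "$c3" >/dev/null

# Move the old names, then drop the temporary ones.
git checkout -q master
git reset -q --hard "$a3"
git checkout -q Branch_B
git reset -q --hard new-branch-b
git checkout -q Branch_C
git reset -q --hard new-branch-c
git branch -D new-branch-b new-branch-c >/dev/null

git log --all --decorate --oneline --graph
```

This sketch is purely local. The question shows origin/master at the old tip, so updating the remote would additionally need a forced push, which is not shown here.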

Cherry-pick can have merge conflicts

This isn't particularly likely for your case, but it's important to know for the future. Every git cherry-pick is actually a kind of merge. Git is going to "merge" the changes made in the commit itself—computed by comparing the parent commit to the commit, just like git show compares the two—into the current commit, finding your current commit's changes by comparing the parent commit of the cherry-picked commit to the current (HEAD) commit.

If you are a bit confused here, don't worry: The preceding paragraph is definitely hard to read. It's really best shown by illustration:

       o--o--...--P--C--o--...--o   <-- other-branch
      /
...--o
      \
       o--o--H   <-- your-branch (HEAD)

You run git cherry-pick <hash of C>. Git:

Diffs P vs C: that's what they changed.
Diffs P vs H: that's what you changed, sort of.
Combines these two sets of changes, applying the combined changes to the files from P (i.e., repeating "what you changed" just to get back to what's in H, but then adding "what they changed" to get from H to the result).
If the combining works, makes a new commit C'. Otherwise, stops and leaves a mess.
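The two comparisons in the steps above can be looked at directly. The sketch below builds a tiny history shaped like the drawing (the commit letters match it; the file names and identity are made up) and prints which files each diff touches before running the cherry-pick.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo o > base.txt
git add base.txt
git commit -q -m "o"

git checkout -q -b other-branch
echo p > p.txt
git add p.txt
git commit -q -m "P"
p=$(git rev-parse HEAD)
echo c > c.txt
git add c.txt
git commit -q -m "C"
c=$(git rev-parse HEAD)

git checkout -q master
echo h > h.txt
git add h.txt
git commit -q -m "H"

theirs=$(git diff --name-only "$p" "$c")   # what they changed: c.txt
ours=$(git diff --name-only "$p" HEAD)     # P vs H: h.txt added, p.txt gone
echo "they changed: $theirs"
echo "we differ from P in:" $ours

git cherry-pick "$c" >/dev/null            # combine both sets of changes
result=$(git ls-files | tr '\n' ' ')
echo "files after cherry-pick: $result"
```

The "sort of" shows up in the second diff: relative to P, your side appears to have removed p.txt, simply because H never had it. The combined result keeps your files and adds only what C added.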

When this works without effort on your part, the effect is that whatever changed from P to C, those same changes are now in the new commit C' that git cherry-pick made that's a copy of commit C:

       o--o--...--P--C--o--...--o   <-- other-branch
      /
...--o
      \
       o--o--H--C'  <-- your-branch (HEAD)

When it goes wrong, Git stops with a merge conflict, the same way it stops in git merge when something goes wrong. At that point it's your job to complete the "merge" (the cherry-pick, in this case) and then run git commit or git cherry-pick --continue to finish the job. You can use all the same tools that you would use during git merge, so whatever method you like for git merge works here too.
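For completeness, here is a sketch of a cherry-pick that does stop with a conflict and is then finished by hand. Everything in it (file name, contents, the trivial "resolution") is invented for illustration; setting GIT_EDITOR=true is only there so the script does not open an editor for the commit message.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo "original line" > file.txt
git add file.txt
git commit -q -m "o"

git checkout -q -b other-branch
echo "their P" > file.txt
git commit -q -am "P"
echo "their C" > file.txt
git commit -q -am "C"
c=$(git rev-parse HEAD)

git checkout -q master
echo "our H" > file.txt
git commit -q -am "H"

# Both sides changed the same line relative to P, so Git cannot combine them.
if git cherry-pick "$c" >/dev/null 2>&1; then
  status=clean
else
  status=conflict
fi
echo "cherry-pick result: $status"

# Resolve by hand (here: just write the contents we want), then continue.
echo "resolved" > file.txt
git add file.txt
GIT_EDITOR=true git cherry-pick --continue >/dev/null
subject=$(git log -1 --format=%s)
```

If you would rather give up than resolve, git cherry-pick --abort puts the branch back where it was before the cherry-pick started.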