I'm trying to solve a problem with my git history.

I have two branches, let's call them a and b. a is the source branch for my repository, and b was branched off it. Many pull requests are merged into a (a mix of merge commits and squashed commits), but b is not rebased, nor are these PRs re-opened against b. To update b, I (naively) resort to cherry-picking the merge commits from a that are applicable to b.

Over time, as the code is tested and approved in a, b has become equal to a, with an empty diff, but the history shows (as expected) a divergence in commits: b is seen as many commits ahead of a on GitHub.

I would like to eliminate this divergence. If I merge either branch into the other, I get a long string of commits with no changes (except in some cases on GitHub, where it gets confused), which would pollute the history of whichever branch receives them.

e.g. git merge a from b or git merge b from a shows a long string of commits in the log. (Edit: this is with --no-ff).

If I perform a squash, I get a nice message listing those commits with no changes.

e.g. git merge a --squash from b or git merge b --squash from a requires git commit --allow-empty.

But that does not prevent the "this branch is ahead/behind" situation.

So my question is: how do I produce a single commit (it can be in either or both branches) that prevents the "ahead/behind" situation?

Edit: This is mainly for ease of tracing what's happened on GitHub. I know how Git works, and I know how to do this by destroying the history of one branch or the other with a force push, but I'm trying to preserve both in a way that is easily introspectable.

Jon Rowe

3 Answers

The only real answer here is that you don't.

What Git has, what it stores, is a graph of commits. Each commit has its own unique hash ID, never to be used by any other commit in this or any other repository.[1] Each commit also points (backwards) to its parent commit(s); this is what forms the graph.

That is, given a linear chain of commits, using uppercase letters to stand in for their actual big ugly hash IDs, we can draw them like this:

... <-F <-G <-H ...

The commit whose hash is H contains the hash ID of commit G, so that H points to G. The commit whose hash is G contains the hash ID of commit F: G points to F. This goes on all the way back to an initial commit, that doesn't point backwards any further. (Usually there is just one such commit, the root commit, though Git permits multiple root commits.)
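To see these backwards pointers for yourself, here is a minimal sketch. It builds a throwaway repository in a temporary directory; the commit names F, G, H, the `commit` helper, and the example identity are only illustrative. It then reads the parent hash that is stored inside commit H.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # pin the initial branch name
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one small file per commit, named after the commit.
commit() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1"; }

commit F; commit G; commit H              # F <-G <-H  <-- master

# The raw commit object for H contains a "parent" line holding G's hash.
git cat-file -p HEAD
recorded_parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')
parent_subject=$(git log -1 --format=%s "$recorded_parent")
echo "H points back to: $parent_subject"

# The root commit (F here) has no parent line at all.
root_parent_lines=$(git cat-file -p HEAD~2 | grep -c '^parent ' || true)
echo "parent lines in the root commit: $root_parent_lines"
```

Real hash IDs will differ on every run, since they depend on timestamps among other things, which is why the sketch compares commit subjects and not hashes.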

A branch name like master or develop or whatever just points to one specific commit. That commit is the last commit in that branch. In situations like this:

          I--J   <-- branch1
         /
...--G--H
         \
          K--L   <-- branch2

we need two branch names to find all the commits, because Git works by having a branch name point to the last commit that we'll call "part of the branch". Commit J is the last commit on branch1. J points back to I, which points back to H. Meanwhile the name branch2 points to commit L, which points back to K, which points back to H. Note that commit H is on both branches (as are commits G and earlier).

If H has a name pointing to it:

          I--J   <-- branch1
         /
...--G--H   <-- master
         \
          K--L   <-- branch2

that's fine: commit H is the last commit of branch master, and H is on all three branches.

Commits are called reachable from some point if, by following the internal backwards-pointing arrows, we can walk from that point (whatever it is) to the commit we're calling reachable. So H is reachable from all three branches, but I and J are reachable only from branch1.

When Git says that one branch name is "ahead of" another, this is because there are commits that are only reached from that one branch name. Hence branch1 is 2 commits ahead of master, and two commits ahead of branch2. master is not at all ahead of either of the other names, but branch2 is 2 commits ahead of both master and branch1.
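These counts can be reproduced with `git rev-list --count`, where `X..Y` means "commits reachable from Y but not from X". The following is a sketch on a throwaway repository that mirrors the drawing above; the `commit` helper and the example identity are assumptions made only for the demonstration.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # pin the initial branch name
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one small file per commit, named after the commit.
commit() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1"; }

commit G; commit H                        # ...--G--H   <-- master
git checkout -q -b branch1 master; commit I; commit J
git checkout -q -b branch2 master; commit K; commit L

# X..Y counts commits reachable from Y that are not reachable from X.
b1_ahead_of_master=$(git rev-list --count master..branch1)
b1_ahead_of_b2=$(git rev-list --count branch2..branch1)
b2_ahead_of_b1=$(git rev-list --count branch1..branch2)
master_ahead_of_b1=$(git rev-list --count branch1..master)

echo "branch1 ahead of master:  $b1_ahead_of_master"
echo "branch1 ahead of branch2: $b1_ahead_of_b2"
echo "branch2 ahead of branch1: $b2_ahead_of_b1"
echo "master ahead of branch1:  $master_ahead_of_b1"
```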

A regular git merge operation combines changes and makes a new commit, and the new commit points back to both earlier commits:

          I--J
         /    \
...--G--H      M   <-- branch1 (HEAD)
         \    /
          K--L   <-- branch2

The multiple backwards arrows coming out of M, pointing to both J and L, mean that branch1 is now three commits ahead of branch2: from branch1 we can reach commits M, J, and I, as well as L and K and of course H and earlier. From branch2, we can reach L and K and H and earlier, but we cannot go forward from H to I.
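As a sketch of the same situation on a throwaway repository (the `commit` helper and the example identity are illustrative assumptions), a true merge leaves branch1 three commits ahead of branch2 and zero behind, and the second parent of M is the tip of branch2:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # pin the initial branch name
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one small file per commit, named after the commit.
commit() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1"; }

commit G; commit H
git checkout -q -b branch1 master; commit I; commit J
git checkout -q -b branch2 master; commit K; commit L

# Make merge commit M on branch1, with parents J (first) and L (second).
git checkout -q branch1
git merge -q --no-ff -m M branch2

ahead=$(git rev-list --count branch2..branch1)    # M, J, I
behind=$(git rev-list --count branch1..branch2)   # nothing: L is reachable via M
second_parent=$(git log -1 --format=%s HEAD^2)

echo "branch1 is $ahead ahead of and $behind behind branch2"
echo "second parent of M: $second_parent"
```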

Using git merge --squash, Git will make a new commit with the same content as the merge M, but instead of two backwards arrows, you get just one:

          I--J
         /    \
...--G--H      M   <-- branch1 (HEAD)
         \
          K--L   <-- branch2

It is no longer possible to move from M to L, so branch2 remains two commits "ahead of" branch1.
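The squash case can be sketched the same way, again on a throwaway repository with an illustrative `commit` helper and example identity. M ends up with the content of K and L but only one parent, so branch2 still counts as two commits ahead:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # pin the initial branch name
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one small file per commit, named after the commit.
commit() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1"; }

commit G; commit H
git checkout -q -b branch1 master; commit I; commit J
git checkout -q -b branch2 master; commit K; commit L

# --squash stages the combined changes but does not commit or record a merge.
git checkout -q branch1
git merge -q --squash branch2
git commit -q -m M

parent_count=$(git cat-file -p HEAD | grep -c '^parent ')
still_ahead=$(git rev-list --count branch1..branch2)   # K and L, unreachable from M

echo "M has $parent_count parent(s)"
echo "branch2 is still $still_ahead commits ahead of branch1"
git cat-file -e HEAD:K.txt && echo "M contains the content from K"
```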

Note, however, that commit M has anything good from K and L in it. So what you can do is simply delete the name branch2 entirely:

...--G--H--I--J--M   <-- branch1 (HEAD)
         \
          K--L   [abandoned]

If there is no name by which you can find commit L, you'll never see it again. Since commit L was the method you used to find commit K, you will never see that one again either. They remain in your Git repository for a while. Exactly how long is hard to predict, as it depends both on any hidden (reflog-only) names that still find commit L, and on how soon the maintenance git gc command gets around to sweeping up unused commits and really removing them. But deleting the name branch2 deletes the branch2 reflog, so the only reflog that probably still remembers L's raw hash ID is that for HEAD.

(If you have other branch and/or tag names that remember commits L and/or K, they'll still be findable that way, and won't ever be collected by the git gc garbage collector.)
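Here is a sketch of that abandonment on a throwaway repository (the `commit` helper and the example identity are illustrative assumptions). After the name branch2 is deleted, L no longer shows up in `git log --all`, yet the object still exists and the HEAD reflog still mentions it, assuming reflogs are enabled as they are by default in a non-bare repository:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # pin the initial branch name
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one small file per commit, named after the commit.
commit() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1"; }

commit G; commit H
git checkout -q -b branch1 master; commit I; commit J
git checkout -q -b branch2 master; commit K; commit L
git checkout -q branch1
git merge -q --squash branch2 && git commit -q -m M

l_hash=$(git rev-parse branch2)           # remember L's raw hash ID
git branch -q -D branch2                  # -D, since branch2 is not merged by ancestry

# L is no longer reachable from any branch name...
visible=$(git log --all --format=%s | grep -cx L || true)
# ...but the object has not been garbage-collected yet...
object_type=$(git cat-file -t "$l_hash")
# ...and the HEAD reflog still records the time HEAD was at L.
in_head_reflog=$(git log -g --format=%H HEAD | grep -c "$l_hash" || true)

echo "L visible in log --all: $visible; object type: $object_type"
echo "HEAD reflog entries pointing at L: $in_head_reflog"
```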

Usually, after git merge --squash, the right thing to do is to delete the other branch name, and totally forget that those commits ever existed.


[1] Technically, any Git repositories that never "meet" could re-use IDs from each other harmlessly. In practice, they just don't anyway. The chance of one Git object having the same hash ID as some other Git object is, on its own, one in 2^160. Because of the Birthday paradox, this chance rises pretty fast as the number of objects increases, but it's still on the order of 10^-17 when you have fewer than many trillions of objects. See also Git hash duplicates.

torek

resort to cherry-picking […] Over time as the code is tested and approved in a, b is now equal to a, with an empty diff, but the history shows, (as expected) a divergence in commits. I would like to eliminate this divergence, if I perform merge commits from either one into the other, I get a long string of commits […], which would pollute the history

You either have different commits in the histories, or the same ones. You can't have it both ways.

But you can easily not list the merged commits. git log --first-parent shows you only the commits on your main ancestry, without listing any merged-in history details. This is probably what you want. Folding a long string of commits into a single summary-merge commit, which can then be listed alone with --first-parent, is a useful trick. Look into --no-ff merges: they are how you record a summary merge, which puts the merged history on a second-parent branch, whereas a fast-forward would simply advance over the whole list and keep it as part of the main line.
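A small sketch of the difference, on a throwaway repository with made-up branch and commit names (the `commit` helper and the example identity are likewise only illustrative): after a --no-ff merge of a three-commit topic branch, the full log lists five commits, while --first-parent lists only the base commit and the summary merge.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # pin the initial branch name
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one small file per commit, named after the commit.
commit() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1"; }

commit base
git checkout -q -b topic master; commit t1; commit t2; commit t3

# Without --no-ff this would fast-forward and put t1..t3 on the main line.
git checkout -q master
git merge -q --no-ff -m "Merge topic (summary)" topic

full=$(git rev-list --count master)
main_line=$(git rev-list --count --first-parent master)

echo "all commits: $full, first-parent commits: $main_line"
git log --first-parent --format=%s master
```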

jthill
  • Yeah, I realise the cherry-picks have caused the problem, and it's the `--no-ff` merge strategy that's producing the spurious commits. The problem is that merging with the "normal" (`--no-ff`) strategy looks OK in the logs but looks terrible on GitHub, which is one of our primary Git tools. – Jon Rowe Feb 19 '20 at 22:57
  • I'd recommend any of (a) not using GitHub as the primary tool, (b) encourage GitHub to provide a "view with `--first-parent`" option, and/or (c) start a competitor to GitHub (they have a few now, with both Atlassian and GitLab out there...). – torek Feb 19 '20 at 23:09
  • So, the proper title on your question is, "how do I display just the first-parent history on GitHub"? – jthill Feb 20 '20 at 00:58
  • btw, the cherry-picks haven't caused any problems here. What's causing problems would be fixed by you using any of torek's recommendations, i.e. finding or making some way to use the `--first-parent` display on your web server. That's what's causing the problem: a remarkably unhelpful corner in existing GUI history displays. Except the one that comes with Git. That will display your first-parent histories. – jthill Feb 20 '20 at 16:06
My solution was to recreate the branch again.

I created a new branch off a and cherry-picked the merge commits from b, using:

git cherry-pick -m 1 --allow-empty --keep-redundant-commits <sha>

Where conflicts arose, I resolved them in favour of HEAD and finished with git commit --allow-empty.

I ended up with a clean version of the history I cared about in b (which was the dates of the merges of various PRs), based off a. This allows proper merges from b into a to keep it up to date, and vice versa.
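Here is a reduced sketch of the idea on a throwaway repository. The branch names feature and b-rebuilt, the `commit` helper, and the example identity are made up for the demonstration, and it covers only the conflict-free case: b carries a merge commit for a change that a received as a squashed commit, so the two trees match but the histories diverge. Rebuilding from a with the cherry-pick flags above gives a branch that is one (empty) commit ahead of a and zero behind.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/a        # the source branch is called "a" here
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one small file per commit, named after the commit.
commit() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1"; }

commit base
git checkout -q -b b a
git checkout -q -b feature b; commit feature

# b gets the change as a merge commit; a gets it as a squashed commit.
git checkout -q b
git merge -q --no-ff -m "Merge feature (PR)" feature
merge_sha=$(git rev-parse b)
git checkout -q a
git merge -q --squash feature && git commit -q -m "feature (squashed)"
git diff --quiet a b                      # same content, divergent history

# Rebuild b on top of a, keeping the merge as a (now empty) commit.
git checkout -q -b b-rebuilt a
git cherry-pick -m 1 --allow-empty --keep-redundant-commits "$merge_sha"

ahead=$(git rev-list --count a..b-rebuilt)
behind=$(git rev-list --count b-rebuilt..a)
echo "b-rebuilt is $ahead ahead of and $behind behind a"
```

Because b-rebuilt contains everything reachable from a, merging it into a afterwards is a plain fast-forward, or a single merge commit with --no-ff.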

Jon Rowe