How to find pairs/groups of most related commits

Question

I was working on a bigger change in a project, in a local branch (for a GitHub PR, specifically this, e.g. this latest (intermediate) branch HEAD), and this resulted in quite a lot of commits.

Now that it is (mostly) ready, I want to clean up those commits, i.e. squash commits together and use nicer commit messages. However, in this specific example, many changes belong together (all commits prefixed with "better subnet logic, WIP"), and I did not know a good way to separate things beforehand.

So the original intent was to squash all those "better subnet logic ..." commits together.

However, now that the changes have become so big, I was thinking about splitting it up.

These commits also contain intermediate TODOs/discussions/thoughts which are later cleaned up. Sometimes they also contain an attempt at solving some problem, which I delete in a later commit and then solve in another way.

Is there a good way (or tool) to reorder these commits and automatically find pairs (or groups) of commits which would lead to smaller changes when squashed together?

E.g. let's say some commit adds this comment:

# TODO this needs to be fixed
#   maybe can do way A...
#   or maybe way B...?

And then some later commit removes this comment and instead adds some other comment there, or maybe some code, e.g.:

# We use way B because ...
B(...)

I would want to find such commits automatically which belong together, and when squashed together, would clean up the history.

How?

I could also write my own script/tool to do that for me.

The trivial solution is to just squash all, which would lead to the minimal amount of changes. But this trivial solution is not interesting.

You could limit it by the number of commits squashed. E.g. let's assume we want to find exactly two commits, out of all the local commits in the branch, which reduce the amount of changes the most among all pairs of commits. This should be simple.
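To make "reduce the amount of changes" concrete, here is a toy sketch that works on in-memory file versions rather than a real repository. It uses Python's difflib to count added plus removed lines, applied to the TODO example above. The helper name changed_lines and the single "x = 1" context line are made up for illustration; a real tool would measure git diffs instead.

```python
import difflib


def changed_lines(old, new):
    """Count added plus removed lines between two versions of a file."""
    count = 0
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            count += (i2 - i1) + (j2 - j1)
    return count


# Hypothetical file contents: the state before either commit ...
base = ["x = 1"]
# ... after the commit that adds the TODO comment ...
after_first = base + [
    "# TODO this needs to be fixed",
    "#   maybe can do way A...",
    "#   or maybe way B...?",
]
# ... and after the later commit that replaces it.
after_second = base + [
    "# We use way B because ...",
    "B(...)",
]

# Reviewing both commits one by one: 3 added, then 3 removed + 2 added.
separately = changed_lines(base, after_first) + changed_lines(after_first, after_second)
# Reviewing them squashed: only the 2 final lines are added.
squashed = changed_lines(base, after_second)
# How much the diff against the base grows when the second commit is applied.
rel_diff = squashed - changed_lines(base, after_first)

print(separately, squashed, rel_diff)
```

Here the pair is a good squash candidate: the second commit makes the diff against the base smaller, not larger, so the relative difference is negative.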

Small important extension: Only allow such a commit pair if the commits can be reordered such that they can be squashed without conflict.

Maybe there are better ways?

Does such a tool/script already exist?

Not directly the question, but related, and maybe you have further advice on that:

Unfortunately, it gets more complicated than that, because many of the commits contain multiple unrelated changes (within this big "better subnet logic" topic). So to make this work, I probably need to split the commits up even more.

Maybe my workflow is also not optimal. If you have advice on how to improve my workflow, please don't hesitate to comment or put this into the answer as well. I.e. how to avoid this complicated cleanup procedure afterwards.

I asked a related question recently on Reddit: How to cleanup a branch (PR) with huge number of commits.

Albert · Answer 1 · 2021-03-21T15:15:35.797

I actually implemented exactly that now here.

Given some branch, for all commit pairs (commit1, commit2), it will:

1. Check out a new temp branch, base = parent of commit1.
2. Cherry-pick commit1. Count changes from git diff base..HEAD -> diff1.
3. Cherry-pick commit2. Count changes from git diff base..HEAD -> diff2.
4. rel_diff = diff2 - diff1. If this is negative, the diff became shorter.

Then sort by rel_diff.
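The steps above could be sketched roughly as follows. This is not the linked implementation, only a minimal illustration under some assumptions: changes are counted as added plus deleted lines from git diff --numstat, a detached HEAD is used in place of a named temp branch, the working tree is clean, and none of the commits are merge commits. The function names (git, count_changes, rel_diff, rank_pairs) are hypothetical.

```python
import subprocess
from itertools import combinations


def git(*args, cwd="."):
    """Run a git command and return its stdout; raises on a non-zero exit."""
    return subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    ).stdout


def count_changes(numstat):
    """Sum added + deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        added, deleted, _path = line.split("\t", 2)
        if added != "-":  # binary files are reported as "-<TAB>-<TAB>path"
            total += int(added) + int(deleted)
    return total


def rel_diff(commit1, commit2, cwd="."):
    """Return diff2 - diff1 for one pair, or None if a cherry-pick fails."""
    base = git("rev-parse", f"{commit1}^", cwd=cwd).strip()
    # Assumption: a detached HEAD stands in for the temp branch.
    git("checkout", "--quiet", "--detach", base, cwd=cwd)
    try:
        git("cherry-pick", commit1, cwd=cwd)
        diff1 = count_changes(git("diff", "--numstat", base, "HEAD", cwd=cwd))
        git("cherry-pick", commit2, cwd=cwd)
        diff2 = count_changes(git("diff", "--numstat", base, "HEAD", cwd=cwd))
        return diff2 - diff1
    except subprocess.CalledProcessError:
        # The pair cannot be applied back to back (e.g. a conflict); skip it.
        subprocess.run(["git", "cherry-pick", "--abort"], cwd=cwd, capture_output=True)
        return None
    finally:
        # Switch back to whatever was checked out before.
        git("checkout", "--quiet", "-", cwd=cwd)


def rank_pairs(commits, score=rel_diff):
    """Score all pairs (commits listed oldest first), most shrinking first."""
    results = []
    for commit1, commit2 in combinations(commits, 2):
        d = score(commit1, commit2)
        if d is not None:
            results.append((d, commit1, commit2))
    return sorted(results)
```

A possible way to use it is to collect the branch's commits oldest first, e.g. with git rev-list --reverse master..HEAD (assuming master is the base branch), and pass that list to rank_pairs. Each pair costs several git invocations, so the run time grows quadratically with the number of commits.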

I don't really know whether this is a good algorithm. Also, the current implementation is quite slow.

For my example, I get this result:

Done. Results:
-17 commits: ['7266b196728b90dbd79d1d397ba55426dd72bfc5 (better subnet logic, WIP, some discussion...)', '9607184c1c4d75ed7f5b3d6c2802709491414ac6 (better subnet logic, WIP, extra net rollback, fixes)']
-6 commits: ['9607184c1c4d75ed7f5b3d6c2802709491414ac6 (better subnet logic, WIP, extra net rollback, fixes)', 'bcc8c2599b68f3e6bfa0c9e14fbdaf5714c24a8e (better subnet logic, WIP, cleanup)']
-5 commits: ['d3f22270fed703887896af1224bf28b2e9aea172 (better subnet logic, WIP, more)', 'cf14ef08fd2e8e0d445124247ee50758f04152f0 (better subnet logic, WIP, cleanup)']
-4 commits: ['925c0b4ef391a8aa0acf5cebf1efb429c2f42677 (better subnet logic, WIP)', 'bc1603553e2e589e447d3b73db8b3fd038b9427e (better subnet logic, WIP, cleanup SwitchLayer)']
-2 commits: ['7436240d11e3d63a985c69256864eb8c9b6485b2 (better subnet logic, WIP)', 'f2cd095fb41f89b436f3a527b3ec95076e73d576 (better subnet logic, WIP, cleanup)']
-1 commits: ['d3f22270fed703887896af1224bf28b2e9aea172 (better subnet logic, WIP, more)', 'dcb01cc32366aa54275a4196460adf514e276ef7 (better subnet logic, WIP, cleanup)']

I.e. that tells me, for cleaning up, the best first thing I could do is to squash 7266b196 and 9607184.

score 1 · Answer 2 · answered Sep 01 '21 at 22:44

When you've got that kind of history where you were snapshotting in-the-moment progress and the meaning of each step is unclear, the best way to organize it for presentation is often to rebuild from a clean reset:

git reset `git merge-base @ master`

and now your work tree has the result you want but no history, so construct the history you would have made if you'd known what you were doing all along:

git add -N .  # in case there's new files you added along the way
git commit --patch # commit just the change hunks you want for this step

and repeat until done. You can also use git add --patch and git reset --patch to stage and unstage specific hunks piecemeal before committing, and git diff --cached to have a look at the specific changes you've staged so far for this next commit.
