1

Consider a repository with a large number of commits (more than 20 thousand) on a single branch without any merges (a straight chain of commits). I'd like to squash each run of consecutive commits by the same author into a single commit, for all authors, creating a new, shorter history. Example:

  • commit 09 - Author BBBB
  • commit 08 - Author BBBB
  • commit 07 - Author AAAA
  • commit 06 - Author AAAA
  • commit 05 - Author AAAA
  • commit 04 - Author CCCC
  • commit 03 - Author CCCC
  • commit 02 - Author AAAA
  • commit 01 - Author BBBB

It would end up like this:

  • commit 05 - Author BBBB
  • commit 04 - Author AAAA
  • commit 03 - Author CCCC
  • commit 02 - Author AAAA
  • commit 01 - Author BBBB

How can I script this with git?

Luciano
  • I think there is no suitable way to do so. If you rebase or squash your history, beware that each commit depends on its previous commit. Otherwise: create a branch for each author, squash the commits (deleting unnecessary commits), merge this, push to master – nologin Apr 01 '19 at 20:12
  • In other words: you will (most likely) get lots of conflicts. – eftshift0 Apr 01 '19 at 20:54

2 Answers

4

Based on this answer https://stackoverflow.com/a/46403701/926064, I ended up with this one. It really worked like a charm:

$ GIT_EDITOR='cat' \
GIT_SEQUENCE_EDITOR='todofile=$1; awk '"'"'{if ($1 != "#" && $1 != "") { author=$3; if (lastauthor != author) { lastauthor=author; printf "pick %s %s\n", $2, $3 } else { printf "squash %s %s\n", $2, $3 }}}'"'"' $todofile>$todofile.temp; mv -f $todofile.temp $todofile; cat $todofile' \
git -c "rebase.instructionFormat=%ae" rebase -i $(git log --oneline --reverse --pretty=format:%H  | head -n1)

Notes:

The first one, GIT_EDITOR, ensures that the squash commit messages are preserved as git generates them by default, without touching them: they will be the concatenated messages of the squashed commits.

The second one, GIT_SEQUENCE_EDITOR, does the actual work: it filters the rebase instruction list and decides which commits get squashed, based on the author. It depends on the author's email, so when we call git rebase we must format the "rebase instructions" so that git puts the author's email in the list.

The third and last one is the git rebase itself, where rebase.instructionFormat=%ae puts into each rebase instruction the information (the author's email) that we need when processing (editing) the instruction list.

Just for convenience, below is the formatted awk script that is inlined in the GIT_SEQUENCE_EDITOR variable:

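{
    # Skip comment lines and blank lines in the todo list
    if ($1 != "#" && $1 != "") {
        # With rebase.instructionFormat=%ae, $2 is the hash and $3 the author email
        author=$3;
        if (lastauthor != author) {
            # Author changed: start a new commit
            lastauthor=author;
            printf "pick %s %s\n", $2, $3
        } else {
            # Same author as the previous line: fold into that commit
            printf "squash %s %s\n", $2, $3
        }
    }
}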
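To see what the filter does without touching a repository, here is a minimal sketch that runs the same awk logic over a made-up todo list. The function name rewrite_todo, the hashes, and the email addresses are hypothetical, and the sample only assumes the "pick <hash> <email>" shape that rebase.instructionFormat=%ae is expected to produce:

```shell
# Hypothetical helper: the same awk filter as above, wrapped in a function
# that reads a todo list on stdin and writes the rewritten list to stdout.
rewrite_todo() {
    awk '
        $1 != "#" && $1 != "" {
            author = $3
            if (author != lastauthor) {
                lastauthor = author
                printf "pick %s %s\n", $2, $3
            } else {
                printf "squash %s %s\n", $2, $3
            }
        }'
}

# Made-up todo list (assumed shape, not real output from git).
sample_todo='pick 1111111 b@example.com
pick 2222222 a@example.com
pick 3333333 a@example.com
pick 4444444 c@example.com

# Rebase 0000000..4444444 onto 0000000 (4 commands)'

# The third line turns into "squash"; the blank and comment lines are dropped.
printf '%s\n' "$sample_todo" | rewrite_todo
```

Only the second consecutive a@example.com line is turned into a squash; every change of author starts a new pick.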
Luciano
  • +1. This question thread ended up being suggested as a duplicate of this new question: https://stackoverflow.com/q/73351711/1271772 What are your thoughts on it? – Nike Aug 17 '22 at 22:30
  • this is a great solution. how would you handle it if there are merge commits though? @Luciano – SatheeshJM Jul 19 '23 at 16:46
  • @SatheeshJM, my commits were linear, there wasn't any merge. So, I didn't have to deal with it. But, I wonder how people would do about it. Perhaps, jumping (squashing) the whole fork/merge? Idk – Luciano Jul 22 '23 at 11:28
2

There is no built in way to do this.

As nologin effectively noted in a comment, if you achieve the desired set of commits, you have a new history, incompatible with the original history. If that's OK, there is a process—not built in, but not extremely difficult—by which you can achieve the desired set of commits. First, though, be sure about what you want.

Edit: the rest of this applies only to the question as originally phrased. The caveat below does not apply to the updated question, which now says that the commits are in fact linear. See Luciano's answer for a nice way to use git rebase -i with a few simple tools to achieve the desired result.

You describe the commits as linear, and they might actually be linear, but they might not. They will be linear in some areas. But commits form a Directed Acyclic Graph or DAG. This graph is the history in the repository. In those parts where it is linear, it's pretty simple:

... <-F <-G <-H   <-- master

Here, the branch name master identifies, or points to, commit H. More precisely, the name master stores the hash ID of commit H. Commit H, meanwhile, stores the hash ID of H's parent commit G, which stores the hash ID of its parent F, and so on. By starting at the end and working backwards, git log shows you these commits, and that is the history.
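The backwards walk can be sketched on plain text. The helper below is hypothetical (walk_back is not a git command); it assumes input in the "<commit> <parent>..." shape that git rev-list --parents prints, and uses the letters F, G, H from the drawing in place of real hash IDs:

```shell
# Hypothetical helper: given "<commit> <parent>..." lines on stdin, print the
# chain from the given tip back to the root, following only the first parent.
walk_back() {
    awk -v tip="$1" '
        { parent[$1] = $2 }    # remember the first parent of each commit
        END {
            # Start at the tip and keep stepping to the parent until none is left.
            for (c = tip; c != ""; c = parent[c]) print c
        }'
}

# Made-up linear history: H points to G, G points to F, F has no parent.
printf '%s\n' 'H G' 'G F' 'F' | walk_back H
```

On a real repository the input would come from something like git rev-list --parents master, and the tip would be the hash ID stored in the branch name.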

Some commits, however, are merge commits. Such a commit has two (or more, but usually just two) parents. We can draw them this way:

       I--J
      /    \
...--H      M   <-- dev
      \    /
       K--L

Here the branch name dev points to commit M, but M points back to both J and L. J points back to I; L points back to K; and I and K both point back to the commit from which the two sub-branches within the branch formed, namely commit H (to which the name master presumably points: commits H and earlier are on both master and dev).

If commits I, L, and M are all made by author BBBB, but J and K are by author AAAA, what do you intend to do here? If you keep M (by BBBB), and keep J because it's by a different author AAAA, you must also keep L even though it's by BBBB. However, if all of I-J and K-L and M are by AAAA, you might choose to collapse them all into a single commit whose parent is H:

...--H--M'  <-- dev

So it's your job to figure out which commits you want to keep, and what you want to do about merge commits. You must keep merge commits if you need to keep the structure (the fork-and-merge at H and M). If you want to eliminate the branch-and-merge structure, you must discard merge commits, but then you must figure out what to do with oddball commits like I and L if they're by some other author.
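Before deciding, it helps to know whether the history contains any merge commits at all. The sketch below is a hypothetical helper (list_merges is a made-up name) that assumes input in the "<commit> <parent>..." shape printed by git rev-list --parents, and reports every commit with more than one parent. The sample data mirrors the H, I, J, K, L, M drawing above rather than real hash IDs:

```shell
# Hypothetical helper: reads "<commit> <parent>..." lines on stdin and prints
# every commit that has two or more parents, i.e. every merge commit.
list_merges() {
    awk 'NF > 2 { print $1 }'
}

# Made-up graph from the drawing: M merges J and L, everything else is linear.
printf '%s\n' 'M J L' 'J I' 'I H' 'L K' 'K H' 'H' | list_merges
```

On a real repository, git rev-list --parents dev | list_merges would do the same job, and git rev-list --merges dev should give the list directly. An empty result means the branch is a straight chain and the simpler rebase approach applies.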

Whatever you decide, when you're finally done, the way to achieve the result you want is:

  • Start with a list of all commits (by hash ID) that you wish to retain and/or all commits that you wish to discard. (Either suffices, since we'll assume that you're going to hold the universe of All Commits steady while you do this—i.e., not add new commits to the repository while you're computing these lists and making changes to the repository.)

  • Then run git filter-branch. Choose at least the --commit-filter. You may want additional filters, depending on what other history-data you're intent on discarding here. (For instance, each commit has a log message: do you want to combine all the log messages, or throw away the ones from commits whose snapshot you're throwing away? That is what you're doing: you're producing a fictional history. You can make up as much of it as you like, keeping only whatever you like from the original history, discarding the rest. What you keep and what you discard is up to you. Your new repository is incompatible with the old repositories: changing even a single bit anywhere in history renders the remaining history invalid and incompatible. So you might as well go as far as you like: it's really all-or-nothing!)

    In your commit filter—read the git filter-branch documentation for details—use skip_commit to skip the commits you don't want and git commit-tree "$@" to make the commits that you wish to keep. To decide, just see if $GIT_COMMIT is in the keep or discard list.
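For a linear history, the two steps above could be sketched as follows. Everything named here is an assumption for illustration: last_of_each_run and rewrite_with_keep_list are hypothetical helpers, the KEEP_LIST variable is made up, and the sample hashes and authors are the ones from the question. The sketch keeps the last commit of each run of the same author, so that commit's snapshot, author, and log message survive, while the messages of skipped commits are discarded:

```shell
# Hypothetical helper: reads "<hash> <author-email>" lines, oldest first,
# and prints the hash of the last commit in each run of the same author.
last_of_each_run() {
    awk '
        NR > 1 && $2 != prev_author { print prev_hash }
        { prev_hash = $1; prev_author = $2 }
        END { if (NR > 0) print prev_hash }'
}

# Hypothetical wrapper around git filter-branch; defined but not run here.
# $1 = absolute path to the keep list, $2 = branch to rewrite.
rewrite_with_keep_list() {
    KEEP_LIST=$1
    export KEEP_LIST    # assumption: the filter sees exported variables
    git filter-branch --commit-filter '
        if grep -qxF "$GIT_COMMIT" "$KEEP_LIST"; then
            git commit-tree "$@"
        else
            skip_commit "$@"
        fi' -- "$2"
}

# Made-up input mirroring the example in the question (oldest first).
printf '%s\n' '01 B' '02 A' '03 C' '04 C' '05 A' '06 A' '07 A' '08 B' '09 B' |
    last_of_each_run
```

On a real repository the keep list might be produced with git log --reverse --format='%H %ae' master | last_of_each_run > /tmp/keep-list.txt and then passed to rewrite_with_keep_list /tmp/keep-list.txt master. An absolute path is used because the filter may not run in the directory you started from. Try it on a throwaway clone first.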

The filter-branch command will take care of enumerating each commit, one at a time, in the correct order so that you can emit or exclude the commit from the history you're creating as you go. After it has invoked your commit filter on each such commit, it will write the hash ID of the last copied commit into the branch name. The original history is now effectively gone (but still findable via the refs/original/refs/heads/branch name; this name won't be in any new clones, and you can discard it when you're ready; again, see the documentation).

torek
  • I'm not sure if this answers the question, but recently someone's suggested this thread as a duplicate of this one: https://stackoverflow.com/q/73351711/1271772. Do you happen to know the answer to that question? – Nike Aug 17 '22 at 22:29
  • [Luciano's answer](https://stackoverflow.com/a/55481779/1256452) works for simple graphs, and does in fact show how to do what's in your linked question. The comments to this question, plus my answer above, show how the existing accepted answer only works for some simple cases (though this may well suffice for your own case). – torek Aug 18 '22 at 07:39
  • It does answer the question as well. However, the original question didn't mention the absence of merges, although in fact it was a straight chain of commits. I am adjusting the question to state that there are no merges in the given case. – Luciano Aug 25 '22 at 15:50