How can I rebase with only commits that affect certain files?

Question

Let's say I have 100 commits in a repository, and I want to rebase only the commits that affect three files: foo, bar, baz. How can I do this?

In theory those commits could also affect other files. I doubt that's the case here, but I'm OK with it if they do.

Are you asking how to rebase a branch and include only commits that touch those files, or somehow rebase only those commits, leaving the others untouched? — bk2204, Sep 18 '20 at 21:28

score 0 · Answer 1 · answered Sep 18 '20 at 22:48

There's not quite enough information in your question to provide a recipe, but on the other hand, I specialize in teaching how to cook, not just how to follow a recipe.

We start with the ingredients: a repository is made up of commits. You've proposed that there are 100 commits. But that's not enough: we need to know how these commits are arranged.

Remember that a commit is itself made up of two parts. The two parts cannot be divided—or to put it another way, you can break them up, but as soon as you do so, what you have is not a commit any more. The two parts to each commit are:

a snapshot: all the files that Git knew about, at the time you (or whoever) made the commit, frozen in time in the form they had at that point; and
some metadata: information such as the name and email address of the person who made that commit.

Each commit is numbered, with a unique and random-looking hash ID. The number isn't actually random—it's really a cryptographic checksum of the contents of the commit (both parts)—but it's unpredictable, so we, or Git, just have to know the number to find the commit. A repository contains all of its commits (and other Git objects) in a big database: a key-value store in which the keys are these hash IDs. Given the hash ID key for a commit, Git looks up the value in the database, and gets both the snapshot and the metadata.¹ Note that because the hash ID of a commit is a checksum of its data, not even Git itself can change this data. Any commit, once made, is frozen for all time. It has taken up its unique number: no other commit can have this number.²

The snapshot is just that. It's not a set of changes! To get changes, Git has to compare two snapshots, and play a game of spot the difference. Whatever is different between the two snapshots, that's what Git tells you about.

The metadata, on the other hand, is a mix of stuff-for-you—such as the name of the commit's author and committer (possibly two different people)—and stuff that Git needs itself. One of these for-Git parts is that each commit stores the hash ID of the commit that comes before this commit, which Git calls the parent of the commit. By chaining commits together like this—backwards—Git achieves something remarkable.

¹Technically, the snapshot itself is stored as a tree object, and the commit object's metadata just includes the hash ID key for the tree object. This means that if two different commits store the same tree, there's only one actual snapshot. This is the same de-duplication technique that Git uses for individual files (blob objects in the big database).

²This is why commit hash IDs are so big. The 160 bits in an SHA-1 hash generally reduce the chance of accidental hash collision well below the chance of undetected storage-media errors even for repositories with millions of commits. Deliberate, or calculated, collisions are another matter, though.
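To look at both parts of a commit yourself, here is a minimal sketch that builds a throwaway repository and prints a raw commit object. The temporary directory, the file name foo, and the identity values are made up for illustration; only the git commands themselves matter.

```shell
set -e
# Throwaway repository; path, file name, and identity are illustrative
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo hello > foo
git add foo
git commit -q -m "first"
git commit -q --allow-empty -m "second, same files"

# Raw commit object: a tree line (the snapshot), a parent line,
# author and committer lines, then the log message
git cat-file -p HEAD

# The parent line holds the hash ID of the previous commit
recorded_parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')
echo "parent recorded in commit: $recorded_parent"

# Same files in both commits, so both commits name the same tree object
first_tree=$(git rev-parse 'HEAD~1^{tree}')
second_tree=$(git rev-parse 'HEAD^{tree}')
echo "first tree:  $first_tree"
echo "second tree: $second_tree"
```

The two tree hash IDs should match even though the two commit hash IDs differ, which is the de-duplication described in footnote 1.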

How commit chains become branches

Let's draw a bit of a small chain, using uppercase letters to stand in for each commit:

... <-F <-G <-H

If we know the hash ID H of the last commit in the chain, we can have Git look up that commit for us. That gets a snapshot, and the metadata that gives us (or Git) the hash ID of earlier commit G. So now Git can find commit G, and compare the two snapshots, to tell us what changed going from G to H.

Using the metadata for G, Git can now find the hash ID of earlier commit F. So Git can show us commit G, then move back to F. Meanwhile, F contains the hash ID of its parent. So this just repeats, over and over, until we get back to the very first commit ever.
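The backwards walk can be sketched in a scratch repository. This is only an illustration: the file names are invented, and HEAD is used here simply as a convenient way to name the last commit.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Three commits in a row, named after the drawing: F, then G, then H
for name in F G H; do
    echo "$name" > "file-$name"
    git add "file-$name"
    git commit -q -m "$name"
done

# Start from the last commit and follow the parent links backwards
git --no-pager log --format='%h %s (parent: %p)'

# "What changed in H" means: compare the snapshot in G with the one in H
changed=$(git diff --name-only HEAD~1 HEAD)
echo "changed going from G to H: $changed"
```

The log lists H first and F last, and the parent field for F is empty because F is the very first commit in this small repository.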

But there's still one hitch: How do we know the hash ID for commit H? This is where branch names enter our picture. A branch name in Git simply holds the hash ID of the last commit in the chain:

...--F--G--H   <-- master

Now, suppose that at this point, we make a new name topic that also contains the hash ID H:

...--F--G--H   <-- master, topic

All the commits are now on both branches, and no matter which name we use, we're using commit H itself. But we need to know which name we're using, so we have Git attach the special name HEAD to one of these two names:

...--F--G--H   <-- master (HEAD), topic

If we now run git checkout topic or git switch topic, we're telling Git: we'd like to use the name topic now, so Git will snip the HEAD off master and attach it to topic instead. If these two names identified different commits—they will in a moment—Git would have to do more work, but right now, that's all it has to do:

...--F--G--H   <-- master, topic (HEAD)

Now we make some changes to some files, use git add, and run git commit. Git makes a new commit—by a process whose details we'll ignore here—which gets a new, unique, big ugly hash ID; we'll call this commit I. The parent of new commit I will be existing commit H, because that's the commit that the name topic identifies right now. So we get:

...--F--G--H   <-- master
            \
             I

The tricky bit here is that as the last step of git commit, Git writes I's hash ID into the name topic (because that's the name to which HEAD is attached). So instead of pointing to commit H, the name topic now points to commit I:

...--F--G--H   <-- master
            \
             I   <-- topic (HEAD)

HEAD is still attached to topic, but now the current commit is commit I: the one we just made. If we make a second new commit, and call it J, we can draw it like this. I'll move this to above master for a reason that's not obvious yet:

             I--J   <-- topic (HEAD)
            /
...--F--G--H   <-- master

Now let's git checkout master or git switch master to get HEAD attached to master, and to make H the current commit. We'll see the stuff we did in topic disappear (it's still in the repository, it just is not in our work-tree area any more) and we now have:

             I--J   <-- topic
            /
...--F--G--H   <-- master (HEAD)

Now let's make some new commits on master, and for no obvious reason, draw them like this:

             I--J   <-- topic
            /
...--F--G--H
            \
             K--L   <-- master (HEAD)

If we like, we could make a new branch name first, so that we end up with:

             I--J   <-- topic
            /
...--F--G--H   <-- master
            \
             K--L   <-- topic2 (HEAD)

instead.

The key concept here is that the branch name always identifies the last commit. When we use git checkout or git switch to pick a branch name, Git gets that commit out as the current commit and makes that name the current name. The special name HEAD works to keep track of both of these, because the name HEAD is now attached to the branch name, and the branch name selects the right commit. When we make a new commit, Git updates the branch name by using HEAD to find the name, and once again, the branch name identifies the last commit.
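Here is a small sketch of that behavior, assuming a scratch repository with one commit standing in for H. The file name and commit messages are invented for the example.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master, as in the drawings
git config user.name "Example User"
git config user.email "user@example.com"

echo one > file
git add file
git commit -q -m "H"

git branch topic                  # a second name for the same commit
git rev-parse master topic        # prints the same hash ID twice

git checkout -q topic             # attach HEAD to the name topic
git symbolic-ref --short HEAD     # prints: topic

echo two >> file
git commit -q -a -m "I"

# Only the name that HEAD is attached to has moved
echo "master:         $(git rev-parse master)"
echo "topic:          $(git rev-parse topic)"
echo "topic's parent: $(git rev-parse 'topic^')"
```

After the second commit, master still names H, topic names the new commit I, and the parent of I is H.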

Copying a commit

Suppose we have some set of commits that we might draw like this:

...--G--H-----K   <-- main
         \
          I--J   <-- topic

When we made J, we did so by fixing some major bug, so we'd like to get commit J's changes into the main branch. But commit J itself uses commit I as a starting point, and commit I introduces a new feature that's not ready for release yet. So we just want to copy J's change.

We already know that Git can and does use the parent links to show us changes. That is, Git can easily compare the snapshot in J vs the snapshot in its parent I. This is what we'd like to do to the snapshot in commit K, too.

The Git command that gets these changes from some other commit, and puts them where we are now, is git cherry-pick. So to make a fix to commit K as a new commit, we will:

git checkout main
git cherry-pick topic

The first step will make K the current commit, attaching HEAD to the name main. The second one will compare the snapshot in I to that in J, and apply those changes to the current commit.³ Then—provided all went well, anyway—git cherry-pick will make a new commit from the result of applying those changes:

...--G--H-----K--J'  <-- main (HEAD)
         \
          I--J   <-- topic

I've called this commit J' because it is, in a useful sense, a copy of commit J. Git even re-uses the commit message from commit J here. The new commit has some different date-and-time stamps, and a different parent hash ID—commit J' points back to commit K, not to commit I—so it has a different hash ID from original commit J, but it's reasonable to call it a copy of J.

³Technically, Git uses the full power of the git merge machinery to do this. We'll ignore this fact here.
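The drawing above can be reproduced in a scratch repository. This is a sketch under assumptions: the commit_file helper, the file names, and the commit messages are all invented here, and each commit touches its own file so that the copy applies cleanly.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: append a line to a file and commit the result
commit_file() {
    echo "$2" >> "$1"
    git add "$1"
    git commit -q -m "$2"
}

commit_file base.txt H
git checkout -q -b topic
commit_file feature.txt "I: unfinished feature"
commit_file fix.txt "J: bug fix"
git checkout -q main
commit_file other.txt K

# Copy only the tip commit of topic (J) onto main
git cherry-pick topic >/dev/null

git --no-pager log --format='%h %s' -3
ls
```

The new tip of main carries J's message and J's fix.txt, has K as its parent, and has a hash ID different from J's. The unfinished feature.txt from I does not come along.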

Rebasing is mainly repeated copying

Now that we know how git cherry-pick works, consider what happens when we have this:

...--H-----K   <-- main
      \
       I--J   <-- feature (HEAD)

Here, we made commits I and J to work on some new feature. While we were doing that, someone discovered a dreadful bug in the mainline, and put in a quick fix via commit K. We'd like to use that same fix.

We could copy commit K to a new commit K', but we've just started this new feature, and nobody else has commits I-J yet. Instead of copying K, what if we copy I to a new I' and J to a new J', with these two new copies coming after K, so that they're based on the fixed main at commit K, instead of the broken one at commit H? That is, we'd like to get:

             I'-J'  <-- feature (HEAD)
            /
...--H-----K   <-- main
      \
       I--J   [abandoned]

It's safe for us to leave the original I-J behind, because nobody else has them: they are only in our repository. Since we did not give them to anyone else, nobody else could possibly have them. Anyone who made something similar got a different hash ID, because their name and/or email address will be different, and they will have made their commit at a different time, down to the exact second.

The command that does this for us is git rebase. It:

lists out hash IDs of commits to copy;
uses Git's internal detached HEAD mode to copy those commits, one by one;
yanks the original branch name off the last commit of the original branch, and makes it point to the last commit it just copied; and
re-attaches HEAD to the now-moved, now-made-up-of-copies branch.

It does all of this as one apparently-seamless thing, provided everything goes well and we don't use git rebase -i and tell it to stop. If things go badly, or we use git rebase -i and use instructions like edit, we get to see the effects of the individual steps.
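As a rough illustration of those steps, the following sketch rebuilds the H, I-J, K picture in a scratch repository and rebases feature onto main. The helper function, file names, and messages are assumptions made for the example.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: append a line to a file and commit the result
commit_file() {
    echo "$2" >> "$1"
    git add "$1"
    git commit -q -m "$2"
}

commit_file base.txt H
git checkout -q -b feature
commit_file i.txt I
commit_file j.txt J
old_tip=$(git rev-parse feature)     # hash ID of the original J
git checkout -q main
commit_file k.txt K

git checkout -q feature
git rebase -q main                   # copy I and J so that they come after K

# The copies, newest first: J then I
git --no-pager log --format='%h %s' main..feature

# The original J is abandoned, not destroyed: the object is still there
git cat-file -t "$old_tip"           # prints: commit
```

After the rebase, the name feature points to the copied J', whose ancestry now runs through K, while the original I-J pair stays in the repository with no branch name pointing at it.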

Back to your question

I want to rebase only the commits that affect three files ...

Your first step, or at least an early one, will be to list out the commits you'd like to copy. You can use git log -- path/to/file1 path/to/file2, for instance, to find commits where, between the commit's parent and that particular commit, Git sees those paths as being modified.
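For instance, here is a minimal sketch of that listing step, using the file names foo, bar, and baz from the question. The other file names, the commit messages, and the commit_file helper are invented for the example.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: append a line to a file and commit the result
commit_file() {
    echo "$2" >> "$1"
    git add "$1"
    git commit -q -m "$2"
}

commit_file foo    "add foo"
commit_file README "add docs"
commit_file bar    "change bar"
commit_file notes  "add notes"
commit_file baz    "change baz"

# Only the commits in which foo, bar, or baz differ from the parent
git --no-pager log --format='%h %s' -- foo bar baz

# The same selection as bare hash IDs, oldest first, ready for copying
wanted=$(git rev-list --reverse HEAD -- foo bar baz)
echo "$wanted"
```

Of the five commits, three are listed. The --reverse form is handy later, because commits usually need to be copied oldest first.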

Your second step (well, probably second) will be to decide what to do about the fact that you're making copies, but only of some commits. Where should you put these copies? What should you do with the original chain of commits? Suppose your graph looks like this:

...--G--H--I--J   <-- master
      \
       K--L--M--N--O   <-- feature

Suppose further that commits L and N are the two that you'd like to copy.

You must decide where the copies go. Do they go after J? Or do they go after G, or after some other commit? Pick a place and create a branch name there, or use Git's detached-HEAD mode (but if things go wrong, having a branch name handy already is nice, so I'd generally recommend using a branch name). Create a new branch that identifies the chosen commit. If that's commit G, for instance:

git checkout -b new-branch <hash-of-G>

will do the trick, giving you:

       H--I--J   <-- master
      /
...--G   <-- new-branch (HEAD)
      \
       K--L--M--N--O   <-- feature

You can now run git cherry-pick <hash-of-L> <hash-of-N> to get:

       H--I--J   <-- master
      /
...--G--L'-N'  <-- new-branch (HEAD)
      \
       K--L--M--N--O   <-- feature
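Putting the listing step and the copying step together gives something like the sketch below. It assumes that L and N are the commits touching foo and bar, that every commit touches its own file (so nothing conflicts), and it leaves out the H-I-J commits on master because they play no part. The helper and file names are invented.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: append a line to a file and commit the result
commit_file() {
    echo "$2" >> "$1"
    git add "$1"
    git commit -q -m "$2"
}

commit_file base.txt G
base=$(git rev-parse HEAD)            # stands in for <hash-of-G>
git checkout -q -b feature
commit_file k.txt K
commit_file foo   L                   # touches foo
commit_file m.txt M
commit_file bar   N                   # touches bar
commit_file o.txt O

git checkout -q -b new-branch "$base"

# Commits after G on feature that touch foo, bar, or baz, oldest first
picks=$(git rev-list --reverse "$base"..feature -- foo bar baz)

# $picks is deliberately unquoted: one argument per hash ID
git cherry-pick $picks >/dev/null

git --no-pager log --format='%h %s' "$base"..new-branch
```

In a real repository the selected commits may depend on changes made by the skipped ones, in which case git cherry-pick stops with a conflict for you to resolve.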

Now decide what is to become of the name feature. Should this retain commits L and N? Or do you want to copy M to an M' whose parent is K, then copy O to an O' whose parent is M'? The latter would give you:

       H--I--J   <-- master
      /
...--G--L'-N'  <-- new-branch (HEAD)
      \
       K--L--M--N--O   ???
        \
         M'-O'   ???

I've left the branch names out here because you must also decide what to do with those. Note that you can construct this new sequence by running git rebase -i while on feature and changing the pick commands to drop commands for commits L and N, which you copied earlier.

You might perhaps choose just to abandon the original K-L-M-N-O sequence entirely; in that case, you can delete the branch name feature. Or you could, earlier, have used git rebase -i while on feature, dropped all commits except the ones you want to keep, and let the automation that git rebase -i provides do the copying of the commits you want retained (as new copies).
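Normally you would edit the instruction sheet by hand when git rebase -i opens your editor. To show the effect without an interactive editor, the sketch below points GIT_SEQUENCE_EDITOR at a tiny script that changes the pick lines for L and N into drop lines. The script name, the helper, and the file names are all invented, and the sed patterns assume commit subjects that are exactly the single letters used here.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: append a line to a file and commit the result
commit_file() {
    echo "$2" >> "$1"
    git add "$1"
    git commit -q -m "$2"
}

commit_file base.txt G
base=$(git rev-parse HEAD)            # stands in for <hash-of-G>
git checkout -q -b feature
commit_file k.txt K
commit_file foo   L
commit_file m.txt M
commit_file bar   N
commit_file o.txt O

# Stand-in for hand editing: rewrite "pick" to "drop" for L and N
cat > .git/drop-l-and-n.sh <<'EOF'
sed -e '/ L$/s/^pick /drop /' -e '/ N$/s/^pick /drop /' "$1" > "$1.new"
mv "$1.new" "$1"
EOF

GIT_SEQUENCE_EDITOR="sh '$repo/.git/drop-l-and-n.sh'" git rebase -q -i "$base"

# feature is now K, then copies of M and O, with L and N left out
git --no-pager log --format='%h %s' "$base"..feature
```

The name feature ends up pointing to the copy of O, and the files that only L and N introduced (foo and bar) are no longer in the work-tree.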

Conclusion

Git provides tools. You have some existing graph—some chains of commits, ending with commits as pointed-to by various branch names—and you'd like to have a new graph: the same as your original graph in some parts, and different in others. The git branch command, or git checkout -b or git switch -c, lets you create a new name, pointing to some existing commit. The cherry-pick command lets you copy one commit, or several. The rebase command, in interactive mode, lets you copy some selected subset of commits, dropping others, with a final step that forces the branch name to point to the last-copied commit.