Building an incremental list of revisions in a git branch

Question

I want to build a list of all revisions that are in a branch (due to having to regularly check things against them). So basically, this is a cache of revisions that branch has. Due to the massive size of the branch, it'd be ideal to incrementally update the cache with only the new commits since the last time we updated the cache. This works well since I incidentally have a way to know when a branch is "dirty".

I can fetch all revisions in chronological order (oldest first) with git rev-list --reverse my-branch. That gives me a simple list of revisions to fill my cache with. I then seem to be able to find the commits added since then with git rev-list --reverse my-branch ^<revision>.

Thing is, I note that if I then run my first command (git rev-list --reverse my-branch) again, I get a different result. The same commits seem to be there, but the order is different. Which makes me wonder if my approach described in the paragraph above is really sufficient. I don't actually care about order; I just want a complete set of revisions in that branch. The only thing I require order for is to know what commit is the last one I had (so I can fill in the <revision> in my second command). I make the assumption that the last commit in the previous list is the most recent.

(I actually ask this in part because I've used such a system for a while, but now I have revisions missing in the cache and wonder if my method of building such a cache is insufficient.)

score 1 · Accepted Answer · answered Dec 07 '17 at 05:10

The main issue is defining the phrase revisions in that branch.

Depending on how your branch grows, it may suffice to use git rev-list --topo-order --reverse ^stop start to get a list of commits that are reachable from the name or hash-ID or other starting point start, but not reachable from the name or hash-ID or other starting-point stop. Then, having done that, you can update the saved hash ID to the hash ID you gave as, or obtained from, start.
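As a minimal sketch of that incremental scheme (not a definitive implementation): the script below builds a throwaway repository under a temporary directory and appends only the newly reachable commits to a cache on each run. The file names revisions.txt and stop.txt, the function update_cache, and the branch name my-branch are all assumptions made for illustration. It also assumes git and mktemp are available and that the branch only ever gains commits.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/my-branch
git config user.name "Example User"
git config user.email "user@example.invalid"

cache="$work/revisions.txt"   # hypothetical cache file: one hash per line
stop_file="$work/stop.txt"    # hypothetical file: the tip we last processed

update_cache() {
    # Resolve the tip once, so the hash we list from is the hash we save.
    start=$(git rev-parse --verify my-branch)
    if [ -s "$stop_file" ]; then
        # Commits reachable from start but not from the saved stop-point.
        git rev-list --topo-order --reverse "$start" "^$(cat "$stop_file")" >> "$cache"
    else
        # First run: everything reachable from start.
        git rev-list --topo-order --reverse "$start" >> "$cache"
    fi
    printf '%s\n' "$start" > "$stop_file"
}

for label in A B; do git commit -q --allow-empty -m "$label"; done
update_cache
first_pass=$(wc -l < "$cache" | tr -d ' ')

for label in C D; do git commit -q --allow-empty -m "$label"; done
update_cache
total=$(wc -l < "$cache" | tr -d ' ')

echo "first pass cached $first_pass commits; cache now holds $total"
```

Saving the resolved hash rather than the branch name matters: if the branch moves between listing and saving, the next run would otherwise skip commits.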

Long description

Many people like to imagine Git branches working something like this graphic:

master:  A--B--C
                \
develop:         D--E

Here there are five commits in the repository, and they think of the first three commits—we label commits A through Z rather than by big, ugly, incomprehensible hash IDs here—as "belonging to" branch master, with commits D and E "belonging to" branch develop.

But that's not how Git branches actually work. The commits do have internal arrows connecting them, but these arrows are all backwards. They start at the right and work left. These internal arrows come out of each commit and point back to the commit's parent (or for a merge commit, two or even more parents). In fact, rather than arrows, each commit stores the raw hash ID of its parent (or parents, in case of merge commits). The pointers are thus attached to the children—or more accurately, embedded within them and a permanent and unchangeable part of their identity.

(The actual raw hash ID of each commit is determined by computing a cryptographic hash of the contents of the commit, including the spelled-out parent hash or hashes. This is what makes it impossible to change anything about any commit, ever: if you change even a single bit, the result is a new, different hash, for a new, different commit.)
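You can check this for yourself. The sketch below, which assumes a throwaway repository in a temporary directory, re-hashes the raw contents of a commit and compares the result with the ID Git stored, then rewords the commit message and shows that the ID changes. The variable names are just for illustration.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Example User"
git config user.email "user@example.invalid"
git commit -q --allow-empty -m "A"

stored=$(git rev-parse HEAD)
# Feed the commit's raw contents (tree, author, committer, message)
# back through Git's object hashing, as an object of type "commit".
recomputed=$(git cat-file commit HEAD | git hash-object -t commit --stdin)

# Change nothing but the message: the result is a different commit.
git commit -q --amend --allow-empty -m "A, reworded"
reworded=$(git rev-parse HEAD)

echo "stored:     $stored"
echo "recomputed: $recomputed"
echo "reworded:   $reworded"
```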

Meanwhile, names like master and develop serve as moveable arrows, pointing to one specific commit. So the drawing really should look like this:

A--B--C   <-- master
       \
        D--E   <-- develop

The name master points to commit C, and the name develop points to commit E. Commit E points back to D; D points back to C; C points to B; and B points to A. Since commit A is the very first commit ever made, there's nowhere for it to point—so it doesn't, which makes it a root commit.

In the end, this means that all the commits (in this five-commit repository) are on develop; three of those are also on master.
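To see that counting for real, here is a small sketch that recreates the five-commit drawing in a throwaway repository (empty commits stand in for real work; the variable names are hypothetical) and asks git rev-list how many commits are reachable from each name.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.invalid"

for label in A B C; do git commit -q --allow-empty -m "$label"; done
git checkout -q -b develop
for label in D E; do git commit -q --allow-empty -m "$label"; done

on_master=$(git rev-list --count master)               # A, B, C
on_develop=$(git rev-list --count develop)             # A through E
only_develop=$(git rev-list --count develop "^master") # D and E

echo "reachable from master:  $on_master"
echo "reachable from develop: $on_develop"
echo "on develop, not master: $only_develop"
```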

How branches grow, part 1

Now, the typical process for adding commits to a branch is:

git checkout <name>
... do some work ...
git add -u   # or similar, to copy new versions back into Git's index
git commit

The first step, git checkout name, extracts the contents of the commit to which the given branch name name points. These contents go into Git's index, and also into your work-tree. Git then sets the name HEAD to record the name name. (Let's say name is develop and we're in this five-commit repository.)

You now do your work as usual, then use git add to copy updated files back into the index. Many people think the index is empty until they git add something to it, but that's not the case either. (The --allow-empty flag of git commit is rather misleading: what it tests is not whether the index itself is empty, but whether the diff from HEAD to the index is empty.)

The index is a complicated beastie, and hard to see directly, but the best short description is that it's where you build up the next commit to make. It starts out with all the same files as the work-tree (but in Git-ized internal form), matching the HEAD commit you just checked out. You alter these internal-form copies by using git add to copy new versions from the work-tree. The git commit command then packages up the index contents as a new source snapshot, collects a commit message from you, and writes out a new commit that has:

the snapshot made from the index (technically a tree line);
your name and email address as author and committer, and "now" as the time-stamps for these two items;
the current commit's hash ID as the new commit's parent; and
your commit message as the new commit's message.

You can see these by running git cat-file -p HEAD (try it!).

Having written this out, there is now a sixth commit:

A--B--C   <-- master
       \
        D--E   <-- develop (HEAD)
            \
             F

The new commit points back to the current commit. The final step in making this commit appear on the branch is to move the branch pointer, by writing the new commit's hash ID into the branch whose name is stored in HEAD. Since that's develop, the result is:

A--B--C   <-- master
       \
        D--E
            \
             F   <-- develop (HEAD)

(and now there's no reason to put F on a separate line from E; I just kept it that way to make it more obvious what's happening).
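The two steps described above (write a commit whose parent is the current commit, then move the name stored in HEAD) can be observed directly. This is only a sketch in a throwaway repository, with empty commits and hypothetical variable names.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.invalid"

for label in A B C; do git commit -q --allow-empty -m "$label"; done
git checkout -q -b develop
for label in D E; do git commit -q --allow-empty -m "$label"; done

old_tip=$(git rev-parse develop)        # commit E
git commit -q --allow-empty -m "F"
new_tip=$(git rev-parse develop)        # commit F

# HEAD still holds the branch name; the name now holds the new hash.
head_target=$(git symbolic-ref HEAD)
# The parent line inside F records E's hash.
recorded_parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')

git cat-file -p HEAD
```

The last command prints the tree, parent, author, and committer lines followed by the message, which are the four items listed earlier.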

How branches grow, part 2

Now, branches need not grow so simply. For instance, suppose we have our six commits A through F so far. We then run git checkout master and create a new commit:

A--B--C------G   <-- master (HEAD)
       \
        D--E--F   <-- develop

and then, having done that, we run git merge develop.

Git will now compare commit C (the merge base of the two branches) to both tip commits. HEAD names commit G and develop names commit F, so Git in effect runs git diff --find-renames C G to see what we did, and git diff --find-renames C F to see what they (whoever they are) did on develop.

Git now combines these two sets of changes and applies the combined changes to commit C. If all goes well—if the changes don't seem to conflict, at least as far as Git is smart, which is not very far at all—Git will make a new commit from the result. This new commit has not one but two parents, and we can draw it like this:

A--B--C------G--H   <-- master (HEAD)
       \       /
        D--E--F   <-- develop

At this point, suddenly commits D-E-F are all on master. They are reachable from commit H, to which the name master points.
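Here is a sketch of that jump in reachability, again in a throwaway repository. The commit helper (one new file per commit, plus a tag named after the label) is a convenience invented for this example, not anything Git provides.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.invalid"

# Hypothetical helper: one new file per commit, tagged with its label.
commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
    git tag "$1"
}

commit A; commit B; commit C
git checkout -q -b develop
commit D; commit E; commit F
git checkout -q master
commit G

base=$(git merge-base master develop)    # commit C
before=$(git rev-list --count master)    # A, B, C, G
git merge -q -m "H" develop              # creates merge commit H
after=$(git rev-list --count master)     # now includes D, E, F and H

echo "reachable from master before merge: $before"
echo "reachable from master after merge:  $after"
```

One git merge added four commits to what master reaches, which is exactly the kind of growth an incremental cache has to cope with.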

This is the first secret to `git log` and `git rev-list`

Both git log and git rev-list work by finding some starting point: some first (or last, really) commit, usually the tip of some branch. You can specify any one particular commit as a starting point, by giving a branch name, or its raw hash ID, or any of a huge number of other special syntaxes¹ (these are listed in the gitrevisions documentation), and the command will use that commit to find a parent commit, and then use the parent to find another parent, and so on.

The git log command defaults to looking at HEAD, while git rev-list, which is aimed at scripts, has no default: you must explicitly name HEAD if that's what you want. In this case, if we start the commands with commit H, they will look at H (printing out its hash ID and perhaps some other information about it), then look at its parent.

But commit H has two parents, not just one. So git log or git rev-list will now look at both commits, G and F, "simultaneously".

They can't actually show you both simultaneously, so they linearize the listing. The exact linearization method depends on the sorting options you specify. The default, whenever there is more than one commit in the queue to be shown, is to show whichever commit has the latest committer date, but if you specify --topo-order, the command will be sure not to interleave two different sub-branches: if it goes next to commit F, it will go all the way down to D before showing G.

(You might wonder how Git could pick F next instead of G. Well, we're assuming for the moment that G was made later, so it wouldn't—but what if the computer clock was wrong when we made one of them? Or what if G was made first, and we've just labeled it weirdly?)
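To compare the two linearizations, the sketch below forces specific committer dates so that G falls between the dates of the commits on the other fork. The commit helper and the timestamps are made up for the example. It prints the commit subjects in the default order and in --topo-order; the exact default order depends on those dates, while --topo-order should keep F, E and D together.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.invalid"

# Hypothetical helper: commit <label> <unix-timestamp>
commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    GIT_AUTHOR_DATE="$2 +0000" GIT_COMMITTER_DATE="$2 +0000" \
        git commit -q -m "$1"
}

commit A 1500000001; commit B 1500000002; commit C 1500000003
git checkout -q -b develop
commit D 1500000004
git checkout -q master
commit G 1500000005          # dated between D and E
git checkout -q develop
commit E 1500000006; commit F 1500000007
git checkout -q master
GIT_AUTHOR_DATE="1500000008 +0000" GIT_COMMITTER_DATE="1500000008 +0000" \
    git merge -q -m "H" develop

by_date=$(git log --format=%s master | tr -d '\n')
by_topo=$(git log --topo-order --format=%s master | tr -d '\n')

echo "default order: $by_date"
echo "--topo-order:  $by_topo"
```

Both listings contain the same eight commits; only the order differs, which is why order alone is a shaky basis for deciding which commit was "the last one I had".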

Since every commit is reachable from H (by starting there and working backwards along both forks), git log will show every commit, by default. To make it stop early, you can specify a stopping point: it will avoid showing that commit and any commit reachable from that commit, in the same fashion. So if we tell it not to show commit E, it won't show E, nor will it show D, nor any of C or B or A either. This won't stop it from showing G though: G is not reachable from E. Reachability requires going backwards, through the backwards links that Git stores.
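A short sketch of that stopping-point behavior, using the same eight-commit graph in a throwaway repository. The commit helper and its tags are conveniences invented here so that E can be named directly.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.invalid"

# Hypothetical helper: one new file per commit, tagged with its label.
commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
    git tag "$1"
}

commit A; commit B; commit C
git checkout -q -b develop
commit D; commit E; commit F
git checkout -q master
commit G
git merge -q -m "H" develop

# Start at master (commit H), stop at E and everything reachable from E.
shown=$(git log --format=%s master "^E" | sort | tr -d '\n')
count=$(git rev-list --count master "^E")

echo "commits shown (sorted by label): $shown"
echo "count: $count"
```

G survives the exclusion because there is no backwards path from E to G.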

Adding --reverse simply tells the command to output the final list in reverse order (which, since the natural order is already backwards, reverses the backwards-ness into forwards-ness). Git still has to generate the list backwards, though: there is no easy way to go from a commit to its children. Commits know all their parents, but no commit knows any of its children.

¹Not "syntagma", although I like that word, which is a real word.

Sometimes, branches are updated violently / forcefully

We can have all this perfectly normal, natural, one commit at a time growth, or even this sudden "name acquires a lot of new reachable commits along with all the ones it had earlier" from merging (or what Git calls fast forwarding), but we can also have wrenching changes.

Suppose, for instance, after merging develop into master, we remove develop entirely:

A--B--C------G--H   <-- master (HEAD)
       \       /
        D--E--F

None of the commits go away at all, because they are all find-able (reachable) from master. But now we can make a new develop, unrelated to the old one. Let's arbitrarily start it at commit G:

              G   <-- develop
             / \
A--B--C-----'   H   <-- master (HEAD)
       \       /
        D--E--F

and add a new commit:

              G--I   <-- develop
             / \
A--B--C-----'   H   <-- master (HEAD)
       \       /
        D--E--F

and maybe relax our drawing a bit:

               I   <-- develop
              /
A--B--C------G--H   <-- master (HEAD)
       \       /
        D--E--F

If we start from this new develop and work backwards, then reverse the list to go forwards, we get commits A--B--C--G--I. Commits D--E--F are no longer in the list at all!

More commonly, but still rather wrenching, we can have "force push" events that deliberately discard commits across a push from one repository to another, or git reset events that discard commits within a repository. In these cases, an old stop-point may become invalid, or at least, not very useful. It's up to whoever is defining what it means to select commits that are "on a branch" to determine what to do here.
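One possible way to notice that a saved stop-point has gone stale is to test whether it is still an ancestor of the branch tip, using git merge-base --is-ancestor. The sketch below replays the delete-and-recreate scenario in a throwaway repository; the commit helper and the status values "incremental" and "rebuild" are hypothetical, and what to do on "rebuild" remains a policy decision for whoever owns the cache.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.invalid"

# Hypothetical helper: one new file per commit, tagged with its label.
commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
    git tag "$1"
}

commit A; commit B; commit C
git checkout -q -b develop
commit D; commit E; commit F
git checkout -q master
commit G
git merge -q -m "H" develop

stop=$(git rev-parse develop)   # what a cache would have saved: commit F

# Throw develop away and start a new, unrelated one at G.
git branch -q -D develop
git branch develop G
git checkout -q develop
commit I

if git merge-base --is-ancestor "$stop" develop; then
    status=incremental          # old stop-point is still in the history
else
    status=rebuild              # history was replaced; the cache is suspect
fi
labels=$(git log --reverse --format=%s develop | tr -d '\n')

echo "status: $status"
echo "new develop contains: $labels"
```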

First parents

In all cases, it's worth thinking about merges, which bring many commits into reachability all at once, and what that means for your task. There's a very important feature of git merge, though, that can be helpful, provided that everyone who runs git merge does so in a properly disciplined manner. This is the first parent notion.

When we did our merge that created commit H, above, we were on the branch named master (the name HEAD contained ref: refs/heads/master, and git status said On branch master). So Git makes sure that the first parent of commit H is commit G, and the second parent of commit H is commit F, the commit to which the name develop pointed at that time.

If we use this first parent notion, we can follow from commit H back to commit G without having Git follow H back to F as well. Then G leads back to C, which leads back through B to A; so our reversed list will be A--B--C--G--H, excluding the merged-in D--E--F entirely.

To get this behavior, simply add --first-parent to your git rev-list or git log command. But note that it depends on this: whoever did the merge that brought in commit F, and hence the whole D--E--F chain, must have done it properly. If users carelessly use git pull,² they will create what some call foxtrot merges, which put the main-line commits in as the second parent instead of the first.
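A sketch of the first-parent walk on the example graph, in a throwaway repository with a made-up commit helper. It assumes the merge was made while on master, as in the drawing.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.invalid"

# Hypothetical helper: one new file per commit.
commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
}

commit A; commit B; commit C
git checkout -q -b develop
commit D; commit E; commit F
git checkout -q master
commit G
git merge -q -m "H" develop     # run on master, so G is the first parent

first=$(git log -1 --format=%s "master^1")
second=$(git log -1 --format=%s "master^2")
main_line=$(git log --first-parent --reverse --format=%s master | tr -d '\n')
everything=$(git rev-list --count master)

echo "first parent of H:  $first"
echo "second parent of H: $second"
echo "first-parent walk:  $main_line"
echo "all reachable:      $everything"
```

Whether the first-parent list or the full reachable set is the right definition of "revisions in that branch" depends on what the cache is checked against.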

²(in Zathras voice) git pull ... is wrong tool. ... Never use this.