how to detect a potentially evil git commit (containing the inverse of some other commit)?

Question

I have several times come across the following situation: Based on A, I push some commit B. Someone else pulls the commit B, but for some reason, he does not revert the buffer of his editor, and thus continues editing A. In the end, he pushes a commit C which reverts B's changes, and introduces some new ones. In this way, the effect of my commit B is lost. Often, I have noticed the problem only by chance (and then I could easily reintroduce B by cherry-picking it). However, I am worried that sometimes I did not notice it at all. Is there an easy way of detecting (the potential of) such a situation? (I am happy with obtaining some false positives.) One somehow needs to detect whether the inverse of B is contained in some other commit.

I have created a test repo that shows such a commit: https://github.com/tillmo/test

This question is related to, but different from How do you detect an evil merge in git?. Note that there, the evil commit is a merge commit, while here it is not.

Sounds like there is a flaw with the way the editors your team is using are configured. I've never come across this issue. — Kenyon, Dec 19 '16 at 21:33
Sure this is an editor problem that should be ultimately solved by some better configuration. However, after having encountered it several times with different, I want to be able to detect it. — tillmo, Dec 19 '16 at 22:25
I don't have a proper answer (yet), but I'd say an iteration over all commits and computing the intersection of their reverse diff (`commit..commit^`) to the diff from its parent (`commit^^..commit^) could find at least most of those cases. — André Sassi, Dec 19 '16 at 23:11
Work in branches and require a pull request that needs review / approval before merging. — crashmstr, Dec 19 '16 at 23:33
@AndréSassi: that's what I proposed, although I like your suggestion to just compute the reverse delta for the initial patch. — torek, Dec 19 '16 at 23:35
Not an answer, because it's not fleshed out enough: What you are looking for is files that are changed into a state that they had before. So, you "simply" need to look for files that are changed by a commit, and which get a SHA1 that they already had in history. I'm not sure what `git` incantations are required to arrive at such a SHA1 check. But it certainly seems feasible to do, and it'll certainly outperform any diff-based approach both in performance and in correctness. — cmaster - reinstate monica, Dec 19 '16 at 23:50
This is a workflow problem that code review should be addressing. — Pockets, Dec 20 '16 at 05:37

score 2 · Answer 1 · edited May 23 '17 at 12:24

It's clearly impossible in general, but it may be easy enough to hit enough cases to make it worthwhile. The following is basically pure theory: there's nothing built in to Git to do this.

Think of commits as snapshots that can be converted to deltas (because they are). Your task is, given a sequence of commits:

...--A--B--C--D--...

(we'll somewhat arbitrarily limit this to linear commits—the mods to the algebra for doing merges are obvious, albeit messy) we're going to compute ∆B, ∆C, and ∆D. ∆B is just B-A, ∆C is C-B, and so on. Of course the result of "subtraction" is a diff-set (or changeset) rather than a simple number. These deltas, however, are what you see when you run git show (or git diff). (You may also want to avoid rename detection here if you try to do this for real and make it robust.)

Next, we want to see whether ∆C "includes" -∆B, or whether ∆D "includes" -∆B. A normal git revert is the negative delta, so we're looking for a changeset that isn't a revert itself, yet still contains a revert.

When we look at the raw form of git diff output, we find that, for strictly modified files, the diff just a lot of boilerplate-y @@ and hunk-context lines around the change, which expressed exclusively as - and + lines: remove old, insert new.

Reversion consists of making the same change in reverse: add the removed lines, while removing the added lines, with the same context, in the same order.

A change that contains a reversion is, at least for the easily detected cases, simply a change that includes the reversion—i.e., adds back removed lines while removing added lines, with the same context, in the same order—while also making some other separate change.

In other words, if ∆C = -∆B+∂C, or ∆D = -∆B+∂D, then we will suspect one of these deltas to be "incorrect": it should have just been ∂C or just ∂D.

The git patch-id program can compute a patch ID for any diff hunk, so to find out whether some big-delta ∆X is equal to the negative of some previous ∆B plus some smaller ∂X, you could just split up the original ∆B into its component hunks, negate each one, and get its patch ID. [Edit: use André Sassi's suggestion—simplify this by computing the reverse diff initially, either by using B B^ as the two inputs to git diff, or using the -R flag, which is available in both git diff and git show.] Put these in a master list L. Then, for any subsequent commit X, starting with l=L, walk through the component hunks of ∆X. Compute that hunk's patch ID, and see if that corresponds to the next value that is in your list. If so, drop one item from l. If you drop the last item from l, you have just found that ∆X includes -∆B. Otherwise, move on to the next hunk in ∆X. If you reach the end of ∆X without dropping all values from l, you have concluded that ∆X does not include -∆B.

Repeat for all suspect commits and you'll see if you can automatically find these "accidental reversions". Note that a deliberate reversion will have nothing but -∆B, and will also have a commit message that says "revert ...".

A minor improvement would be to accept a commit that has most, but not all, hunks in L (like 70% or 80% of them), because sometimes one of these hunks may have been overwritten for some new code - this would be a conflict if B (from @poke answer) did the right thing. Much like git detects a file has been renamed even if there were some changes. — André Sassi, Dec 19 '16 at 23:43
@torek, André Sassi: yes, this is what I meant. So it seems that this has not been implemented so far... — tillmo, Dec 20 '16 at 09:15

score 0 · Answer 2 · answered Dec 19 '16 at 21:40

The simplest you can do is use git log -S. Say, you introduced the line

FILE* = fopen(config_file, "r");

and somebody else removed it.

Then the following command will list the commits that changes the number of occurrences of a particular string in the code base:

git log -S'fopen(config_file,'

The first commit listed is the one that removed the line, the second commit the one that introduced it.

Note that commits that move the substring around (even to a different file) are not listed. Additionally, if the string was removed in a merge such that the version from one of the parents was taken literally, the merge will not be listed. (This can cause some confusion, option -m can help in the latter case.)

This is far too cumbersome. I want to be able to automatically detect such situations in an existing git repo without the need to take care of each individual code line. — tillmo, Dec 19 '16 at 22:30

score 0 · Answer 3 · answered Dec 19 '16 at 21:46

This is actually somewhat difficult to reproduce. Say we have contributors Alice (A) and Bob (B) working on the repository. The current base commit is X:

          master
            ↓
* --- * --- X

Now, both A and B have that commit checked out and work on it in the same file. A is first, commits and pushes the change:

                master
                  ↓
* --- * --- X --- Y

Now, B is still based on X. There are now two possible situations this can continue:

B commits their changes. A push will fail since Y was pushed in the meantime, so B has to pull first, likely resulting in a merge conflict (since both A and B are working in the same file). B is notified of that situation and has to resolve the conflict before they can go on.

So any reversal of the changes introduced with Y are deliberately made by B.
B does not commit the changes but attempts to pull first (maybe because B knows that there are new changes on the remote). Since the pull would attempt to edit the file that B is working on (and has changes in), Git will refuse to actually update the working directory with the message that “local changes would be overwritten”.

So B again has to resolve those conflicts first before the changes can even be pulled in. Any reversal of Y are again deliberate.

So basically, if this happens, you really have to blame B for undoing A’s work. Outside of these two situations, there is not really a way for the situation you explained to happen.

The only thing I can actually think of is a situation similar to (2) but where B never saved their changes. So when Y is pulled in, Git has a clean working directory (although there are unsaved changes in the file) and correctly updates the local branch. Then, afterwards, B can save the file resulting in only those changes, ignoring whatever happened in Y.

But, this again means that when B finally commits, that they include those undoing changes deliberately – or B simply does not check at all what changes are included in the commit which is a dangerous workflow for many other reasons.

So you should solve this by educating B to actually take care when creating commits, to check what changes will be included in the commit and to make sure that nothing unrelated is added. In doubt, disallow direct pushing to your repository and use a pull request or code review based workflow to ensure that the changes that end up in the repository are actually fine.

This is all true. However, as I said, I still want to be able to detect such situations, also regarding the past where educating B etc. obviously does not help. Also, I am aware that there can be situations where commits are deliberately undone. However, since this occurs not so frequently, I am happy to have them as false positives. — tillmo, Dec 19 '16 at 22:28
I think you are asking for something that cannot be done. Since, you would essentially be getting a notification every time anything was committed. Since, any time you change code, you are un-doing someone elses work. I don't think there is a way for git to intelligently know WHAT code is important to you. — Kenyon, Dec 19 '16 at 23:06
I'd like to add that B should not only be educated about care to make commits, but to take care to save more often. How does one calls `git pull` without even saving the file before? — André Sassi, Dec 19 '16 at 23:08

how to detect a potentially evil git commit (containing the inverse of some other commit)?

3 Answers3