What are the git commands to iterate through files being pushed?

Question

I am trying to implement a pre-push git hook in python to validate files before they are pushed to the remote repo.

I have previously written a pre-commit git hook to validate files before they are committed to the local repo. To get a list of the files in the commit, I ran git diff-index --cached --name-status HEAD.

For the pre-push script, what git commands can I run to iterate through all of the commits about to be pushed, and then iterate through all of the files in the individual commits so that I can validate them?

So far I am using the command: git diff --name-status @{u}..

EDIT: I think it's also important to note that the same files could be modified across multiple commits that are about to be pushed - so it would be good to not end up validating the same file multiple times.

FINAL_SOLUTION:

Here is the code I ended up using thanks to @Vampire's and @torek's answers...

#!/usr/bin/env python

import subprocess
import sys

# git supplies the refs being pushed on stdin, one line per ref, in the following format...
# <local ref> SP <local sha1> SP <remote ref> SP <remote sha1> LF
# the line above represents a branch being pushed
# Note: multiple branches may be pushed at once

def print_error(message):
    # minimal stand-in for the helper used below: report on stderr
    sys.stderr.write(message + "\n")

lines = sys.stdin.read().splitlines()

for line in lines:
    local_ref, local_sha1, remote_ref, remote_sha1 = line.split()

    if remote_sha1 == "0000000000000000000000000000000000000000":
        print_error("Local branch '%s' cannot be found on the remote repo - push only the branch without any commits first!" % local_ref)
        sys.exit(1)

    # get changed files: old (remote) commit first, new (local) commit second,
    # so that files added by the push show up as A rather than D
    changed_files = subprocess.check_output(["git", "diff", "--name-status", remote_sha1, local_sha1], universal_newlines=True)

    # get the non deleted files, dropping the status column (M, A, etc) from the diff output
    non_deleted_files = [ f.split("\t")[-1] for f in changed_files.splitlines() if f and not f.startswith("D") ]

    # validation here...
    validation_failed = False # placeholder: set this from your own checks on non_deleted_files
    if validation_failed:
        sys.exit(1) # terminate the push

sys.exit(0)

That solution is going to fail in some cases (not necessarily all!) when you do a force-push to "rewrite history" (e.g., after `git rebase`). It also does not check any intermediate commits, nor any additional updated references beyond the first, if you push more than one (e.g., `git push --all` or `git push` with `push.default` set to `matching`). If that's all you want, though, it should suffice. (Well, it needs some work for Python3, but that's fine as well: you can just stick with Python 2.7.) — torek, Mar 23 '17 at 03:06
@torek Well that's lame... So if I commit 3 times then push it will only run pre-push on the last commit... — Ogen, Mar 23 '17 at 20:30
That's why I suggested (1) reading *all* of stdin, and then (2) using `git rev-list` (which does require an occasional `git fetch` to fill in unknown hash IDs—this probably *can* be done inside the pre-push hook but I have never tried it) on *each* update and walking the commits looking at `git diff-tree` results. — torek, Mar 23 '17 at 21:54
@torek I thought `sys.stdin.read()` _does_ read all of stdin — Ogen, Mar 23 '17 at 22:07
It does, but then you have to actually use it all. (Poor wording on my part, since `sys.stdin.read()` does *read* all of it. :-) ) — torek, Mar 23 '17 at 22:09
@torek I just tested my code and it does work. If I commit three times without pushing - and then I push, the remote sha1 is the hash for the commit before my three commits and the local sha1 is the hash for the latest commit on my local branch. So the diff shows all the differences. The only time it fails is when I create a local branch - commit 3 times - then push. It fails because the remote sha1 in this case is 0000000... which is a fatal/bad object. Not sure how to get around this case... — Ogen, Mar 25 '17 at 00:33
Not what I meant. Try creating branches `b1` and `b2` and push their creation (with no changes—this is the all-zeros case) to remote; then check out `b1`, change one file, and commit; and **without pushing** check out `b2` and change a separate file and commit. *Now* run `git push origin b1 b2`. — torek, Mar 25 '17 at 03:27
(Meanwhile, if you do want to handle branch creation, there are *some* ways to do it, but none I'd call fully satisfactory.) — torek, Mar 25 '17 at 03:29

torek · Accepted Answer · 2017-03-25T03:59:31.217

Problem 1: commits

Getting the list of commits is only moderately difficult as you mostly need to run git rev-list. There are some edge cases here, however. As the githooks documentation says:

Information about what is to be pushed is provided on the hook’s standard input with lines of the form:

<local ref> SP <local sha1> SP <remote ref> SP <remote sha1> LF

For instance, if the command git push origin master:foreign were run the hook would receive a line like the following:

refs/heads/master 67890 refs/heads/foreign 12345

although the full, 40-character SHA-1s would be supplied. If the foreign ref does not yet exist the <remote SHA-1> will be 40 0. If a ref is to be deleted, the <local ref> will be supplied as (delete) and the <local SHA-1> will be 40 0. If the local commit was specified by something other than a name which could be expanded (such as HEAD~, or a SHA-1) it will be supplied as it was originally given.

Hence, you must read each stdin line and parse it into its components, then decide:

Is this a branch update at all? (I.e., does the remote ref have the form refs/heads/*, as a glob match?) If not, do you want to check any commits anyway?
Is the reference being created or destroyed? If so, what should you do?
Do you have the object specified by the foreign hash? (If not, and the push succeeds—it may well fail—this will drop some number of commit objects, but you cannot tell which ones. Moreover, you cannot properly list which local commits will be transferred: you know what you're asking them to set their name to, but you do not know which commits that you and they have in common since you cannot traverse their history.)
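
The first two decisions can be sketched as a small classifier over one stdin line. This is only an illustration, not part of the hook interface: the names classify_update and ZERO_RE and the returned labels are made up here, and the third question (whether you have the foreign object) is left to a separate check.

```python
import re

ZERO_RE = re.compile(r'0+$')  # an all-zeros hash means "no such object"

def classify_update(line):
    """Classify one pre-push stdin line; the labels are illustrative only."""
    localref, localhash, foreignref, foreignhash = line.split()
    if not foreignref.startswith('refs/heads/'):
        return 'not-a-branch'   # e.g. a tag update
    if ZERO_RE.match(localhash):
        return 'delete'         # the remote ref is being deleted
    if ZERO_RE.match(foreignhash):
        return 'create'         # the remote ref does not exist yet
    return 'update'             # both hashes name real objects
```

A hook would call this once per line read from stdin and decide what to do with each label.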

Assuming you have determined the answers to these—let's say they are "no", "skip it", and "locally reject pushes that are not analyzable"—we go on to listing the commits, and that is just the output of:

git rev-list remotehash..localhash

which you might do with:

import subprocess

proc = subprocess.Popen(['git', 'rev-list',
    '{}..{}'.format(remotehash, localhash)], stdout=subprocess.PIPE)
text = proc.stdout.read()
if proc.wait():
    # Git failed here, e.g. a hash is all-zeros or names an object we lack
    raise RuntimeError('git rev-list failed')
if not isinstance(text, str):   # i.e., if python3
    text = text.decode('utf-8') # convert bytes to str
lines = text.splitlines()       # one commit hash per line, newest first
# now work with each commit hash

Note that this git rev-list call will fail (exit with a nonzero status) if the remote or local hash is all-zeros, or if the remote hash is for an object that does not exist in your local repository (you can check this using git rev-parse --verify --quiet and checking the return status, or perhaps use the failure here as your indication that you cannot check the commits, although there are other options when creating a new branch).
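
The git rev-parse check mentioned above might look roughly like this. It is a sketch for Python 3 (it uses subprocess.DEVNULL); the function name have_commit and the injectable call parameter are assumptions made for illustration, the latter only so the function can be exercised without a repository.

```python
import subprocess

def have_commit(hash_id, call=subprocess.call):
    """Return True if hash_id names a commit in the local repository.

    `call` defaults to subprocess.call; it is a parameter only so that
    the sketch can be tried out with a stand-in.
    """
    status = call(['git', 'rev-parse', '--verify', '--quiet',
                   hash_id + '^{commit}'],
                  stdout=subprocess.DEVNULL)
    return status == 0
```

An all-zeros hash fails this check too, so it can double as a guard before running git rev-list.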

Note that you must run the above git rev-list for each reference that is to be updated. It's possible that the same commits, or some subset of the same commits, will be sent for different references. For instance:

git push origin HEAD:br1 HEAD:br2 HEAD~3:br3

would request that the remote update three branches br1 through br3, setting br1 and br2 to the same commit as HEAD and setting br3 to the commit three steps back from HEAD. We do not (and cannot) know which commits are truly new—the pre-receive hook on the other end could figure that out, but we cannot—but if the remote's br1 and br2 are both being updated from HEAD~3 to HEAD, and the remote's br3 is being updated from HEAD~2 backwards to HEAD~3, at most the commits HEAD~1 through HEAD can be new. Whether you want to check HEAD~2 as well, since it is now likely to show up on br1 and br2 in the other repository (even though it was already on br3 there), is also up to you.
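
Since several references can list the same commits, one possible way to avoid checking a commit twice is to merge the per-reference lists before validating. A minimal sketch, assuming each inner list holds the hashes printed by one git rev-list run (the name unique_commits is made up here):

```python
def unique_commits(per_ref_commits):
    """Merge several rev-list outputs, keeping first-seen order, no repeats."""
    seen = set()
    ordered = []
    for commits in per_ref_commits:
        for commit in commits:
            if commit not in seen:
                seen.add(commit)
                ordered.append(commit)
    return ordered
```

For the br1/br2/br3 example, br1 and br2 would each list the same commits, and br3 would contribute an empty list because nothing is added there.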

Problem 2: files

Now you have the more difficult problem. You mentioned in an edit that:

EDIT: I think it's also important to note that the same files could be modified across multiple commits that are about to be pushed - so it would be good to not end up validating the same file multiple times.

Each commit to be sent has a complete snapshot of the repository. That is, each commit has every file. I have no idea what validation you intend to run, but you are correct: if you are sending, say, six commits total, it's pretty likely that most of the files in all six commits are the same, and only a few files are modified. However, file foo.py might be modified in commit 1234567 (with respect to 1234567's parent commit), and then modified again in commit fedcba9, and you probably should check both versions.

Moreover, when a commit is a merge commit, it has (at least) two different parents. Should you check a file if it is different from either parent? Or should you check it only if it differs from both parents, indicating that it contains changes from "both sides" of the merge? If it has only changes from "one side", the file is probably "pre-checked" by whatever checks happened for the commit that is on the other side, and hence it may not need to be re-checked (though of course this depends on the kind of checking).

(For an octopus merge, i.e., a merge with more than two parents, this question gets substantially harder to think about.)
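
To decide how to treat a commit, you first need to know how many parents it has. One way, sketched here, is to run git rev-list --parents -n 1 on the commit, which prints the commit hash followed by its parent hashes, and count the fields. The function names and labels below are illustrative only, and the input is assumed to be one non-empty line of that output.

```python
def parent_count(rev_list_parents_line):
    """Count parents in one line of `git rev-list --parents` output."""
    fields = rev_list_parents_line.split()
    return len(fields) - 1      # the first field is the commit itself

def kind_of_commit(rev_list_parents_line):
    """Label a commit by its number of parents (labels are made up here)."""
    count = parent_count(rev_list_parents_line)
    if count == 0:
        return 'root'
    if count == 1:
        return 'ordinary'
    if count == 2:
        return 'merge'
    return 'octopus'
```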

It's relatively easy to see which files are changed in a commit, with respect to its parent or parents: simply run git diff-tree with appropriate options (notably, -r to recurse into sub-trees of the commit). The default output format is quite machine-parseable, though you might want to add -z to make it easier to handle directly within Python. If you are doing these one at a time—which you might as well—you probably also want --no-commit-id so that you need not read and skip the commit header.

It's up to you whether you want to enable rename detection and if so, at what threshold. Depending, again, on precisely what you are doing to verify files, leaving rename detection off is often best: that way you will "see" a renamed file as a deletion of the old path and an addition of the new path.

The output from git diff-tree -r --no-commit-id on a particular commit looks like this:

:000000 100644 0000000000000000000000000000000000000000 b0b4c36f9780eaa600232fec1adee9e6ba23efe5 A  Documentation/RelNotes/2.13.0.txt
:100755 100755 6a208e92bf30c849028268b5fca54b902f671bbd 817d1cf7ef2a2a99ab11e5a88a27dfea673fec79 M  GIT-VERSION-GEN
:120000 120000 d09c3d51093ac9e4da65e8a127b17ac9023520b5 125bf78f3b9ed2f1444e1873ed02cce9f0f4c5b8 M  RelNotes

The hash IDs are the old and new blob hashes; the letter codes and path names are as documented. You can then retrieve the file contents using git cat-file -p on the new hash ID. If your Git is new enough, you can even get any .gitattributes-based filtering and end-of-line conversion applied by using --filters (or --textconv, for textconv conversion) in place of -p, together with --path=<path> (or by naming the object as <commit>:<path> instead of using --path=...). Or you can just use the form of the object stored in the repository, if filters are not important.
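
Lines in this format could be parsed along the following lines. This is a sketch for output produced without -z, where Git separates the status letter from the path with a tab (and may quote unusual path names); it also assumes rename detection is off, so each line carries a single path. The names Change, parse_raw_line and blobs_to_check are made up for illustration.

```python
from collections import namedtuple

Change = namedtuple('Change',
                    'old_mode new_mode old_hash new_hash status path')

def parse_raw_line(line):
    """Parse one line of `git diff-tree -r --no-commit-id` output (no -z)."""
    meta, path = line.split('\t', 1)
    old_mode, new_mode, old_hash, new_hash, status = meta.lstrip(':').split()
    return Change(old_mode, new_mode, old_hash, new_hash, status, path)

def blobs_to_check(lines):
    """Return (path, new_hash) pairs, skipping deletions (status D)."""
    changes = [parse_raw_line(line) for line in lines if line]
    return [(c.path, c.new_hash) for c in changes if c.status != 'D']
```

Each new_hash returned here is what you would hand to git cat-file to get the file contents.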

Depending on just what you are checking, though, you might need to extract the entire commit into a temporary work-tree. (For instance a static analyzer might want to execute any imports.) In this case you might as well just run git checkout, using the GIT_INDEX_FILE environment variable (pass this via subprocess as usual) to specify a temporary index file so that the main index is not disturbed. Specify an alternative work-tree with --work-tree= or via the GIT_WORK_TREE environment variable. In any case the git diff-tree will tell you which files were modified, and therefore should be checked. (You can use shutil.rmtree to dispose of the temporary work-tree once the testing is complete.)
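
A rough, untested sketch of that temporary work-tree approach follows. All three function names are made up here. The checkout is given a pathspec (-- .) so that it only writes files and the scratch index and does not move HEAD; check is assumed to be your own callable that takes the directory to inspect.

```python
import os
import shutil
import subprocess
import tempfile

def checkout_command(commit, worktree):
    """Build the argv for writing the files of `commit` into `worktree`."""
    return ['git', '--work-tree=' + worktree,
            'checkout', '-f', commit, '--', '.']

def temp_index_env(index_path, base_env=None):
    """Copy the environment, pointing GIT_INDEX_FILE at a scratch index."""
    env = dict(os.environ if base_env is None else base_env)
    env['GIT_INDEX_FILE'] = index_path
    return env

def with_temp_checkout(commit, check):
    """Extract `commit` into a scratch directory, run check(dir), clean up."""
    scratch = tempfile.mkdtemp()
    try:
        worktree = os.path.join(scratch, 'tree')
        os.mkdir(worktree)
        env = temp_index_env(os.path.join(scratch, 'index'))
        subprocess.check_call(checkout_command(commit, worktree), env=env)
        return check(worktree)
    finally:
        shutil.rmtree(scratch)  # removes the work-tree and the scratch index
```

Keeping the scratch index inside the same temporary directory means one shutil.rmtree call disposes of everything.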

If you are going to check merge commits, pay special attention to the description of combined diffs done for merges, as they will require somewhat different treatment (or splitting the merge with -m).

Edit: some code to show what I mean

Here is a bit of code to obtain all the inputs and show each commit being added to each foreign branch. Note that the list of added commits will be empty if commits are only being removed. This is also only very lightly tested, and not intended to be robust, maintainable, good style, etc., just sort of a minimal example.

import re, subprocess, sys

lines = sys.stdin.read().splitlines()
for line in lines:
    localref, localhash, foreignref, foreignhash = line.split()
    if not foreignref.startswith('refs/heads/'):
        print('skip {}'.format(foreignref))
        continue
    if re.match('0+$', localhash):
        print('deleting {}, do nothing'.format(foreignref))
        continue
    if re.match('0+$', foreignhash):
        print('creating {}, too hard for now'.format(foreignref))
        continue
    proc = subprocess.Popen(['git', 'rev-parse', '--quiet', '--verify',
            foreignhash],
        stdout=subprocess.PIPE)
    _ = proc.stdout.read()
    status = proc.wait()
    if status:
        print('we do not have {} for {}, try '
            'git fetch'.format(foreignhash, foreignref))
        # can try to run git fetch here ourselves, but for now:
        continue
    print('sending these commits for {}:'.format(foreignref))
    subprocess.call(['git', 'rev-list', '{}..{}'.format(foreignhash, localhash)])

I updated my hook to handle multiple branches being included in a push by running my code validation for each branch. I am checking if the remote hash is only zeros - in which case I am just terminating the push and telling the user to push the branch. However this is unacceptable because now i can't push new branches... my hooks fail the process. I can only push commits to an existing branch because the remote hash isn't only zeros in that case. — Ogen, Mar 26 '17 at 01:53
I updated the question to include the code I am currently using too — Ogen, Mar 26 '17 at 02:02
Well: it may be worth trying a `git fetch` where I have suggested it (and then see if we acquire the missing object), and for the case where we are creating the branch, treat *all* files at the local branch-tip as "added" by diffing against the hash of the [empty tree (click on this link for details)](http://stackoverflow.com/q/9765453/1256452). Since you're not examining *any* of the intermediate commits, that might be a reasonable approach. — torek, Mar 26 '17 at 04:16
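
The empty-tree suggestion in the last comment could be sketched like this. The constant is the well-known SHA-1 of Git's empty tree, so this assumes a SHA-1 repository; the function names are made up for illustration.

```python
# SHA-1 of Git's empty tree (assumes a SHA-1 repository)
EMPTY_TREE = '4b825dc642cb6eb9a060e54bf8d69288fbee4904'

def diff_base(remote_sha1):
    """Pick what to diff the local tip against for one pushed ref."""
    if set(remote_sha1) == {'0'}:
        return EMPTY_TREE       # branch creation: every file counts as added
    return remote_sha1

def diff_command(remote_sha1, local_sha1):
    """Build the git diff argv, old side first and new side second."""
    return ['git', 'diff', '--name-status',
            diff_base(remote_sha1), local_sha1]
```

With this, a newly created branch no longer has to be rejected: all files at its tip are listed with status A.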

score 1 · Answer 2 · answered Mar 23 '17 at 00:14

Using @{u}.. is of little help, as it diffs the upstream of HEAD against HEAD, and only if an upstream is defined at all. This does not necessarily have anything to do with what is pushed, as you can push any branch, or actually any commit-ish, regardless of what is currently checked out, and to any remote branch you wish, regardless of the upstream setting.

As per the documentation of githooks, you get the remote name and location as parameters to your script and on stdin you get one line per pushed "thing" with the local and remote ref and local and remote sha. So you need to iterate over stdin and diff the remote sha that is pushed to against the local sha you push to get the files that are different.
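
Gathering those inputs might look like the following sketch. The function name parse_pre_push is made up here; in a real hook you would pass it sys.argv and sys.stdin.read().

```python
def parse_pre_push(argv, stdin_text):
    """Split the hook inputs into (remote_name, remote_location, updates).

    Each update is a (local ref, local sha, remote ref, remote sha) tuple.
    """
    remote_name, remote_location = argv[1], argv[2]
    updates = [tuple(line.split())
               for line in stdin_text.splitlines() if line.strip()]
    return remote_name, remote_location, updates
```

Each tuple in updates then gets its own diff of the remote sha against the local sha.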
