Problem 1: commits
Getting the list of commits is only moderately difficult, as you mostly need to run git rev-list. There are some edge cases here, however. As the githooks documentation says:
Information about what is to be pushed is provided on the hook’s standard input with lines of the form:
<local ref> SP <local sha1> SP <remote ref> SP <remote sha1> LF
For instance, if the command git push origin master:foreign were run the hook would receive a line like the following:
refs/heads/master 67890 refs/heads/foreign 12345
although the full, 40-character SHA-1s would be supplied. If the foreign ref does not yet exist the <remote SHA-1> will be 40 0. If a ref is to be deleted, the <local ref> will be supplied as (delete) and the <local SHA-1> will be 40 0. If the local commit was specified by something other than a name which could be expanded (such as HEAD~, or a SHA-1) it will be supplied as it was originally given.
Hence, you must read each stdin line and parse it into its components, then decide:
- Is this a branch update at all? (I.e., does the remote ref have the form refs/heads/*, as a glob match?) If not, do you want to check any commits anyway?
- Is the reference being created or destroyed? If so, what should you do?
- Do you have the object specified by the foreign hash? (If not, and the push succeeds—it may well fail—this will drop some number of commit objects, but you cannot tell which ones. Moreover, you cannot properly list which local commits will be transferred: you know what you're asking them to set their name to, but you do not know which commits that you and they have in common since you cannot traverse their history.)
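The parsing and the first two decisions can be sketched as below. This is only a minimal illustration: the names PushLine, parse_push_line, and classify are hypothetical, and the policy (skip anything outside refs/heads/) is just one possible answer to the questions above.

```python
import re
from collections import namedtuple

# Hypothetical record type for one line of pre-push standard input.
PushLine = namedtuple('PushLine',
                      'local_ref local_hash remote_ref remote_hash')

ZERO_RE = re.compile(r'0+$')  # an all-zeros hash means "no object"


def parse_push_line(line):
    """Split '<local ref> SP <local sha1> SP <remote ref> SP <remote sha1>'."""
    local_ref, local_hash, remote_ref, remote_hash = line.split()
    return PushLine(local_ref, local_hash, remote_ref, remote_hash)


def classify(push):
    """Return 'skip', 'delete', 'create', or 'update' for a parsed line."""
    if not push.remote_ref.startswith('refs/heads/'):
        return 'skip'      # not a branch update
    if ZERO_RE.match(push.local_hash):
        return 'delete'    # the remote ref is being deleted
    if ZERO_RE.match(push.remote_hash):
        return 'create'    # the remote ref does not exist yet
    return 'update'
```

A hook would call parse_push_line on each line of sys.stdin and branch on the result of classify; whether the remote hash names an object you actually have is a separate check.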
Assuming you have determined the answers to these—let's say they are "no", "skip it", and "locally reject pushes that are not analyzable"—we go on to listing the commits, and that is just the output of:
git rev-list remotehash..localhash
which you might do with:
import subprocess

proc = subprocess.Popen(['git', 'rev-list',
    '{}..{}'.format(remotehash, localhash)], stdout=subprocess.PIPE)
text = proc.stdout.read()
if proc.wait():
    # Git failed here (nonzero exit status)
    raise RuntimeError('git rev-list failed')
if not isinstance(text, str): # i.e., if python3
    text = text.decode('utf-8') # convert bytes to str
lines = text.splitlines() # one commit hash per line
# now work with each commit hash
Note that this git rev-list call will fail (exit with a nonzero status) if the remote or local hash is all-zeros, or if the remote hash is for an object that does not exist in your local repository. You can check for the latter using git rev-parse --verify --quiet and checking the return status, or perhaps use the failure here as your indication that you cannot check the commits, although there are other options when creating a new branch.
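A minimal sketch of that existence check follows. The function name have_commit and its run parameter are hypothetical (run exists only so the process runner can be swapped out). Appending ^{commit} is my own addition: as far as I know, rev-parse --verify may accept a full 40-character hash without looking the object up, and the suffix forces Git to find the object and confirm it is a commit. The sketch assumes Python 3 for subprocess.DEVNULL.

```python
import subprocess


def have_commit(hash_id, run=subprocess.call):
    """Return True if the local repository has a commit named hash_id.

    `run` is a hypothetical hook for substituting the process runner;
    it defaults to subprocess.call, which returns the exit status.
    """
    status = run(
        ['git', 'rev-parse', '--verify', '--quiet', hash_id + '^{commit}'],
        stdout=subprocess.DEVNULL)  # discard the printed hash
    return status == 0
```

If have_commit(remotehash) is false, the range remotehash..localhash cannot be listed, and the hook has to reject the push, fetch first, or skip the check.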
Note that you must run the above git rev-list for each reference that is to be updated. It's possible that the same commits, or some subset of the same commits, will be sent for different references. For instance:
git push origin HEAD:br1 HEAD:br2 HEAD~3:br3
would request that the remote update three branches br1 through br3, setting br1 and br2 to the same commit as HEAD and setting br3 to the commit three steps back from HEAD. We do not (and cannot) know which commits are truly new (the pre-receive hook on the other end could figure that out, but we cannot), but if the remote's br1 and br2 are both being updated from HEAD~3 to HEAD, and the remote's br3 is being updated from HEAD~2 backwards to HEAD~3, at most the commits HEAD~1 through HEAD can be new. Whether you want to check HEAD~2 as well, since it is now likely to show up on br1 and br2 in the other repository (even though it was already on br3 there), is also up to you.
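Since the same commit can appear in the rev-list output for several references, one way to avoid checking it repeatedly is to collect the per-reference lists first and then visit each hash once. A minimal sketch, in which unique_commits is a hypothetical helper and the input is assumed to be a mapping from remote ref name to the list of hashes that rev-list printed for it:

```python
from collections import OrderedDict


def unique_commits(commits_by_ref):
    """Map each commit hash to the list of refs it would be sent for.

    commits_by_ref: {remote_ref: [hashes from git rev-list]}.
    Each hash appears once, in first-seen order.
    """
    seen = OrderedDict()
    for ref, commits in commits_by_ref.items():
        for commit in commits:
            seen.setdefault(commit, []).append(ref)
    return seen
```

Iterating over the result visits every commit once while still recording which branches it is headed for, which helps when reporting a failure.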
Problem 2: files
Now you have the more difficult problem. You mentioned in an edit that:
EDIT: I think it's also important to note that the same files could be modified across multiple commits that are about to be pushed - so it would be good to not end up validating the same file multiple times.
Each commit to be sent has a complete snapshot of the repository. That is, each commit has every file. I have no idea what validation you intend to run, but you are correct: if you are sending, say, six commits total, it's pretty likely that most of the files in all six commits are the same, and only a few files are modified. However, file foo.py might be modified in commit 1234567 (with respect to 1234567's parent commit), and then modified again in commit fedcba9, and you probably should check both versions.
Moreover, when a commit is a merge commit, it has (at least) two different parents. Should you check a file if it is different from either parent? Or should you check it only if it differs from both parents, indicating that it contains changes from "both sides" of the merge? If it has only changes from "one side", the file is probably "pre-checked" by whatever checks happened for the commit that is on the other side, and hence it may not need to be re-checked (though of course this depends on the kind of checking).
(For an octopus merge, i.e., a merge with more than two parents, this question gets substantially harder to think about.)
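The two policies for a merge commit can be written down as set operations. This is only a sketch under the assumption that you have already computed, for each parent, the set of paths that differ between that parent and the merge; paths_to_check and its require_all flag are hypothetical names.

```python
def paths_to_check(changed_vs_parents, require_all=True):
    """Pick the paths to validate in a merge commit.

    changed_vs_parents: one collection of changed paths per parent.
    require_all=True  -> only paths that differ from every parent
                         (changes from "both sides", or new in the merge).
    require_all=False -> paths that differ from any parent.
    """
    sets = [set(paths) for paths in changed_vs_parents]
    if not sets:
        return set()
    if require_all:
        return set.intersection(*sets)
    return set.union(*sets)
```

The same function covers ordinary single-parent commits (both modes give the same answer) and octopus merges, although whether "differs from every parent" is the right rule for an octopus merge is still a judgment call.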
It's relatively easy to see which files are changed in a commit, with respect to its parent or parents: simply run git diff-tree with appropriate options (notably, -r to recurse into sub-trees of the commit). The default output format is quite machine-parseable, though you might want to add -z to make it easier to handle directly within Python. If you are doing these one at a time, which you might as well, you probably also want --no-commit-id so that you need not read and skip the commit header.
It's up to you whether you want to enable rename detection and if so, at what threshold. Depending, again, on precisely what you are doing to verify files, leaving rename detection off is often best: that way you will "see" a renamed file as a deletion of the old path and an addition of the new path.
The output from git diff-tree -r --no-commit-id on a particular commit looks like this:
:000000 100644 0000000000000000000000000000000000000000 b0b4c36f9780eaa600232fec1adee9e6ba23efe5 A Documentation/RelNotes/2.13.0.txt
:100755 100755 6a208e92bf30c849028268b5fca54b902f671bbd 817d1cf7ef2a2a99ab11e5a88a27dfea673fec79 M GIT-VERSION-GEN
:120000 120000 d09c3d51093ac9e4da65e8a127b17ac9023520b5 125bf78f3b9ed2f1444e1873ed02cce9f0f4c5b8 M RelNotes
The hash IDs are the old and new blob hashes; the letter codes and path names are as documented. You can then retrieve the file contents using git cat-file -p on the new hash ID. If your Git is new enough, you can even get any .gitattributes-based filtering and end of line conversion applied by using --filters (or --textconv) in place of -p, together with --path=<path> (or by using the file's path together with the commit ID, instead of --path=..., to name the object to be extracted). Or you can just use the form of the object stored in the repository, if filters are not important.
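Parsing the raw format shown above might look like the following sketch. It assumes the output was produced without -z, so each record is one line, the metadata fields are separated by spaces, and the path follows a tab (the tab is not visible in the sample above). Change, parse_raw_line, and blobs_to_check are hypothetical names, and choosing statuses A and M is just one possible policy.

```python
from collections import namedtuple

# Hypothetical record for one line of `git diff-tree -r` raw output.
Change = namedtuple('Change',
                    'old_mode new_mode old_hash new_hash status paths')


def parse_raw_line(line):
    """Parse ':<old mode> <new mode> <old hash> <new hash> <status>TAB<path>'."""
    meta, _, rest = line.partition('\t')
    old_mode, new_mode, old_hash, new_hash, status = meta.lstrip(':').split(' ')
    # With rename detection on there are two tab-separated paths.
    return Change(old_mode, new_mode, old_hash, new_hash, status,
                  rest.split('\t'))


def blobs_to_check(lines):
    """Yield (path, new blob hash) for added or modified files."""
    for line in lines:
        change = parse_raw_line(line)
        if change.status in ('A', 'M'):
            yield change.paths[-1], change.new_hash
```

Each yielded hash can then be handed to git cat-file to obtain the content to validate. Note that without -z, Git may quote path names containing unusual characters, which this sketch does not undo.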
Depending on just what you are checking, though, you might need to extract the entire commit into a temporary work-tree. (For instance a static analyzer might want to execute any imports.) In this case you might as well just run git checkout, using the GIT_INDEX_FILE environment variable (pass this via subprocess as usual) to specify a temporary index file so that the main index is not disturbed. Specify an alternative work-tree with --work-tree= or via the GIT_WORK_TREE environment variable. In any case the git diff-tree will tell you which files were modified, and therefore should be checked. (You can use shutil.rmtree to dispose of the temporary work-tree once the testing is complete.)
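A rough, untested sketch of the temporary work-tree approach is below. The helper names checkout_env and with_temp_checkout are hypothetical, and the exact checkout invocation (git checkout -f <commit> -- .) is my assumption about one way to populate the tree without moving HEAD; adjust it to your needs.

```python
import os
import shutil
import subprocess
import tempfile


def checkout_env(index_file, work_tree, base_env=None):
    """Return a copy of the environment pointing Git at a temporary
    index file and work-tree, leaving the real ones undisturbed."""
    env = dict(os.environ if base_env is None else base_env)
    env['GIT_INDEX_FILE'] = index_file
    env['GIT_WORK_TREE'] = work_tree
    return env


def with_temp_checkout(commit, check):
    """Extract `commit` into a temporary directory, run check(path)
    on it, and remove the directory afterwards."""
    tmp = tempfile.mkdtemp()
    try:
        work_tree = os.path.join(tmp, 'tree')
        os.mkdir(work_tree)
        env = checkout_env(os.path.join(tmp, 'index'), work_tree)
        # Assumed invocation: check out every path from the commit.
        subprocess.check_call(
            ['git', 'checkout', '-f', commit, '--', '.'], env=env)
        return check(work_tree)
    finally:
        shutil.rmtree(tmp)  # dispose of the temporary work-tree
```

Here check is whatever callable runs your validation against the extracted directory; its return value is passed back to the caller.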
If you are going to check merge commits, pay special attention to the description of combined diffs done for merges, as they will require somewhat different treatment (or splitting the merge with -m).
Edit: some code to show what I mean
Here is a bit of code to obtain all the inputs and show each commit being added to each foreign branch. Note that the list of added commits will be empty if commits are only being removed. This is also only very lightly tested, and not intended to be robust, maintainable, good style, etc., just sort of a minimal example.
import re, subprocess, sys

lines = sys.stdin.read().splitlines()
for line in lines:
    localref, localhash, foreignref, foreignhash = line.split()
    if not foreignref.startswith('refs/heads/'):
        print('skip {}'.format(foreignref))
        continue
    if re.match('0+$', localhash):
        print('deleting {}, do nothing'.format(foreignref))
        continue
    if re.match('0+$', foreignhash):
        print('creating {}, too hard for now'.format(foreignref))
        continue
    # check whether we have the object the remote ref currently names
    proc = subprocess.Popen(['git', 'rev-parse', '--quiet', '--verify',
        foreignhash],
        stdout=subprocess.PIPE)
    _ = proc.stdout.read()
    status = proc.wait()
    if status:
        print('we do not have {} for {}, try '
            'git fetch'.format(foreignhash, foreignref))
        # can try to run git fetch here ourselves, but for now:
        continue
    print('sending these commits for {}:'.format(foreignref))
    sys.stdout.flush() # keep our output ahead of git's
    # commits reachable from localhash but not from foreignhash
    subprocess.call(['git', 'rev-list', '{}..{}'.format(foreignhash, localhash)])