
Right now I am able to traverse the commit tree of a GitHub repository using the pygit2 library. I get the commits for every file change in the repository, which means I also get changes to text documents with the .rtf extension. How do I filter for the commits that relate to code changes only? I don't want the changes related to text documents.

Appreciate any help or pointers. Thanks.

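last = repo[repo.head.target]
t0 = last

f = open(outputFile, 'w')

print t0.hex

# Walk the history from HEAD and write the diff between each pair of
# consecutive commits to the output file.
for commit in repo.walk(last.id):
    if t0.hex == commit.hex:
        continue

    print commit.hex
    out = repo.diff(t0, commit)
    f.write(out.patch)
    t0 = commit

f.close()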

As part of the output, I also get the diffs of .rtf files, as shown below:

diff --git a/archived-output/NEW/action-core[best].rtf b/archived-output/NEW/action-core[best].rtf
deleted file mode 100644
index 56cdec6..0000000
--- a/archived-output/NEW/action-core[best].rtf
+++ /dev/null
@@ -1,8935 +0,0 @@
-{\rtf1\adeflang1025\ansi\ansicpg1252\uc1\adeff31507\deff0\stshfdbch31506\stshfloch31506\stshfhich31506\stshfbi31507\deflang1033\deflangfe1033\themelang1033\themelangfe0\themelangcs0{\fonttbl{\f0\fbidi \froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman;}{\f1\fbidi \fswiss\fcharset0\fprq2{\*\panose 020b0604020202020204}Arial;}
-{\f2\fbidi \fmodern\fcharset0\fprq1{\*\panose 02070309020205020404}Courier New;}{\f3\fbidi \froman\fcharset2\fprq2{\*\panose 05050102010706020507}Symbol;}

Either I have to filter the commits from the tree or I have to filter the output. I was wondering whether I could drop the changes related to .rtf files by removing the corresponding commits while walking through the tree.

Zack
  • Can you update the post with what your current code looks like to do this? IMO, this is too broad as it stands now. – Anshul Goyal Feb 07 '15 at 20:43
  • @mu無 I have added the details. – Zack Feb 07 '15 at 20:51
  • Remember that a commit can (and often does) involve more than a single file, so simply "removing the corresponding commits" is probably the wrong solution. I suspect you have to iterate through the tree object attached to each commit and look for modified files, and then pick the ones you want. – larsks Feb 07 '15 at 22:34
  • @larsks yes. If that is possible, how do we get the list of modified files? – Zack Feb 08 '15 at 01:32

2 Answers


If that is possible, how do we get the list of modified files?

Ah, now you're asking the right questions! Git, of course, does not store a list of modified files in each commit. Rather, each commit represents the state of the entire repository at a certain point in time. In order to find the modified files, you need to compare the files contained in one commit with the previous commit.

For each commit returned by repo.walk(), the tree attribute refers to the associated Tree object (which is itself a list of TreeEntry objects representing files and directories contained in that particular Tree).

A Tree object has a diff_to_tree() method that can be used to compare it against another Tree object. This returns a Diff object, which acts as an iterator over a list of Patch objects. Each Patch object refers to the changes in a single file between the two Trees that are being compared.

The Patch object is really the key to all this, because this is how we determine which files have been modified.

The following code demonstrates this. For each commit, it will print a list of new, modified, or deleted files:

import pygit2


repo = pygit2.Repository('.')

prev = None
for cur in repo.walk(repo.head.target):

    if prev is not None:
        print prev.id
        diff = cur.tree.diff_to_tree(prev.tree)
        for patch in diff:
            print patch.status, ':', patch.new_file_path,
            if patch.new_file_path != patch.old_file_path:
                print '(was %s)' % patch.old_file_path,
            print

    if cur.parents:
        prev = cur
        cur = cur.parents[0]

If we run this against a sample repository, we can look at the output for the first few commits:

c285a21e013892ee7601a53df16942cdcbd39fe6
D : fragments/configure-flannel.sh
A : fragments/flannel-config.service.yaml
A : fragments/write-flannel-config.sh
M : kubecluster.yaml
b06de8f2f366204aa1327491fff91574e68cd4ec
M : fragments/enable-services-master.sh
M : fragments/enable-services-minion.sh
c265ddedac7162c103672022633a574ea03edf6f
M : fragments/configure-flannel.sh
88a8bd0eefd45880451f4daffd47f0e592f5a62b
A : fragments/configure-docker-storage.sh
M : fragments/write-heat-params.yaml
M : kubenode.yaml

And compare that to the output of git log --oneline --name-status:

c285a21 configure flannel via systemd unit
D       fragments/configure-flannel.sh
A       fragments/flannel-config.service.yaml
A       fragments/write-flannel-config.sh
M       kubecluster.yaml
b06de8f call daemon-reload before starting services
M       fragments/enable-services-master.sh
M       fragments/enable-services-minion.sh
c265dde fix json syntax problem
M       fragments/configure-flannel.sh
88a8bd0 configure cinder volume for docker storage
A       fragments/configure-docker-storage.sh
M       fragments/write-heat-params.yaml
M       kubenode.yaml

...aaaand, that looks just about identical. Hopefully this is enough to get you started.
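To get back to the original question: once you have the path of every changed file, dropping the document changes is a plain string filter. Here is a minimal Python 3 sketch, assuming a file counts as "code" whenever its extension is not on a deny-list. The names DOCUMENT_EXTENSIONS, is_code_path and code_changes are made up for this illustration and are not part of pygit2; the sample data reuses paths that appear earlier on this page.

```python
from pathlib import PurePosixPath

# Hypothetical deny-list; adjust it to the document types in your repository.
DOCUMENT_EXTENSIONS = {".rtf", ".doc", ".docx", ".pdf"}


def is_code_path(path, skip_exts=DOCUMENT_EXTENSIONS):
    """Return True unless the path ends in one of the skipped extensions."""
    # Git reports paths with forward slashes, hence PurePosixPath.
    return PurePosixPath(path).suffix.lower() not in skip_exts


def code_changes(changes, skip_exts=DOCUMENT_EXTENSIONS):
    """Keep only the (status, path) pairs that refer to code files."""
    return [(status, path) for status, path in changes
            if is_code_path(path, skip_exts)]


# Sample (status, path) pairs standing in for what the diff loop reports.
changes = [
    ("D", "archived-output/NEW/action-core[best].rtf"),
    ("M", "kubecluster.yaml"),
    ("A", "fragments/write-flannel-config.sh"),
]
for status, path in code_changes(changes):
    print(status, ":", path)
```

In the loop above you would call is_code_path(patch.new_file_path) before printing and skip the entry when it returns False. A commit whose changes are all filtered out can then be treated as a documents-only commit.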

larsks
  • Cool! This will give me a way to check the files associated with the patch! Really appreciate it. Where can I find detailed information about these things? As of now, I am referring to the documentation and the search results. Thanks again! – Zack Feb 08 '15 at 03:23
  • I figured all of this out just now, reading through the [documentation](http://www.pygit2.org/) and fiddling around at the interactive Python prompt. It was [this page](http://www.pygit2.org/diff.html#the-patch-type) that clued me in to how things might work. – larsks Feb 08 '15 at 03:48
  • The very last line `cur = cur.parents[0]` doesn't make sense to me, because `cur` will be overwritten in the next `for` iteration. That would better go to the top of the loop. Then you don't miss the last diff, especially when using a walker for which hide() was called (i.e. walk a..b) – Adrian W Aug 20 '20 at 17:18

This is mainly a rewrite of larsks's excellent answer to

  • the current pygit2 API
  • Python 3

It also fixes a flaw in the iteration logic: when a revision range (a..b) is walked, the original code fails to diff the last revision against its parent.

The following approximates the command

git log --name-status --pretty="format:Files changed in %h" origin/devel..master

on the sample repository given by larsks.

I was unable to trace file renames, though. A rename is printed as a deletion and an addition, so the line of code that prints a rename is never reached.

import pygit2

repo = pygit2.Repository('.')

# Show files changed between origin/devel and current HEAD
devel = repo.revparse_single('origin/devel')
walker = repo.walk(repo.head.target)
walker.hide(devel.id)

for cur in walker:
    # Root commits have no parent to diff against, so they are skipped.
    if cur.parents:
        print(f'Files changed in {cur.short_id}')
        prev = cur.parents[0]

        # Diff from the first parent to the current commit.
        diff = prev.tree.diff_to_tree(cur.tree)
        for patch in diff:
            print(patch.delta.status_char(), ':', patch.delta.new_file.path)
            if patch.delta.new_file.path != patch.delta.old_file.path:
                print(f'(was {patch.delta.old_file.path})')
        print()
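A possible way to get renames reported (a sketch, not verified against the sample repository): libgit2 only pairs a deletion with an addition when rename detection is requested, and pygit2 exposes that as Diff.find_similar(). The helper below assumes that diff is the object returned by diff_to_tree() and that find_similar() with its default arguments is sufficient in your pygit2 version; name_status is a hypothetical name chosen for this example.

```python
def name_status(diff, detect_renames=True):
    """Yield (status, new_path, old_path) for every delta in a pygit2 Diff."""
    if detect_renames:
        # Assumption: find_similar() rewrites matching delete/add pairs
        # into rename deltas in place, so it must run before iterating.
        diff.find_similar()
    for patch in diff:
        delta = patch.delta
        yield delta.status_char(), delta.new_file.path, delta.old_file.path


# Intended use inside the loop above (needs a real repository):
#     diff = prev.tree.diff_to_tree(cur.tree)
#     for status, new_path, old_path in name_status(diff):
#         print(status, ':', new_path)
#         if new_path != old_path:
#             print(f'(was {old_path})')
```

If this works as assumed, a renamed file should show up as a single delta with status R and differing old and new paths, so the "(was ...)" line can finally be reached.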
Adrian W