How do I track Git commits for a time period using GitPython

Question

I want to know how to track if a branch in Git is pushed or not. Basically, I want to find all branches from a develop branch (not repo) and be able to check if any of those branches are pushed (after some changes) or not.

So far, using GitPython and the details in two related posts, I have worked out the following:

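  from git import Repo

  repo = Repo('directory_of_repo')  # working tree has develop checked out, not master
  paths = set()
  # develop@{8 days ago} is reflog syntax: where the name develop pointed 8 days ago
  for item in repo.head.commit.diff('develop@{8 days ago}'):
      # keep only the paths that mention the directory of interest
      if 'a_certain_directory' in item.a_path:
          paths.add(item.a_path)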

Here my repo has develop checked out. I am not sure whether I should use HEAD@{8 days ago} or develop@{8 days ago} (the number of days can vary). For 8 days back, using HEAD gives 86 unique paths, while using develop gives 15.

What I am looking for is all the paths that have been changed (i.e. some file inside them was updated) in a certain period of time (for example, the last 8 days) on the develop branch. Which one should I use (HEAD or develop) to track changes for a certain time period on develop?

torek · Answer 1 · 2019-12-14T01:37:20.177

I want to find all branches from a master branch (not repo)

This does not make sense, because branches do not branch from branches.

... and be able to check if any of those branches are pushed (after some changes) or not.

This question could make sense, if rephrased somewhat.

The key to both of these is to understand what branches are, or are not. But before you can do that, you have to realize that not everyone means the same thing by the word branch. In fact, someone can say the word branch more than once in the same sentence and mean two or more different things.

(Related: What exactly do we mean by "branch"?)

What really matters, in any repository, is not the branches. The commits are what matter. Branches are just how you find commits. In this particular case, what I mean by the word branch is a branch name, such as master or develop. These names are specific to this one repository. A clone of this repository, in some other Git, perhaps on some other computer or on some cloud-server or whatever, has its own branch names, independent of your own.

When you connect two Git repositories to each other, using git fetch or git push, one Git sends commits to the other Git, which receives them. The receiving Git can see some or all of the sending Git's names (branch names, tag names, and other names), but what really matters are the commits. Having sent or received some commits, though, we're left with the problem of finding the commits.

Every commit has a unique hash ID. This hash ID is big and ugly and impossible for humans to remember. Fortunately, each commit remembers some set of previous commit hash IDs—usually exactly one. Git calls this the parent commit. The child commit remembers the hash ID of its parent. When you make a new commit, Git assigns the new commit a new, unique hash ID, and puts the hash ID of the commit you were using, just now, into the new child commit as the child's parent. (And now you're using the child commit.)

Of course, that parent commit is probably itself the child of some previous commit. So the parent remembers its parent—the grandparent of the child you just created—and that commit remembers its parent, and so on. The result is a long, backwards-pointing chain, in which the last commit is perhaps the most interesting:

... <-F <-G <-H

Here H stands for the hash ID of the last commit. Because H holds the hash ID of its parent G, we can use H to find G. Meanwhile, G holds the hash ID of its parent F, so we can use G to find F, and so on. These backwards-pointing arrows mean that we only need to have Git remember for us the hash ID of the last commit in the chain.
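
As a rough illustration of this backwards-pointing chain, here is a minimal Python sketch. The `Commit` class, the `walk_back` helper, and the single-letter IDs are made up for this example, and F is treated as the first commit only to keep it short. Real Git uses long hash IDs and stores much more in each commit.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    """Toy stand-in for a commit (hypothetical model, not Git's real format)."""
    hash_id: str
    parent: Optional[str]  # None marks the first commit in this toy history


def walk_back(commits: Dict[str, Commit], last: str) -> List[str]:
    """Follow parent links from the last commit back to the first one."""
    chain: List[str] = []
    current: Optional[str] = last
    while current is not None:
        chain.append(current)
        current = commits[current].parent
    return chain


# The chain F <- G <- H from the text, with F as the root for brevity.
commits = {
    "F": Commit("F", None),
    "G": Commit("G", "F"),
    "H": Commit("H", "G"),
}
history = walk_back(commits, "H")
print(history)
```

Knowing only the ID of the last commit is enough to recover the whole chain, which is why a name only has to remember that one ID.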

This is what branch names are for. They hold the hash ID of the last commit in the chain:

...--F--G--H   <-- master, branch2, branch3

Note that here, all three names identify commit H.

If we git checkout master and then make a new commit, it will get some big ugly hash ID that we'll call I. New commit I will point back to existing commit H as its parent:

...--F--G--H
            \
             I

and now, because we picked master as our branch to git checkout, Git will update the name master to hold the hash ID of new commit I:

...--F--G--H   <-- branch2, branch3
            \
             I   <-- master (HEAD)

The attached (HEAD) is the way for us (and Git) to know which branch name to move when we make new commits. The other two branch names—branch2 and branch3—have not changed. If we git checkout branch3 we get:

...--F--G--H   <-- branch2, branch3 (HEAD)
            \
             I   <-- master

and if we now make a new commit, we get:

             J   <-- branch3 (HEAD)
            /
...--F--G--H   <-- branch2
            \
             I   <-- master

That's almost all there is to it: branch names are just pointers, pointing to commits.
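
The same sequence of drawings can be replayed with a small Python model. This is only a sketch: `ToyRepo` and its methods are hypothetical names invented here, and the model ignores everything about a real repository except names, HEAD, and parent links.

```python
from typing import Dict


class ToyRepo:
    """Hypothetical model: branch names map to commit IDs; HEAD names one branch."""

    def __init__(self, branches: Dict[str, str], head: str) -> None:
        self.branches = dict(branches)      # branch name -> ID of its last commit
        self.parents: Dict[str, str] = {}   # new commit ID -> parent commit ID
        self.head = head                    # the branch name HEAD is attached to

    def checkout(self, name: str) -> None:
        """Attach HEAD to another existing branch name."""
        if name not in self.branches:
            raise KeyError(name)
        self.head = name

    def commit(self, new_id: str) -> None:
        """Record the new commit's parent, then move only the HEAD branch name."""
        self.parents[new_id] = self.branches[self.head]
        self.branches[self.head] = new_id


# All three names start at H, as in the first drawing.
repo = ToyRepo({"master": "H", "branch2": "H", "branch3": "H"}, head="master")
repo.commit("I")          # master moves to I
repo.checkout("branch3")
repo.commit("J")          # branch3 moves to J; branch2 still points to H
print(repo.branches)
```

Only the name that HEAD is attached to ever moves, so `branch2` stays at H throughout.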

If we have our Git call up some other Git over the Internet-phone, our Git can tell their Git: Hey, I have commit I, do you have it? If they say no, our Git can give them commit I. All Gits in the universe will agree that commit I gets commit I's hash ID, and no other commit gets this hash ID. So they just have to exchange the hash IDs first: the actual contents of the commit—the snapshot of all files—can go later if needed (and can be compressed down to just what the other Git really needs), and it's just the hash IDs that matter here.

Once we have given them I (and H too if they need that, and G too if needed), they may have something that looks like this:

...--F   <-- master
      \
       G--H--I

That is, they have a name, master, pointing to their existing commit F, which has the same hash ID as our existing F and is therefore the same commit with the same files. And now, they have G-H-I too, with I pointing back to H, H pointing back to G, and G pointing back to F. But they have no name by which to find commit I.

So, our Git, having sent them commit I (and any earlier commits required), will now send them a polite request: Please, if it's OK, change your name master to point to commit I. It is up to them to decide whether or not to obey this polite request. If they do obey, they will stuff the raw hash ID—whatever that is—of commit I into their name master.

So:

... be able to check if any of those branches are pushed

This still isn't a sensible thing to do as phrased. But what we can do is call them—this other Git—up, ask them about the hash ID in their name master, and compare it to the hash ID in our name master. Are these the same hash IDs? If so, we're in sync. If not, we're not.

Exactly how we're out of sync, we won't know. We'll just know if we are in sync or not. That's probably the question you wanted here. (If you want to know exactly how we're out of sync, if we are out of sync, that's a more difficult question.)

So, to answer this new and different question, we should call up their Git, have them list their branch names and contained raw hash IDs, and compare those to our branch names and contained raw hash IDs. They'll match, or not; or perhaps we'll have branch names they don't and vice versa.

Before you do any of this, though, consider one last feature of git fetch (implemented by Git anyway: this may or may not be in your Python library, depending on how precisely it mimics Git or if it uses Git directly). I can use git fetch to have my Git connect to your Git:

git remote add my-name-for-you <url-for-your-git>
git fetch my-name-for-you

When I do this, your Git tells my Git all of its branch and tag names. My Git then lets me pick which names I like—the default is that I like all of them—and it gets the last commit from each of your branch names from you, and any earlier commits I need as well, so that I have all of your commits. Then, in my Git, it creates or updates remote-tracking names for each of your branch names:

my my-name-for-you/master holds the hash ID that your master holds;
my my-name-for-you/develop holds the hash ID that your develop holds;
... and so on, for every branch name you have.

So instead of calling your Git up again every time, I can just use my Git's memory of your Git's branch names.

If my Git's memory is out of date, I just run git fetch my-name-for-you. My Git calls up your Git, updates my memory of your names, and obtains all of your commits.

If I'm giving you commits—if I run git push my-name-for-you master—I'll send you the commits and ask your Git to set your master. Your Git will either obey, or say no and tell me a little bit about why it said no. If your Git obeys, my Git will update my my-name-for-you/master to remember that your master now stores the same hash ID I just sent you.

So, in general, instead of connecting to some other Git, you just inspect the hash IDs of your own origin/* names. The name origin is the default name for the default remote, created when you first made your Git repository by cloning some other Git repository. If necessary, you can run git fetch origin before checking the origin/* names.
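
With GitPython, that inspection might look roughly like the sketch below. The function names `out_of_sync` and `read_names` are made up for this example, and it assumes the remote is called origin and that the remote-tracking names are reasonably fresh. It reports only whether the hash IDs differ, not how.

```python
from typing import Dict, List, Tuple


def out_of_sync(local: Dict[str, str], remote_tracking: Dict[str, str]) -> List[str]:
    """Local branch names whose hash ID differs from, or has no, origin/<name>."""
    return sorted(
        name for name, hash_id in local.items()
        if remote_tracking.get(name) != hash_id
    )


def read_names(repo_path: str, remote_name: str = "origin") -> Tuple[Dict[str, str], Dict[str, str]]:
    """Collect branch-name -> hash maps for local branches and for <remote>/* names."""
    from git import Repo  # GitPython; imported here so out_of_sync() works without it

    repo = Repo(repo_path)
    local = {head.name: head.commit.hexsha for head in repo.heads}
    remote_tracking = {
        ref.remote_head: ref.commit.hexsha
        for ref in repo.remote(remote_name).refs
        if ref.remote_head != "HEAD"  # skip the symbolic origin/HEAD entry
    }
    return local, remote_tracking


# Toy data using the single-letter IDs from the drawings above.
example = out_of_sync(
    {"master": "I", "develop": "K", "topic": "J"},
    {"master": "I", "develop": "H"},
)
print(example)
```

On a real clone, something like `out_of_sync(*read_names('directory_of_repo'))` would list the local branches that do not match their origin/* counterpart, including branches that have never been pushed.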

(For some special tool purposes, it may sometimes be better to use git ls-remote instead of git fetch. This obtains the names and hash IDs—which is the first step of an actual git fetch—but then just prints them out and stops, instead of going on to do the rest of the git fetch work. The downside is that you'll probably eventually need to git fetch, but the upside is that you get a picture that's accurate for the moment, without waiting for git fetch to work. This moment may not be very long, depending on how active the other Git is.)
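
A hedged sketch of the git ls-remote route: the parser below turns the "hash, tab, ref name" lines that `git ls-remote --heads` prints into a branch-name map. The function names and the sample hash IDs are invented for illustration, and `fetch_remote_heads` assumes a `git` executable on the PATH and a remote named origin.

```python
import subprocess
from typing import Dict


def parse_ls_remote(output: str) -> Dict[str, str]:
    """Map branch name -> hash ID from `git ls-remote --heads` style output."""
    prefix = "refs/heads/"
    branches: Dict[str, str] = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        hash_id, ref = line.split(None, 1)  # fields are separated by a tab
        ref = ref.strip()
        if ref.startswith(prefix):
            branches[ref[len(prefix):]] = hash_id
    return branches


def fetch_remote_heads(repo_path: str, remote: str = "origin") -> Dict[str, str]:
    """Ask the other Git for its branch names right now (needs network access)."""
    result = subprocess.run(
        ["git", "ls-remote", "--heads", remote],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return parse_ls_remote(result.stdout)


# Made-up sample output; real hash IDs are not this regular.
sample = (
    "a" * 40 + "\trefs/heads/develop\n"
    + "b" * 40 + "\trefs/heads/master\n"
)
remote_heads = parse_ls_remote(sample)
print(remote_heads)
```

The resulting map can be compared against the hash IDs in your own branch names, with the caveat from above that the answer is only accurate for the moment it was obtained.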

Super! Very educational and helpful, and much appreciated. I was reading more about it and heading in the same direction, so your summary is very helpful. I came up with a solution using GitPython and Python; I will post it here once it's tested and I am sure about it. - Alan, Dec 14 '19 at 00:41

score 0 · Accepted Answer · answered Dec 28 '19 at 01:21

Using git log with the --since option, along with the --name-only option, we can get all the files changed since a given time.

  from datetime import date, timedelta
  from git import Git

  def _get_the_changed_components(self):
      # Method of a class that provides repo_directory, time_period (days) and paths (a set)
      g = Git(self.repo_directory)  # repo directory has `develop` checked out
      since = date.today() - timedelta(days=self.time_period)  # start of the period
      # The empty tformat suppresses the commit headers, leaving only file names
      loginfo = g.log('--since={}'.format(since), '--pretty=tformat:', '--name-only')
      for file in loginfo.split('\n'):
          if file:  # skip empty entries (blank lines or empty output)
              self.paths.add(file)
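
A possible variation, sketched under some assumptions, is to name the branch explicitly and let GitPython walk the commits. This avoids the HEAD versus develop question from the reflog syntax, since --since filters commits by date rather than consulting the reflog. The function names `paths_under` and `changed_paths` are made up here, `paths_under` matches by directory prefix rather than by substring as in the question, and `commit.stats` runs one Git command per commit, so this can be slow on long histories. Renamed files may also show up under a combined "old => new" style key.

```python
from typing import Iterable, Set


def paths_under(paths: Iterable[str], directory: str) -> Set[str]:
    """Keep only the paths that live inside `directory` (prefix match)."""
    prefix = directory.rstrip("/") + "/"
    return {path for path in paths if path.startswith(prefix)}


def changed_paths(repo_path: str, branch: str = "develop", since: str = "8 days ago") -> Set[str]:
    """Paths touched by commits reachable from `branch` and newer than `since`."""
    from git import Repo  # GitPython; imported here so paths_under() works without it

    repo = Repo(repo_path)
    changed: Set[str] = set()
    # Extra keyword arguments are passed to git rev-list, so this becomes --since=...
    for commit in repo.iter_commits(branch, since=since):
        changed.update(commit.stats.files.keys())
    return changed


# Toy data to show the filtering step on its own.
filtered = paths_under({"src/a.py", "docs/x.md", "srcx/b.py"}, "src")
print(filtered)
```

On a real clone this could be combined as `paths_under(changed_paths('directory_of_repo'), 'a_certain_directory')`.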
