3

My goal is to access existing Git repos from Python. I want to get the repo history and on-demand diffs.

In order to do that I started with dulwich. So I tried:

from dulwich.repo import Repo
Repo.init('/home/umpirsky/Projects/my-exising-git-repo')

and got OSError: [Errno 17] File exists: '/home/umpirsky/Projects/my-exising-git-repo/.git'

The docs say "You can open an existing repository or you can create a new one."

Any idea how to do that? Can I fetch history and diffs with dulwich? Can you recommend any other library for Git access? I am developing an Ubuntu app, so an Ubuntu package would be appreciated for easier deployment.

I will also poll periodically to detect new changes in the repo, so I would rather work with the remote, so that I can detect changes that have not been pulled locally yet. I'm not sure how this should work, so any help will be appreciated.

Thanks in advance.

umpirsky
  • I don't know dulwich, but here is a hint: "init" in Git is the command to **create** a repository. So you have to look further in the docs. – Niclas Nilsson Jan 04 '12 at 09:05
  • Yeah, I think so too, but the docs are so poor :) As you can see at http://www.samba.org/~jelmer/dulwich/apidocs/dulwich.repo.BaseRepo.html init is not documented. I tried Repo.init("myrepo", mkdir=False) from http://www.samba.org/~jelmer/dulwich/docs/tutorial/object-store.html but I got the same error. – umpirsky Jan 04 '12 at 09:12
  • Yeah! I tried to get git-python working a year or so ago, but I gave up. Then I read about the different Git modules for Python and ended up just calling git via subprocess. – Niclas Nilsson Jan 04 '12 at 09:34

2 Answers

6

I think the init method is used to create a new repository. To open an existing one, just pass its path to the constructor:

from dulwich.repo import Repo
repo = Repo("/path/to/existing/repo")  # directory that contains .git

For a summary of alternative libraries, please have a look at this answer. Basically, it suggests that the subprocess module is the easiest route, because it lets you use the git interface you already know.
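
As a rough sketch of the subprocess route, and of keeping the output parsing manageable, the snippet below asks `git log` for NUL-separated fields and splits them into dictionaries. The helper names `parse_log` and `git_log` and the choice of fields are illustrative assumptions, not something taken from the linked answer.

```python
import subprocess

# Fields: commit hash, author name, author timestamp (Unix), subject.
# %x00 makes git emit a NUL byte between fields.
LOG_FORMAT = "%H%x00%an%x00%at%x00%s"


def parse_log(output):
    """Turn `git log --pretty=format:LOG_FORMAT` output into a list of dicts."""
    commits = []
    for line in output.split("\n"):
        if not line:
            continue
        sha, author, timestamp, subject = line.split("\x00", 3)
        commits.append({
            "sha": sha,
            "author": author,
            "time": int(timestamp),
            "subject": subject,
        })
    return commits


def git_log(repo_path, max_count=10):
    """Run git inside repo_path and return up to max_count commits."""
    output = subprocess.check_output(
        ["git", "log", "--max-count=%d" % max_count,
         "--pretty=format:" + LOG_FORMAT],
        cwd=repo_path,
    )
    return parse_log(output.decode("utf-8", "replace"))
```

Calling `git_log("/path/to/existing/repo")` would then return the most recent commits of the current branch as dictionaries, provided the `git` executable is on the PATH.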

jcollado
  • Thanks, that did the trick. I am able to fetch the revision history now with `repo.revision_history(repo.head())`. I'm not sure how to fetch the list of changed files in a particular commit, but I will continue to experiment. The reason I don't use `subprocess` is that I want to avoid parsing output. The question remains how to check for new revisions that have not been pulled locally. I guess `repo.revision_history(repo.head())` only looks at the local checkout. – umpirsky Jan 04 '12 at 09:41
  • Can a Repo be created from a URI instead of a path? I see there is a function get_transport_and_path ("Obtain a git client from a URI or path") http://www.samba.org/~jelmer/dulwich/apidocs/dulwich.client.html#get_transport_and_path It returns a GitClient, but I can't do much with it. On the other hand, the Repo docstring says `A git repository backed by local disk.`. – umpirsky Jan 04 '12 at 11:55
  • According to the `dulwich.repo.Repo` docstring, you can only open a "git repository backed by local disk". It looks like `dulwich.web.get_repo` might be useful for what you need, but I haven't figured out which backend is the right one to use. – jcollado Jan 04 '12 at 12:29
  • Ugh, I don't know. After pysvn I thought git would be a piece of cake, but it looks like I was wrong. After hours of messing around with dulwich I started looking at GitPython http://packages.python.org/GitPython/0.3.1/tutorial.html – umpirsky Jan 04 '12 at 14:26
  • The git protocol makes it impossible to really open a remote repository - all you can do is fetch pack files or tarballs and upload pack files. dulwich.web contains a Git server implementation for HTTP; you can't use it to open HTTP repositories as if they were local. – jelmer Jan 04 '12 at 16:47
4

Most of Dulwich's documentation assumes a fair bit of knowledge of the Git file formats and protocols.

You should be able to open an existing repository with Repo:

from dulwich.repo import Repo
x = Repo("/path/to/git/repo")

or create a new one:

x = Repo.init("/path/to/new/repo")

To get the diff for a particular commit (the diff with its first parent):

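import sys
from dulwich.patch import write_tree_diff
commit = x[commit_id]  # commit_id: SHA of the commit to inspect
parent_commit = x[commit.parents[0]]  # first parent only
# Pass the two tree IDs, not the commit objects.
# On Python 3, dulwich writes bytes, so sys.stdout.buffer is likely needed instead.
write_tree_diff(sys.stdout, x.object_store, parent_commit.tree, commit.tree)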

The Git protocol only allows fetching and sending packs; it doesn't allow direct access to specific objects in the database. This means that to inspect a remote repository you first have to fetch the relevant commits from the remote repo into a local one, and then you can view them:

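from dulwich.client import get_transport_and_path
client, path = get_transport_and_path(remote_url)  # remote_url: URL of the remote repo
# Fetch missing objects into the local repo x; the result maps ref names to commit IDs.
remote_refs = client.fetch(path, x)
# Newer dulwich releases on Python 3 use byte-string ref names, e.g. b"refs/heads/master".
print(x[remote_refs["refs/heads/master"]])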
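
For the periodic check mentioned in the question, one possible approach is to keep the ref map from the previous fetch and compare it with the new one. The sketch below works on plain dictionaries that map ref names to commit IDs, like `remote_refs` above. The function name `changed_refs` and the sample IDs are made up for illustration, and whether `client.fetch` returns a plain dict depends on the dulwich version.

```python
def changed_refs(old_refs, new_refs, prefix="refs/heads/"):
    """Return {ref: (old_id, new_id)} for branches that moved or appeared.

    old_id is None for a branch that was not present in old_refs.
    Refs outside `prefix` (tags, for example) are ignored, and branches
    deleted on the remote are not reported.
    """
    changes = {}
    for ref, new_id in new_refs.items():
        if not ref.startswith(prefix):
            continue
        old_id = old_refs.get(ref)
        if old_id != new_id:
            changes[ref] = (old_id, new_id)
    return changes


# Hypothetical ref maps from two consecutive polls.
previous = {"refs/heads/master": "aaa111", "refs/tags/v1": "ccc333"}
current = {
    "refs/heads/master": "bbb222",   # master moved
    "refs/heads/feature": "ddd444",  # new branch
    "refs/tags/v1": "ccc333",        # skipped by the default prefix
}
print(changed_refs(previous, current))
```

An empty result means nothing new arrived on any branch since the last poll; otherwise each new ID can be used as a starting point for walking the history.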
jelmer
  • Thanks, great answer. Allow me to ask some sub-questions, since I'm new to this. Which commits does remote_refs from your last snippet contain? I fetched, for example, your dulwich git repo http://gist.github.com/1562112 and I got strange commits that are not listed on https://github.com/jelmer/dulwich/commits/master and it looks like they are not sorted by commit time. My goal is to fetch all commits sorted by commit time (limited to the last x commits). Thanks again. – umpirsky Jan 04 '12 at 21:11
  • They're the other branches and tags in the repository. See https://github.com/jelmer/dulwich/branches and https://github.com/jelmer/dulwich/tags – jelmer Jan 04 '12 at 22:05
  • If you want to access the last X commits in the remote master branch, you probably want something like: commit_ids = repo.get_walker(include=[remote_refs["refs/heads/master"]], max_entries=10) – jelmer Jan 04 '12 at 22:07
  • I get `AttributeError: 'Repo' object has no attribute 'get_walker'` where repo is a `dulwich.repo.Repo`, which is strange since it's documented at http://www.samba.org/~jelmer/dulwich/apidocs/dulwich.repo.BaseRepo.html#get_walker Also, I would like to get the last X commits no matter which branch they are in, so I can track all branches and always get the latest commits by time. Thanks. – umpirsky Jan 05 '12 at 10:02
  • There is only `get_graph_walker(self, heads=None)` – umpirsky Jan 05 '12 at 12:08
  • get_walker() is fairly new - if you are running an older version of dulwich, you want .revision_history(remote_refs["refs/heads/master"])[:10] – jelmer Jan 05 '12 at 12:46
  • That works, thanks, answer accepted. Now if I want to track all branches, I take only the refs that start with 'refs/heads/', get the history from each, sort by date, and that's it. Or can I do it in one step with dulwich? Also, one tip: it would be cool if we didn't need to worry about whether the repo already exists, and could just call `Repo.init()` to create it if it does not exist or get a repo instance if it does. Regarding versions, I asked a question on Launchpad yesterday https://answers.launchpad.net/dulwich/+question/183736 Thanks a bunch. – umpirsky Jan 05 '12 at 13:26
  • When I try to get the diff with https://gist.github.com/1565342 I get https://gist.github.com/1565347. I would also like to get the list of full paths of the changed files for each commit, and maybe their status (modified, added...). Is that possible? – umpirsky Jan 05 '12 at 13:54
  • You can use repo.object_store.tree_changes() to iterate over the changes per file, like modified/removed/added. – jelmer Jan 05 '12 at 15:33
  • I'm not sure what's wrong in your diff code, it seems correct to me. What version of dulwich is that with? – jelmer Jan 05 '12 at 15:42
  • It happens with both 0.7.1 and 0.8.1, same error for https://gist.github.com/1566513. Is this a bug? Can you reproduce it? – umpirsky Jan 05 '12 at 19:00
  • You're passing in the commit object for the parent, not its tree. You want to pass repo[commit.parents[0]].tree as the second argument to tree_changes. – jelmer Jan 06 '12 at 12:36
  • Aaah, ok. Maybe your snippet in answer for fetching diff should be updated then, it can confuse someone. Thanks. – umpirsky Jan 06 '12 at 13:01
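
Putting the last few comments together, a hedged sketch of listing the changed paths with a status for one commit might look like this. It assumes that `object_store.tree_changes()` yields `((old_path, new_path), (old_mode, new_mode), (old_sha, new_sha))` tuples, with `None` on the missing side for added or removed files, which should be checked against the installed dulwich version. `changed_files` is a hypothetical helper name.

```python
def changed_files(repo, commit_id):
    """Return [(status, path)] for a commit, compared with its first parent.

    `repo` is expected to behave like dulwich.repo.Repo. Both arguments
    to tree_changes() are tree IDs, not commit objects.
    """
    commit = repo[commit_id]
    # Assumption: None stands for an empty tree when there is no parent.
    parent_tree = repo[commit.parents[0]].tree if commit.parents else None
    result = []
    changes = repo.object_store.tree_changes(parent_tree, commit.tree)
    for (old_path, new_path), _modes, _shas in changes:
        if old_path is None:
            result.append(("added", new_path))
        elif new_path is None:
            result.append(("removed", old_path))
        else:
            result.append(("modified", new_path))
    return result
```

With a real repository this would be called as `changed_files(Repo("/path/to/git/repo"), commit_id)`; on Python 3 the paths are likely to come back as byte strings.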