How does one export a remote git repository to a local space, only taking the head revision of a given branch, and then for each exported file, acquire the commit-id of that file?

What I have tried so far

Execute this:

git clone {gitUrl} {repoDir} --branch {branch}

And then, for each exported file (ignoring .git and its contents), execute this:

git rev-list -1 HEAD {file}

... where the placeholders are defined as follows:

  1. {gitUrl} is the HTTP URL of the repository. User credentials can be embedded.
  2. {repoDir} is the path of the export in your local system.
  3. {file} is the full path of the exported file that you are extracting the commit id for.
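For concreteness, the per-file step might be scripted roughly as follows. This is only a sketch: it builds a throwaway repository with mktemp as a stand-in for {repoDir}, and the file names, commit messages and identity settings are made up for illustration. It enumerates files with git ls-files rather than walking the directory, which skips .git automatically.

```shell
#!/bin/sh
set -eu

# Throwaway repository standing in for {repoDir} (illustrative only).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"

# Two commits: the first adds both files, the second changes only b.txt.
echo "one" > a.txt
echo "two" > b.txt
git add a.txt b.txt
git commit -q -m "add a.txt and b.txt"
echo "two, revised" > b.txt
git commit -q -am "revise b.txt"

# For every tracked file, print the last commit that changed it.
# The "--" separates the revision from the path.
git ls-files | while IFS= read -r file; do
    printf '%s %s\n' "$(git rev-list -1 HEAD -- "$file")" "$file"
done
```

With the full history available, a.txt reports the first commit and b.txt the second.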

While this works, it is slow and inefficient: the git clone operation downloads the whole history of the repo for that branch, whereas we are only interested in the HEAD version and its metadata.

Alternatively, we could perform the export with:

git clone {gitUrl} {repoDir} --branch {branch} --depth 1

This is more efficient, as it pulls down only the HEAD version. The problem is that the subsequent git rev-list -1 HEAD {file} command then returns the commit-id of HEAD as a whole, not the file's commit-id.

Can I have my cake and eat it too?

Sean B. Durkin

1 Answer

Can I have my cake and eat it too?

The short answer is no.

Long

Technically, the commit ID of each file in the HEAD commit is the hash ID you get with git rev-parse HEAD (or the longer but equivalent git rev-list command you're using). That's because each commit contains a full snapshot of every file that Git knows about.

What you get when you use git rev-list or git log (or, on a per-line basis within one file, git blame) to look backwards in history is not the commit hash ID of the file in question, because that's trivial. Instead, it's the hash ID of some earlier commit that contains the same version of the file or, for git blame, the same line.

That is, suppose we have, in our Git repository, a simple linear history with just five commits in it. We can draw these five commits like this:

A <-B <-C <-D <-E   <--master

where each uppercase letter stands in for an actual commit hash ID. The branch name, in this case master, serves to let us find the actual hash ID of commit E, since it looks random, and is difficult or sometimes impossible to find otherwise.

Commit E, of course, contains a full snapshot of every file, as of the form it had when we—or whoever—made commit E. It also contains the hash ID of earlier commit D. Git calls D the parent of commit E.

But commit D also has a full snapshot of every file as of the form it had when someone made D, and a link back to its parent C. This repeats for C and so on, back throughout history (which ends when we hit A, which has no parent commit).
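To see these parent links directly, something like the following can be run in a scratch repository. It is a minimal sketch: the three commits, the file name and the identity settings are invented for the example.

```shell
#!/bin/sh
set -eu

# Scratch repository with a three-commit linear history: A <- B <- C.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"

for name in A B C; do
    echo "$name" > file.txt
    git add file.txt
    git commit -q -m "$name"
done

# The raw commit object names its snapshot (tree) and its parent commit.
git cat-file -p HEAD

# HEAD^ resolves to the parent's hash ID.
parent=$(git rev-parse HEAD^)
echo "parent of HEAD: $parent"

# The commit with no parent is where the history ends.
root=$(git rev-list --max-parents=0 HEAD)
echo "root commit:    $root"
```

The cat-file output for HEAD has a "parent" line; the same command on the root commit has none.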

What we'd like, in this case, is to have Git compare the snapshot of some file—README.md, main.py, or whatever—that appears in commit E with the one that appears in its parent commit D. If these two snapshots are the same, we'd like to have Git compare D's with C's. If those are the same, Git should keep working backwards. It should do this until it either runs out of commits at A, or the comparison shows that the two files are different.[1]

In other words, we're repeatedly executing a simple comparison operation:

  • Is file F the same or different in commits X and Y?

for each parent/child pair of commits. As soon as the answer is "different", we have Git stop going backwards through history and print the hash ID of the commit it's reached at this point. (The internal storage format, which de-duplicates files across commits, makes this really easy. With git blame, the computation is considerably harder and fancier, but it amounts to the same thing, just on a line-by-line basis.)
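The walk described above can be imitated by hand for a strictly linear history. The sketch below is not how Git implements it internally; last_change and snapshot are hypothetical helper names, and the five commits are made up to mirror the A to E picture. It compares the blob hash ID of the file in each commit with the one in its parent, which is cheap precisely because identical content has an identical hash.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"

# Hypothetical helper: commit everything in the work-tree.
snapshot() { git add -A; git commit -q -m "$1"; }

echo v1 > README.md; echo v1 > main.py; snapshot A
echo x > other.txt;                     snapshot B
echo v2 > README.md;                    snapshot C
echo v2 > main.py;                      snapshot D
echo y > other.txt;                     snapshot E

# Hypothetical helper: walk back from HEAD, one parent at a time,
# until the file's blob ID differs between a commit and its parent
# (or until a commit with no parent is reached).
last_change() {
    file=$1
    commit=$(git rev-parse HEAD)
    while parent=$(git rev-parse -q --verify "$commit^"); do
        here=$(git rev-parse -q --verify "$commit:$file" || true)
        there=$(git rev-parse -q --verify "$parent:$file" || true)
        if [ "$here" != "$there" ]; then
            break
        fi
        commit=$parent
    done
    echo "$commit"
}

echo "README.md last changed in $(last_change README.md)"
echo "main.py   last changed in $(last_change main.py)"
```

On this history the results agree with git rev-list -1 HEAD -- README.md and git rev-list -1 HEAD -- main.py. Merge commits, which have more than one parent, are deliberately left out of the sketch.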

In order to do this, though, Git must have access to each of the commits that it needs to traverse as it walks backwards through history. History, in Git, is the set of commits in the repository. Git must have the history to use the history.
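The effect of missing history is easy to reproduce locally. The following sketch uses a made-up two-commit origin repository and clones it twice over a file:// URL (Git ignores --depth for plain local-path clones); the directory and file names are assumptions for the example.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
origin="$work/origin"
git init -q "$origin"
cd "$origin"
git config user.name "Example"
git config user.email "example@example.invalid"

# old.txt is added first and never touched again.
echo v1 > old.txt; git add old.txt; git commit -q -m "add old.txt"
echo v1 > new.txt; git add new.txt; git commit -q -m "add new.txt"

cd "$work"
git clone -q "file://$origin" full
git clone -q --depth 1 "file://$origin" shallow

# With full history, Git can walk back to the commit that added old.txt.
full_id=$(git -C full rev-list -1 HEAD -- old.txt)

# With depth 1 there is nothing before HEAD, so HEAD itself is reported.
shallow_id=$(git -C shallow rev-list -1 HEAD -- old.txt)

echo "full clone:    $full_id"
echo "shallow clone: $shallow_id"
```

The two IDs differ: the full clone reports the first commit, the shallow clone reports HEAD.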


[1] A simple and expedient trick, which Git actually does use, is that when we hit the parent-less commit A (Git calls this a root commit), it can simply pretend that there is a totally empty commit before A. Then every file in A is new, and therefore different from its virtual/fake parent. This is why every Git repository includes the empty tree.
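As a rough illustration of that trick, the empty tree's hash ID can be computed on the spot and used as the "before" side of a diff against the root commit. The repository below is a scratch one with invented contents.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"

echo hello > README.md
git add README.md
git commit -q -m "A"

root=$(git rev-list --max-parents=0 HEAD)

# Hash of a tree with no entries; nothing is written to the repository.
empty_tree=$(git hash-object -t tree /dev/null)
echo "empty tree: $empty_tree"

# Compared with the empty tree, every file in the root commit is an addition.
git diff-tree -r --name-status "$empty_tree" "$root"
```

Every file in the root commit shows up with status A (added).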

torek