GIT: Get all git object hashes of blobs added to the repository by a commit

Question

Is it possible to get a list of all git object hashes of blobs which have been added to the repository by a given commit hash using the git command line tools?

I already tried achieving this with the git plumbing tool git-diff-tree. Maybe it's the wrong approach. Below is the best result I could get so far, but the (very long) man page didn't help me work out exactly how the output has to be interpreted.

$ git diff-tree --no-commit-id 2b53d04dbb7cd35d030ddc59b13c0836a87daeb7 
:100644 100644 03f15b592c7d776da37e3d4372c215b14ff8820f 6e0ed0b1ed56e9a35a3be52a9de261c8ffcccae8 M      file1.ts
:100644 100644 b5083bdb9c31005ebd16835a0f49dc848d3f387a 4b7f9e6624a66fec0510d76823303017e224c9d7 M      file2.ts
:100644 100644 368d64862e6aa2a0110f201c8a5193d929e2956d 0e51626a9866a8a3896489f497fbd745a5f4a9f2 M      file3.ts
:040000 040000 c332b1e576af0dbb93cc875106bc06c3de6b74c8 f7f3478a9b0eaac85719699d97e323563a1b102b M      some_folder

Do the first and second git object blob hashes show the old and new objects for the modified file respectively? In the worst case I could fetch that information by parsing the output.

My primary goal was to find a command line which works as below:

$ git <command> <option1> <option2> 368d64862e6aa2a0110f201c8a5193d929e2956d 
6e0ed0b1ed56e9a35a3be52a9de261c8ffcccae8 
4b7f9e6624a66fec0510d76823303017e224c9d7 
0e51626a9866a8a3896489f497fbd745a5f4a9f2

Edit below in response to @torek

In response to @torek's answer, I want to be clearer about my intentions, because he is absolutely right to point out that "new" isn't necessarily new.

I am planning to use git rev-list --reverse <branch> to get a list of all commits on that branch in commit order. Then I want to visit every commit in this order and collect, per commit, the blob hashes seen for the first time on this branch.

The end result should be something like the following:

C:368d64862e6aa2a0110f201c8a5193d929e2956d
B:03f15b592c7d776da37e3d4372c215b14ff8820f
B:4b7f9e6624a66fec0510d76823303017e224c9d7
B:c332b1e576af0dbb93cc875106bc06c3de6b74c8
C:5521a02ce1bc4f147d0fa39a178512476764dd66 
B:e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e
B:adc83b19e793491b1c6ea0fd8b46cd9f32e592fc
C:a3db5c13ff90a36963278c6a39e4ee3c22e2a436
B:4888920a568af4ef2d2f4866e75b4061112a39ea
.
.
.

C: commit B: blob

If this isn't easily done, it would be OK to do two passes. In the first pass, blobs may be listed multiple times in different commits, for the reasons you have pointed out:

adding a file with the same content as another file
a file having the same content as before after it has been modified

I could then do a second pass piping the file through awk '!x[$0]++' which will remove any duplicates. This wouldn't be very efficient but would get the result I want.

I hope I made my intentions clear now. Any thoughts?

Depending on how much memory you have available in whatever language you're going to write this in, you might just run `git rev-list --reverse` to get your list of commit hashes, then, in that programming language, invoke `git ls-tree -r` on each commit and get all blob hashes. If you can hold all blob hashes in an associative array, it's now a simple test: `for h in (ls-tree -r of c) // h is hash, c is commit` `if h in array { not new } else { array[h] = c; is new }` — torek, Oct 18 '19 at 23:19
Note that the order in which commits are visited by `git rev-list --reverse` is pretty loosely specified in the documentation. In nonlinear cases, i.e., when forking at merges during the history walk, `git rev-list` uses a priority queue. The priority depends on the sorting options you specify. The default is to use the committer date; `--topo-order` guarantees that you'll go down each leg of a merge "all at once" without interleaving. — torek, Oct 18 '19 at 23:21
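
The approach described in these comments can be sketched in shell roughly as follows. This is only an illustration, not a tuned implementation: the function names first_seen_blobs and keep_first_seen are made up for this sketch, --topo-order is one possible ordering choice, and running git ls-tree -r on every commit will be slow on large histories.

```shell
# keep_first_seen: pass every "C:<hash>" line through unchanged, but print
# each "B:<hash>" line only the first time it appears in the stream.
keep_first_seen() {
    awk '/^C:/ { print; next } !seen[$0]++'
}

# first_seen_blobs (hypothetical helper): walk a branch oldest-first and print
# each commit followed by the blobs that no earlier commit in the walk contained.
first_seen_blobs() {
    branch=$1
    git rev-list --reverse --topo-order "$branch" |
    while read -r commit; do
        printf 'C:%s\n' "$commit"
        # ls-tree -r prints "<mode> <type> <hash><TAB><path>" for every file.
        git ls-tree -r "$commit" | awk '$2 == "blob" { print "B:" $3 }'
    done | keep_first_seen
}

# Hypothetical usage, from inside a repository:
#   first_seen_blobs master
```

A commit that introduces no previously unseen blob still gets its C: line, just with no B: lines after it.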

score 1 · Accepted Answer · answered Oct 18 '19 at 21:48

Is it possible to get a list of all git object hashes of blobs which have been added to the repository by a given commit hash using the git command line tools?

Yes and/or no: you have to define precisely what you mean by added to the repository.

Suppose, for instance, that I start with a totally empty repository:

$ mkdir foo && cd foo && git init
Initialized empty Git repository in ...

Now I create README.md and git add it and commit:

$ echo for testing > README.md
$ git add README.md
$ git commit -m initial
[master (root-commit) 19278e9] initial
 1 file changed, 1 insertion(+)
 create mode 100644 README.md

README.md is a blob and its hash ID is:

$ git rev-parse HEAD:README.md
43b18adf702be62761e3affd85c4c3ee5c396be7

Later, I write a new file:

$ echo for testing > newfile.txt
$ git add newfile.txt
$ git commit -m 'add new file'
[master 5521a02] add new file
 1 file changed, 1 insertion(+)
 create mode 100644 newfile.txt

If we look at this commit, we'll see the new file. If we look at it with git show --raw we'll see it in the git diff-tree format:

$ git show --raw
commit 5521a02ce1bc4f147d0fa39a178512476764dd66 (HEAD -> master)
Author: Chris Torek <chris.torek gmail.com>
Date:   Fri Oct 18 14:10:55 2019 -0700

    add new file

:000000 100644 0000000 43b18ad A        newfile.txt

This seems like a blob that's been added to the repository, but wait, there's something awfully familiar about 43b18ad:

$ git rev-parse HEAD:newfile.txt
43b18adf702be62761e3affd85c4c3ee5c396be7

Yes, that's the same hash ID as README.md:

$ git ls-tree -r HEAD
100644 blob 43b18adf702be62761e3affd85c4c3ee5c396be7    README.md
100644 blob 43b18adf702be62761e3affd85c4c3ee5c396be7    newfile.txt

It's one blob, but two files. Is that really newly added?
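
To see why both files share one blob, it may help to compute the ID by hand. The sketch below assumes a repository using the SHA-1 object format and a sha1sum utility on the PATH; blob_id is a name made up for this illustration. In practice git hash-object <file> does this job.

```shell
# blob_id (hypothetical helper): compute the object ID that a SHA-1 Git
# repository assigns to a file's content, i.e. the SHA-1 of the header
# "blob <size>" plus a NUL byte, followed by the raw file bytes.
blob_id() {
    size=$(wc -c < "$1" | tr -d ' ')
    { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

# Hypothetical usage: two files with identical content get the same ID,
# because the path is not part of what is hashed.
#   blob_id README.md
#   blob_id newfile.txt
```

The file name lives in the tree object, not in the blob, which is why adding a second file with the same content creates no new blob object.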

If your answer to the above is "yes, it's new, even though it's old", that might answer this second question. If your answer is "no, it's not new", what about a commit that reintroduces a blob that was removed in a previous commit? Or, if two commits I and J made in parallel on two branches:

          I   <-- br1
         /
...--G--H
         \
          J   <-- br2

both introduce the same blob, which one actually adds it as all-new, and which one merely duplicates the other?

In general, if you want blobs that are entirely new, you'll have to walk the entire commit graph, inspecting each commit's tree (see git ls-tree -r), and select which commits first introduce a blob object ID that is not already in some earlier (parent-wise and/or date-and-time-wise) commit. If you want "newly added as a new file in this commit", inspect the commit and its parent(s), perhaps using git diff-tree or similar. Note that an all-new file has an all-zero mode and hash on the parent side and a status letter of A (added), while a file modified from its parent has a status letter of M (modified) and two nonzero hashes. A file nominally deleted (one that existed in the parent but no longer exists in the child) has a status letter of D (deleted). If you enable rename detection, you'll get R statuses and similarity index values; you may want to disable this, or at least force the similarity threshold to 100%.
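
For the "compare against the parent" reading, a minimal sketch of pulling the new-side hashes out of the raw diff-tree format might look like this. The function name new_blob_hashes is made up for this illustration. It assumes rename detection is off, keeps only A and M entries, and ignores other statuses such as T (type change), which you may or may not want to count.

```shell
# new_blob_hashes (hypothetical helper): read raw diff-tree output on stdin and
# print the new-side hash of every added (A) or modified (M) blob.
# Raw line format: ":<oldmode> <newmode> <oldhash> <newhash> <status><TAB><path>"
new_blob_hashes() {
    awk '
        # Skip tree entries (040000) and submodule entries (160000):
        # their hashes do not name blobs.
        ($5 == "A" || $5 == "M") && $2 != "040000" && $2 != "160000" { print $4 }
    '
}

# Hypothetical usage, from inside a repository:
#   git diff-tree -r --root --no-commit-id --no-renames <commit> | new_blob_hashes
```

Here -r recurses into subdirectories so that files rather than directories are listed, and --root makes a parentless first commit show its files as added. By default diff-tree prints no diff for a merge commit, so merges need separate handling. Hashes printed this way are new relative to the parent only, so the same blob may still exist elsewhere in the history.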
