0

How can I enumerate within a given repo branch the set of distinct files added to a given subdirectory since a certain commit?

Motivation: I'm trying to inspect a local copy of a public repo's source that lives in our private repo as a third-party subfolder. I'd like to find out which new files, if any, were added to the directory of our repo dedicated to that third-party public repo.

jxramos

3 Answers

1

If I understand your question correctly, what you are looking for is actually just a simple diff:

git diff your_commit:nested/path/to/dir third_party_commit:path/to/dir

will tell you all differences between your directory at the specified commit and the directory in the third party commit. If you are only interested in the names (and status) of changed files, you can use the --name-status flag to git diff. --stat might also be useful, depending on your exact use-case.
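As a rough illustration of the `<commit>:<path>` form, here is a minimal sketch that builds a throwaway repository and compares one directory at two commits. The tag names `old` and `new`, the directory `third_party/lib`, and the file names are all hypothetical, and both snapshots live in the same repository for simplicity rather than in a separate third-party history.

```shell
set -e
# Throwaway repository; every name below is made up for the demo.
repo=$(mktemp -d)
cd "$repo"
git init -q
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

mkdir -p third_party/lib
echo "v1" > third_party/lib/a.c
git add .
commit "old snapshot"
git tag old

echo "v2" > third_party/lib/a.c     # modify an existing file
echo "new" > third_party/lib/b.c    # add a new file
git add .
commit "new snapshot"
git tag new

# Status letter plus name for every changed file.
# Paths are printed relative to the two directories being compared.
git diff --name-status old:third_party/lib new:third_party/lib

# Names of added files only.
added=$(git diff --name-only --diff-filter=A old:third_party/lib new:third_party/lib)
echo "$added"
```

Because the two arguments name directories (trees) rather than whole commits, the reported paths do not carry the `third_party/lib/` prefix.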

knittl
1

This question is either fairly hard, or ridiculously easy, depending on just what you mean:

git diff --name-only --diff-filter=A <hash-of-B> <hash-or-branch> -- <dir>

Remember that Git stores commits, rather than files. Each commit contains files, but each commit is otherwise an independent snapshot of all files—well, all files that are in that snapshot, but that's kind of a redundant and useless way to put it. Let's just consider a tiny repository with three commits that we'll call A, B, and C so that we don't have to deal with big ugly hash IDs:

A <-B <-C   <--master

The branch name master holds the hash ID of commit C. In our case we can just look at all the commits and see that C is obviously last, but in a real repository, with thousands of random-looking hash IDs, it's too hard, so we need someone to hold the hash ID of the last commit.

Commit C has an author name-and-email and time-stamp, a committer name-and-email and time-stamp, a log message, and so on. It also holds the hash ID of commit B, so that we can go from C to B. And C holds all the files you want git checkout to put into your work-tree when you git checkout master.

Meanwhile B has author and committer and log message and so on, and holds the hash ID of its previous commit A. For its snapshot, B holds all the files you want git checkout to put into your work-tree when you git checkout <hash-of-B>.

Commit A has the usual author/committer/log metadata. It says that there is no earlier commit, so that git log can stop logging earlier commits for instance. And for its snapshot, A holds all the files you want git checkout to put into your work-tree when you git checkout <hash-of-A>.

So: suppose you have picked out some historical commit, such as B, from a slightly bigger repository, with two branches master and develop and seven commits we'll call A through G, arranged like this:

        D--E   <-- master
       /
A--B--C
       \
        F--G   <-- develop

You want to know what's different between B and ... well, this is where it gets interesting. What does

beyond a certain commit

actually mean? From B, we can go to C, if we work in the opposite direction of Git's own internal arrows. But from C, we can go to either D or F, and from there to E (if we went to D) or G (if we went to F). You need to pick a direction.
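One way to see what "picking a direction" means is to rebuild the seven-commit graph in a scratch repository and list the commits that each branch tip adds on top of B. This is only a sketch: the commits are empty, the tag `B` stands in for the hash of commit B, and the `commit` helper is a hypothetical convenience function.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "$1"; }

commit A
commit B
git tag B                  # stand-in for <hash-of-B>
commit C
git branch develop         # develop forks off at C
commit D
commit E                   # master now ends at E
git checkout -q develop
commit F
commit G                   # develop now ends at G

# Commits reachable from each tip but not from B, oldest first.
toward_develop=$(git log --reverse --format=%s B..develop | paste -s -d ' ' -)
toward_master=$(git log --reverse --format=%s B..master | paste -s -d ' ' -)
echo "toward develop: $toward_develop"
echo "toward master:  $toward_master"
```

The two ranges share commit C and then diverge, which is why "since B" has no single answer until a tip is named.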

Having picked a direction—"in the direction of the tip of develop", for instance—is it OK to just compare commit B directly to commit G? Both are complete snapshots. Suppose B has files TODO, d1/f1 and d2/f2 (a total of 3 files), and G has files d1/f1, d2/f2, d2/f3, and d3/f4 (4 files). You can then run:

git diff --name-status <hash-of-B> <anything-that-finds-commit-G>

and Git will tell you that to change commit B to match commit G, you'd have to add (A) files named d2/f3 and d3/f4. It might also tell you that you'd have to modify (M) d1/f1 and it would definitely tell you that you have to delete (D) file TODO.

Add to the --name-status a --diff-filter to make it print only the names of files that have some particular desired statuses. For instance, if you want to know which files to delete and which ones to add, use --diff-filter=AD. Git won't mention the files that need to be Modified, only those that need to be Added or Deleted.

Replace --name-status with --name-only to get the same list as before, minus the status letter. Now you'll see TODO, d2/f3, and d3/f4, without knowing that TODO should be deleted. Change --diff-filter to select only A files, and you'll no longer see TODO, so the dropped status letter is no longer important.

Now all you need to do is limit the output to just those files whose name starts with d2/. To do that, tell git diff to list only such files, by adding the pathspec d2 (you can write it as d2/ or just d2: if there are files named d2/f1 and d2/f2 there is no file just named d2: your OS can't hack that so Git won't store that).
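The whole sequence can be tried on a scratch repository that reproduces the two snapshots described above. This is a sketch under assumptions: the tags `B` and `G` stand in for the real hashes, G is committed directly on top of B (the diff only looks at the two snapshots, so the commits in between do not matter here), and the file contents are invented.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

# Snapshot B: TODO, d1/f1, d2/f2
mkdir d1 d2
echo "todo" > TODO
echo "one" > d1/f1
echo "two" > d2/f2
git add .
commit B
git tag B

# Snapshot G: TODO removed, d1/f1 modified, d2/f3 and d3/f4 added
git rm -q TODO
mkdir d3
echo "one, edited" > d1/f1
echo "three" > d2/f3
echo "four" > d3/f4
git add .
commit G
git tag G

# Every changed file, with its status letter.
git diff --name-status B G

# Only files that would have to be added or deleted, names only.
git diff --name-only --diff-filter=AD B G

# Only added files, restricted to the d2 directory by the pathspec.
added=$(git diff --name-only --diff-filter=A B G -- d2)
echo "$added"
```

The `--` separates revisions from pathspecs, which avoids ambiguity if a directory ever shares its name with a branch or tag.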

But what if, after commit B (say, in C or F), someone added some file, and then removed that file again in commit G? The above git diff won't tell you that. If you want to know that, your job is harder. You're going to have to look at every commit along the path from B to G.
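A small experiment shows the difference between comparing two snapshots and inspecting every commit in between. In this sketch the file `d2/tmp` is a hypothetical file added in C and removed again in G, the tags `B` and `G` stand in for real hashes, and `git log` with `--diff-filter=A` is one possible way (not the only one) to look at each commit along the path.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

mkdir d2
echo "two" > d2/f2
git add .
commit B
git tag B

echo "scratch" > d2/tmp    # added in C ...
git add .
commit C

git rm -q d2/tmp           # ... and removed again in G
echo "three" > d2/f3
git add .
commit G
git tag G

# Snapshot comparison: d2/tmp exists in neither B nor G, so it is not reported.
snapshot_diff=$(git diff --name-only --diff-filter=A B G -- d2)

# Per-commit view: every file added by any commit in B..G, de-duplicated.
per_commit=$(git log --name-only --diff-filter=A --pretty=format: B..G -- d2 | grep . | sort -u)

echo "snapshot diff: $snapshot_diff"
echo "per commit:"
echo "$per_commit"
```

The per-commit listing includes `d2/tmp` even though it no longer exists at G, so whether that is the desired answer depends on what "added" is supposed to mean.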

What if "beyond a certain commit" means down every path, from B to G but also from B to E? Then you'll have to look at all of those commits.

You must answer these questions for yourself, then choose how to diff.

torek
  • Definitely a lot of subtleties I didn't consider. My thoughts were constrained to the single path diff along the repo. Didn't know about the `--diff-filter` that is a great argument. I should have listed out a concrete set of git commands to unambiguously communicate the information I was going for. – jxramos Aug 28 '19 at 23:38
0

I found a bit of a roundabout way using shell commands, inspired by the following two answers: "How to list commits since certain commit?" and "git: list all files added/modified on a day (or week/month…)".

My example is taken from one of my public repos.

$:DataApp--ParamCompare jxramos$ git log --name-status --pretty=format: 2f3a5c92b8d5ce31c45a6976f2e6cfa8ac79976f...HEAD | sort | uniq | grep ^A 
A   templates/data_explore_page.html
$
$
$:DataApp--ParamCompare jxramos$ git log --name-status --pretty=format: 825faef1097207479f968c6a5353e41612127849...HEAD | sort | uniq | grep ^A 
A   templates/data_explore_page.html
A   templates/plot_page.html
A   test/test_data/1.x.csv
A   test/test_data/1.y.csv

UPDATE

We can avoid the grep ^A by adding the --diff-filter=A option. The uniq is redundant as well, since file addition is a one-time affair for the most part. That gives us

git log --name-only --diff-filter=A --pretty=format: <commitHash>...HEAD | sort | grep .
jxramos