I have a few commits of asset metadata that introduced thousands of files (several hundred megabytes' worth of tiny files). A few times since, the entirety of this metadata has been either replaced or deleted.

I know that some of these past commits are no longer relevant to the current state of the repository.

How can I find a list of commits sorted by the number of files they introduced?

Qix - MONICA WAS MISTREATED

2 Answers

For any particular SHA, you can count the files it added with the command below. The --diff-filter=A option restricts the output to Added files only, and wc -l counts the resulting lines.

numFiles=$(git diff --name-status --diff-filter=A ${sha}^! | wc -l)

If you wrap that in a simple script, you can print each SHA alongside its added-file count and pipe the result to sort. Set START_SHA and END_SHA to limit the range of commits examined.

#!/bin/sh

for sha in $(git rev-list ${START_SHA}..${END_SHA})
do
   numFiles=$(git diff --name-status --diff-filter=A ${sha}^! | wc -l)
   echo "${numFiles} ${sha}"
done
Andrew C

Fundamentally, each commit is (or "has") a stored tree that is independent of every other commit, so to get "files added by a commit" you must compare (i.e., diff) that commit against some other commit.

For many/most commits it's easy to choose the other commit: use the commit's (single) parent commit. For merge commits (those with two or more parents) the answer is less obvious and I don't know what you will want to do for these.
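One possible choice for merges, sketched here as an assumption rather than a recommendation, is to diff each merge against its first parent only, which counts the files the merge brought into the branch it was merged into. The function name merges_by_added_first_parent is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rank merge commits by files added relative to
# their FIRST parent only. Other parents are deliberately ignored.
# Usage: merges_by_added_first_parent [REV]   (REV defaults to HEAD)
merges_by_added_first_parent() {
    git rev-list --merges "${1:-HEAD}" |
    while read -r sha; do
        # ${sha}^1 names the first parent of the merge commit.
        # tr strips the padding some wc implementations emit.
        n=$(git diff --name-only --diff-filter=A "${sha}^1" "$sha" | wc -l | tr -d ' ')
        echo "$n $sha"
    done |
    sort -rn
}
```

Whether first-parent counts are meaningful depends on how the repository uses merges.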

For a root commit (a commit with no parent), you can still get the number of files added with respect to an empty tree, by diffing against git's "well known, if poorly advertised, empty tree". Or, you might choose to ignore root commits entirely (which simplifies your task).
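As a minimal sketch of the empty-tree approach, the helper below falls back to the empty tree when a commit has no parent. The name count_added is hypothetical, and the empty tree's ID is computed with git hash-object rather than hard-coded.

```shell
#!/bin/sh
# Hypothetical helper: count files added by one commit, root commits included.
# Usage: count_added <commit>
count_added() {
    sha=$1
    if git rev-parse --verify --quiet "${sha}^" >/dev/null; then
        # Ordinary case: compare against the first parent.
        base="${sha}^"
    else
        # Root commit: compare against the empty tree instead.
        base=$(git hash-object -t tree /dev/null)
    fi
    # tr strips the padding some wc implementations emit.
    git diff --name-only --diff-filter=A "$base" "$sha" | wc -l | tr -d ' '
}
```

For a root commit, this reports every file in that commit as added.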

There's no single git command that will do everything for you here, but it's easy to put together a script or pipeline that will do the trick. The main thing to know is that you will use git rev-list to generate all the candidate commit IDs:

git rev-list --min-parents=1 --max-parents=1 HEAD

will, for instance, get you a list of every commit reachable from HEAD that has exactly 1 parent (i.e., is neither a merge commit nor a root commit). It's up to you to decide whether this is the set of commits you'd like to inspect.

If it is, we're now in pretty good shape since we can simply git diff each such commit against its (single) parent:

git rev-list --min-parents=1 --max-parents=1 HEAD | \
while read sha1; do \
    ...
done

Now the trick is to get git diff to give us the number of files added, perhaps with a bit of help from another command. This is pretty easy because git diff has --name-status and --name-only options, and also a --diff-filter option. Using --name-status will get you output like this:

$ git diff --name-status 0df0541bf13723658d31b8d1376b505b710e63c6^ \
  0df0541bf13723658d31b8d1376b505b710e63c6
A       Documentation/RelNotes/2.4.5.txt
M       Documentation/git.txt
M       GIT-VERSION-GEN
M       RelNotes

Adding --diff-filter=A eliminates all but the Added files, after which we don't really need --name-status (not that it hurts either) since just the name alone, --name-only, will tell us which files were added when comparing these two commits:

$ git diff --name-only --diff-filter=A \
  0df0541bf13723658d31b8d1376b505b710e63c6^ \
  0df0541bf13723658d31b8d1376b505b710e63c6
Documentation/RelNotes/2.4.5.txt

Running this output through wc -l gets a count of lines, which is also a count of files, since each file name is on its own line.[1]

So, now we have a script that looks like this (I'll leave the backslashes out now):

git rev-list --min-parents=1 --max-parents=1 HEAD |
while read sha1; do
    echo $(git diff --name-only --diff-filter=A ${sha1}^ ${sha1} | wc -l) $sha1
done

The output of this script can then be passed to sort -rn, for instance.
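Put together, the loop and the sort could be wrapped as follows. This is a sketch under the same assumption as above (single-parent commits only), and the function name rank_commits_by_added is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list single-parent commits, most files added first.
# Usage: rank_commits_by_added [REV]   (REV defaults to HEAD)
rank_commits_by_added() {
    git rev-list --min-parents=1 --max-parents=1 "${1:-HEAD}" |
    while read -r sha1; do
        # Count added files relative to the single parent.
        # tr strips the padding some wc implementations emit.
        n=$(git diff --name-only --diff-filter=A "${sha1}^" "$sha1" | wc -l | tr -d ' ')
        echo "$n $sha1"
    done |
    sort -rn
}
```

Piping the result through head, as in rank_commits_by_added | head -n 10, shows only the largest commits.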

You may wish to tweak this somewhat, depending on what you need to do with merges. You might also want to defeat rename detection on the git diff commands, for example with --no-renames (or maybe not; it really does depend on how you're using this).


[1] Ignoring the possibility of having a newline embedded in a file name, anyway. If you want a really general purpose tool you should consider this possibility, but you can probably ignore it for your case.
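If unusual file names are a concern, one option is to ask git for NUL-terminated names with -z and count the NUL bytes instead of lines. The sketch below assumes a commit with a parent, and the name count_added_z is hypothetical. As far as I know, git's default output already quotes names containing control characters, so the line count may be safe in practice; -z simply makes that independent of quoting.

```shell
#!/bin/sh
# Hypothetical helper: count added files without relying on newlines.
# Usage: count_added_z <commit-with-a-parent>
count_added_z() {
    # -z terminates each name with a NUL byte and disables path quoting.
    # tr -cd '\0' deletes everything except NUL bytes; wc -c counts them.
    git diff -z --name-only --diff-filter=A "${1}^" "$1" |
    tr -cd '\0' | wc -c | tr -d ' '
}
```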

torek