
Suppose I have a Git repository and a directory with files.

I need to find the commit in the repository whose files differ from the files in the directory by the fewest changed lines (as measured by e.g. diff). If several commits tie for the minimum, I want the most recent one.

The repository may be huge and the number of commits large.

How can I do this?

porton
  • What have you tried so far? Git provides `git diff` for showing the changes between two revisions, and there are [already questions about counting the number of lines changed in a diff](https://stackoverflow.com/questions/767198/how-to-get-diff-to-report-summary-of-new-changed-and-deleted-lines). – larsks Feb 08 '21 at 17:41
  • https://stackoverflow.com/questions/2528111/how-can-i-calculate-the-number-of-lines-changed-between-two-commits-in-git/2528129 may be of interest – larsks Feb 08 '21 at 17:52
  • Is the directory under version control already? Is it an external directory? Please provide the relevant information. But I think you have to brute-force your solution: go over all commits, compare all commits, sort, pick best match. What constitutes "changed" to you? Fewest number of files? Fewest insertions? Fewest deletions? Do you want Git to perform rename-detection? – knittl Feb 08 '21 at 18:56

1 Answer


I think you have to brute-force this. It is going to be slow.

`git rev-list` lists all commits in a range, and `git diff --shortstat` outputs the number of changed files and lines. Unfortunately, you cannot compare commits with external directories, so you either have to commit the contents of the directory once (e.g. on an orphaned/detached branch) or check out each commit that you want to test. I assume that having the directory available as a commit will be a lot faster (YMMV).
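If you would rather not touch your branches or working tree, a variation on "commit the directory once" is to build a dangling commit through a temporary index. This is only a sketch under a few assumptions: `make_needle` is a made-up helper name, your Git is recent enough for `rev-parse --absolute-git-dir` (2.13+), and a committer identity is configured. Note that `git add` still honors ignore rules, so ignored files in the directory are left out of the snapshot.

```shell
# make_needle REPO DIR (hypothetical helper)
# Prints the hash of a parentless commit whose tree is a snapshot of DIR.
# The repository's real index, branches and working tree are not modified.
make_needle() {
  gitdir=$(git -C "$1" rev-parse --absolute-git-dir) || return 1
  tmpdir=$(mktemp -d) || return 1
  (
    cd "$2" || exit 1
    # Point Git at the repository's object store, but use DIR as the work tree
    # and a throwaway index file that Git creates on first use.
    export GIT_DIR="$gitdir" GIT_WORK_TREE="$(pwd -P)" GIT_INDEX_FILE="$tmpdir/index"
    git add -A . &&
      git commit-tree -m "snapshot of external directory" "$(git write-tree)"
  )
  status=$?
  rm -rf "$tmpdir"
  return "$status"
}

# Usage (paths are placeholders):
#   needle=$(make_needle /path/to/repo /path/to/directory)
```

The resulting commit is not referenced by any branch, so it will eventually be garbage-collected; create a branch or tag on it if you want to keep it around.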

Combine all of that and you get:

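    git rev-list --all \
      | while read -r commit; do
        # Sum insertions and deletions so that sort -n ranks by changed lines, not by changed files
        lines=$(git diff --shortstat "$commit" "$needle" \
          | awk '{ for (i = 2; i <= NF; i++) if ($i ~ /^(insertion|deletion)/) n += $(i - 1) } END { print n + 0 }')
        echo "$lines -- $(git log -1 --pretty="format:%h %ci" "$commit")"
      done | sort -n | head -10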

NB: This takes a really long time to run. If you know beforehand that only a limited set of commits are candidates, pass a range to `git rev-list` (e.g. `branch-a~10..branch-a` instead of `--all` to compare only the latest 10 commits on `branch-a`).

`needle` is a variable containing the ID (hash) of the orphaned commit that you want to compare against. `sort -n` performs a numeric sort and `head -10` selects only the 10 best candidates. You probably want to output the full list (maybe even before sorting) and write it to a file. Then you can sort and pick the best candidate without having to perform all the comparisons again.
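Once the unsorted list is saved, picking the winner is cheap. Here is a rough sketch, assuming each saved line starts with the total number of changed lines and that the file is still in `git rev-list` order (newest commits first). `pick_best` and `results.txt` are made-up names for illustration.

```shell
# pick_best FILE (hypothetical helper)
# Prints the line with the smallest leading number. On a tie the first such
# line wins, which is the most recent commit when FILE is in newest-first order.
pick_best() {
  awk 'NR == 1 || $1 + 0 < min { min = $1 + 0; best = $0 }
       END { if (NR) print best }' "$1"
}

# Usage: redirect the loop's output to results.txt instead of piping it
# into sort, then run:
#   pick_best results.txt
```

Scanning the file in its original order sidesteps the question of how `sort` orders lines with equal counts, which matters if you want the latest commit among ties.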

knittl