0

This morning `git fetch` took a little longer than usual because it downloaded 206 MB. (Usually it's less than 1 MB, as I fetch frequently.) I last fetched this repo a couple of days ago, and about 30 branches have been updated since then. I want to know which branch added the commit containing the large files so I can work with the developer to determine whether we should change something before it gets merged into a shared branch (which would lock the large files into the history permanently).

I know we can list large files in a repo, but in this case I'd like to see the list of large files that came in with the most recent fetch. I'm not sure if that's possible after the fetch was already done, but perhaps almost as good would be seeing all large objects that I fetched in the last X days. And if not even that, perhaps I could find all large objects in commits with committer dates in the last X days. (I'm fairly certain the last option is possible with some scripting, though it isn't quite as nice since it's possible someone recently pushed an old commit for the first time.)

Side Note: in this case I glanced at the list of branch names and was able to guess correctly which branch it was. It turns out the developer had accidentally added a commit with many image files and then, realizing the mistake, added another commit which deleted them all. They had already planned to squash those two commits before completing the PR, and simply didn't realize they should have squashed them before even pushing. My immediate need is solved for today, but the next time it happens I'd like to do better than my current answer of just guessing and checking the branches manually.

TTT
  • Does this answer your question? [How to find/identify large commits in git history?](https://stackoverflow.com/questions/10622179/how-to-find-identify-large-commits-in-git-history) – Makoto Mar 06 '23 at 19:17
  • @Makoto This question is not about large files by themselves, but about the times associated with those files. (committed time or preferably fetched time) – TTT Mar 06 '23 at 19:23
  • @Makoto also, it's kind of funny that the top answer to that question is the same top answer to the one I linked to in my question, and by the same person. They double dipped. ;) – TTT Mar 06 '23 at 19:27
  • It's a bit contrived but using the solution in that question, I can find the path of the file that's largest, then use log searching (`git log -S`) to find when the file was introduced. – Makoto Mar 06 '23 at 19:28
  • @Makoto Note that these files are not big compared to other files in the repo already. (I think there are thousands that are bigger than these already.) These files are likely big compared to files added recently though. One could take the approach of looping through all objects in the repo bigger than X, and matching it to recent commits (or even better, matching them to recently fetched). I'm thinking going the other way is better though- start with all the commits I recently fetched, and find which of those commits have large files in them? – TTT Mar 06 '23 at 19:38

2 Answers

2

To find what's been updated you can run `git reflog --remotes --date=short`¹

Then you can diff the updated ref with and without the reflog selector. So if the most recent origin/main entry looks like it could use some examining, running

git diff --name-status --diff-filter=A origin/main@{2023-02-25}..origin/main

will show you all the files added by last week's surprise Saturday pull to origin/main; tune as needed. `git log` of that range can show all the commits added by the pull, and so forth.
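To go from the added file names to their sizes, one option is to list every object in the same range and ask `git cat-file` how big each blob is. This is only a sketch: `large_blobs_in_range` is a hypothetical helper name, and the default threshold of 1,000,000 bytes is an arbitrary assumption.

```shell
#!/bin/sh
# Hypothetical helper: list blobs of at least min_bytes that are reachable
# from the new end of a revision range but not from its old end.
# Usage: large_blobs_in_range <range> [min_bytes]
large_blobs_in_range() {
    range=$1
    min_bytes=${2:-1000000}   # arbitrary default threshold (assumption)
    # rev-list prints "<sha> <path>"; cat-file passes the path through as %(rest)
    git rev-list --objects "$range" |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        awk -v min="$min_bytes" '$1 == "blob" && $2 >= min { print }' |
        sort -k2,2nr   # biggest first
}
```

For example, `large_blobs_in_range 'origin/main@{2023-02-25}..origin/main'` would print one `blob <size> <sha> <path>` line per large file. The sizes are uncompressed object sizes, so they only approximate what the fetch actually transferred.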


¹ Random note: `git reflog` without a subcommand defaults to `git reflog show`. Its docs merely hint at this, but that is in turn interpreted as `git log -g --oneline`, aka `git log --walk-reflogs --oneline`, so you can extract whatever info you like about the commits with all of `git log`'s formatting machinery. The reflog-selector format symbol is `%gd`.
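As an illustration of that formatting machinery, here is a small sketch that prints one line per remote-tracking reflog entry. The helper name `remote_reflog_summary` and the particular choice of fields are assumptions made for this example; `%gd`, `%h` and `%gs` are standard `git log` placeholders.

```shell
#!/bin/sh
# Hypothetical helper: summarize the reflogs of all remote-tracking refs.
#   %gd  reflog selector, shown with a timestamp because --date is given
#   %h   abbreviated commit hash the ref pointed to after the update
#   %gs  reflog message, e.g. "fetch: fast-forward"
# Extra arguments are passed through to git log.
remote_reflog_summary() {
    git log -g --remotes --date=iso --format='%gd %h %gs' "$@"
}
```

Something like `remote_reflog_summary | grep fetch` then narrows the list to updates made by a fetch.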

jthill
  • nice one. Where is the `--date` option of reflog documented ? I can't seem to find it in `git help reflog` – LeGEC Mar 09 '23 at 03:10
  • @LeGEC It's "Options for show - `git reflog show` accepts any of the options accepted by `git log`.". "show" is the default. – jthill Mar 09 '23 at 03:22
  • ok, and the fact that `--date` displays the date of the reflog entries in the ref name ? is there another hidden feature which allows to write `%(rd)` or something in a format string ? – LeGEC Mar 09 '23 at 03:33
  • 1
    @LeGEC `git reflog` is effectively `git log -g --oneline` aka `git log --walk-reflogs --oneline`, which gives the rules for whether a timestamp or count is shown; the format specifier is `%gd`. Yah, I have no idea when I learned this, or whether it was by trying random things or picking up details from the manual. I'll edit it in to the answer, too, thanks for the prodding. – jthill Mar 09 '23 at 04:19
  • ok, it is documented in [`git log -g`](https://git-scm.com/docs/git-log#Documentation/git-log.txt--g). I guess you could say it is mentioned in `git help reflog` with: `"see git-log(1) for more information."` :p – LeGEC Mar 09 '23 at 04:52
  • and as far as I see: there is no way to print just the date, you have to somehow extract it from the `@{...}` part of the ref name – LeGEC Mar 09 '23 at 05:08
1

One way could be:

  • guess the branches of "the last fetch",
  • use the reflog to scan the range <previous>..<now> for each of these remote branches.

The tricky part is the first point:

  • if you still have the output of your last git fetch command, you can get the list of the last updated references, and feed that into a loop which can scan the reflog:
# say ref_names.txt contains names like 'origin/master', 'origin/feature1' ...
while read -r ref; do
    # list every object reachable from the new tip but not from the previous reflog entry
    git rev-list --objects "$ref@{1}..$ref"
done < ref_names.txt
  • you may bluntly iterate over all origin/* references, and scan $ref@{1}..$ref -- this may get you too many commits to scan through, but you would be 100% sure to scan all the branches updated by your latest git fetch

  • you may use an API on your central server to spot the actions that updated a branch, say in the last two days, and scan only those branches,

  • you may dig into the log files themselves:

a reflog line looks like:

$ tail -1 .git/logs/refs/remotes/origin/master
454dfcbddf9624c129fa7600b3c774b99e36cb43 d15644fe0226af7ffc874572d968598564a230dd User Name <user@email.com> 1678166909 +0400   fetch: fast-forward

The timestamp after the email is the time that ref was updated in your repo, so it roughly matches the time of the last `git fetch` that updated this particular remote branch.

Oddly, I haven't found a way to print that value in a formatted way with git log format flags -- not in the documentation at least.

You may still use that information (e.g., go through all the log files, look for lines mentioning fetch or pull, and keep the highest timestamp) to guess after the fact when your last `git fetch` occurred, and filter the branches that got updated based on this information.
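A rough sketch of that log-file scan follows. It assumes the reflogs are stored as plain files under the git directory's `logs/` folder, with the line layout shown above and a tab in front of the message; `recent_fetch_updates` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print "<timestamp> <old-sha> <new-sha> <ref>" for every
# fetch/pull entry in the remote-tracking reflog files, newest first.
# Assumes reflogs are plain files under <git-dir>/logs/refs/remotes.
recent_fetch_updates() {
    logs_dir=$(git rev-parse --git-dir)/logs/refs/remotes
    find "$logs_dir" -type f | while read -r log; do
        ref=${log#"$logs_dir"/}          # e.g. origin/master
        # Before the tab: "<old> <new> <name> <email> <time> <tz>".
        # The name may contain spaces, so count fields from the end.
        awk -F'\t' -v ref="$ref" '
            $2 ~ /^(fetch|pull)/ {
                n = split($1, f, " ")
                print f[n-1], f[1], f[2], ref
            }' "$log"
    done | sort -k1,1nr
}
```

The first output line carries the most recent fetch timestamp, and the lines sharing (roughly) that timestamp are the refs the last fetch updated. Each line's old and new SHA can be fed to `git rev-list --objects old..new`. For a branch fetched for the first time the old SHA is all zeros, so that case needs separate handling.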

LeGEC