Git find modified files since <ref> from a shallow clone

Question

I'm on a CI box running tests. To speed it up, I'm just doing a shallow clone:

git clone --depth 1 git@github.com:JoshCheek/some_repo.git

Assuming all the tests pass, I want to trigger the next step in the pipeline. What to trigger is based on which files changed between the last deployment (ref d123456) and the current ref I just tested (ref c123456). If I had done a normal clone, I could find out like this:

git diff --name-only d123456 c123456

But my clone is shallow, so it doesn't know about those commits. I see that I can use git fetch --depth=n to get more of the history, but I only know the SHA, not the depth of the SHA. Here's a set of ways that could presumably answer this question:

# hypothetical remote diff
git diff --name-only origin/d123456 origin/c123456

# hypothetical ref based fetch
git fetch --shallow-through d123456
git diff --name-only d123456 c123456

# hypothetical way to find the depth I need
depth=`git remote depth-to d123456`
git fetch --depth "$depth"
git diff --name-only d123456 c123456

Otherwise it seems like I might have to write a loop and keep invoking --deepen until my history contains the commit. That seems painful (meaning annoying to write / maintain) and expensive (meaning slow, remember that the purpose of the shallow clone is to reduce this cost).

score 3 · Accepted Answer · answered May 04 '17 at 23:13

Otherwise it seems like I might have to write a loop and keep invoking --deepen until my history contains the commit. That seems painful ...

It is painful (and slow, as you note a bit later).
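For comparison, here is a minimal sketch of the loop being described, assuming the remote is named origin and the clone was made with --depth. The function name deepen_until, the step size, and the round limit are hypothetical choices, not anything Git provides.

```shell
#!/bin/sh
# Sketch only: deepen_until and its defaults are hypothetical, not part of Git.
# Usage: deepen_until <commit> [step] [max_rounds]
deepen_until() {
  target=$1
  step=${2:-50}   # commits of extra history requested per fetch
  max=${3:-20}    # stop after this many round trips
  rounds=0
  # cat-file -e succeeds only once the commit object exists locally.
  until git cat-file -e "${target}^{commit}" 2>/dev/null; do
    # Nothing left to deepen once the clone is no longer shallow.
    [ -f "$(git rev-parse --git-dir)/shallow" ] || return 1
    [ "$rounds" -lt "$max" ] || return 1
    # --deepen extends history from the current shallow boundary (Git 2.11+).
    git fetch --quiet --deepen="$step" origin || return 1
    rounds=$((rounds + 1))
  done
}
```

Something like `deepen_until d123456 && git diff --name-only d123456 c123456` would then run the diff once the older commit has arrived. Every round is a separate trip to the server, which is where the slowness comes from.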

Modern Git (since version 2.11) does have a new git fetch option:

--shallow-exclude=<revision>

Deepen or shorten the history of a shallow repository to exclude commits reachable from a specified remote branch or tag. This option can be specified multiple times.

I have not tried this; it's not clear if it allows a hash ID (the tests use names) and in any case you would specify the parent(s) of the commit you want to deepen through, rather than the commit you want to obtain. But it might suffice.

(I really think a better method is to keep reference clones you can borrow from.)

torek

Oh, nice catch, I totally missed that! Sadly, it seems Github doesn't support it, when I try it says `fatal: Server does not support --shallow-exclude` :( – Joshua Cheek May 04 '17 at 23:40
Can you go into more detail on that last note? It's not clear to me what a reference clone is (are you saying a fully cloned repo that is cached on the CI server?). – Joshua Cheek May 04 '17 at 23:40
Yes: with a reference clone, you run `git clone --reference <path> [options] <url>` and Git calls up the other Git at the URL as usual, but then borrows or copies (see `--dissociate`) the objects from the reference clone rather than copying them across the network. Measured on a real project, I trimmed the clone wall-clock time from nearly two hours to just a few minutes by using reference clones. (This involved a number of fairly large repositories.) – torek May 04 '17 at 23:56
Thanks, I'm going to try keeping a cached clone, so far it's looking promising. – Joshua Cheek May 08 '17 at 21:34
2021 update: Github supports shallow-exclude now, but you can't use a commit hash as an argument (you get the error "the remote end hung up unexpectedly" if you try) – Alice Purcell Feb 11 '21 at 12:54

ElpieKay · Answer 2 · 2017-05-05T04:08:05.717

There are several possible solutions to reduce the clone time and space besides shallow-clone.

1. git clone <url> -b <branch> --single-branch

This fetches only the data reachable from <branch>. It is not as effective as --depth=1 but still better than a full clone. It works well when the repo has many diverged branches.

2. git init; git fetch <url> <tag>

Similarly, this fetches only the data reachable from <tag>.

3. Create and use a mirror repo.

Run git clone <url> --mirror -- /foo/mirror to create the mirror repo at /foo/mirror. Suppose your CI system starts multiple instances simultaneously. Clone each one via git clone <url> --reference=/foo/mirror -- <instanceN>. In each clone, only the data that cannot be found in the mirror repo is downloaded from the remote repo. You can delete an instance to save space when its job is done, but keep the mirror repo and update it regularly with git fetch, at a frequency that matches how often the remote repo changes: once a day at midnight, or once a week on Sunday, for example.

4. Use git worktree.

Make a clone, keep it, and update it first whenever a CI instance starts. Use git worktree to check out revisions into a separate working tree for each instance.
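As a rough sketch of option 4, assuming a long-lived full clone already exists on the CI box and its remote is named origin: the function names and argument order below are hypothetical, and the worktree path should be absolute because git -C changes directory first.

```shell
#!/bin/sh
# Sketch only: the function names and argument order are hypothetical.
# Usage: add_job_worktree <cache_clone> <rev> <absolute_worktree_path>
add_job_worktree() {
  cache=$1 rev=$2 dir=$3
  # Bring the long-lived clone up to date before each job.
  git -C "$cache" fetch --quiet origin || return 1
  # --detach checks out the revision without creating or moving a branch.
  git -C "$cache" worktree add --detach "$dir" "$rev"
}

# Usage: remove_job_worktree <cache_clone> <absolute_worktree_path>
remove_job_worktree() {
  # "git worktree remove" needs Git 2.17 or later.
  git -C "$1" worktree remove --force "$2"
}
```

Because every worktree shares the cached clone's full history, git diff --name-only d123456 c123456 works inside any of them without further fetching.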

Thanks for the ideas! I'm ultimately going with a cached clone. Might need to use some of the ideas from the others here, if its state gets wonky, but I'm hopeful for now. — Joshua Cheek, May 08 '17 at 21:36
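The cached-clone approach (option 3 above) might be sketched roughly like this. The function names are hypothetical, and the mirror path and job directory are whatever the CI box uses; /foo/mirror is just the path from the answer.

```shell
#!/bin/sh
# Sketch only: function names and argument order are hypothetical.
# Usage: update_mirror <url> <mirror_dir>
update_mirror() {
  url=$1 mirror=$2
  if [ -d "$mirror" ]; then
    # A mirror clone maps every remote ref, so a plain fetch refreshes it all.
    git -C "$mirror" fetch --quiet --prune origin
  else
    git clone --quiet --mirror "$url" "$mirror"
  fi
}

# Usage: clone_with_reference <url> <mirror_dir> <dest>
clone_with_reference() {
  # Objects already in the mirror are borrowed instead of downloaded.
  # The new clone depends on the mirror unless --dissociate is added.
  git clone --quiet --reference "$2" "$1" "$3"
}
```

The resulting clone has full history, so git diff --name-only d123456 c123456 works in it directly. Without --dissociate, the job clone keeps pointing at the mirror's objects, which is fine as long as the clone is deleted when the job finishes and the mirror is not pruned underneath it.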

score 0 · Answer 3 · answered Nov 29 '18 at 17:08

I hit the same problem and used this

git clone --shallow-since=<date>

I had to store not only the SHA of my last deployment but also its date, but otherwise it worked great.
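A minimal sketch of how that could fit together, assuming the stored date is a little earlier than the deployed commit's commit date so that commit is still inside the shallow history. The function name and argument order are hypothetical.

```shell
#!/bin/sh
# Sketch only: changed_since_deploy is a hypothetical helper, not a Git command.
# Usage: changed_since_deploy <url> <date> <deploy_sha> <dest>
changed_since_deploy() {
  url=$1 since=$2 deploy_sha=$3 dest=$4
  # Fetch only commits newer than the given date (Git 2.11+ on both ends).
  git clone --quiet --shallow-since="$since" "$url" "$dest" || return 1
  # Fail early if the date was too late to include the deployed commit.
  git -C "$dest" cat-file -e "${deploy_sha}^{commit}" || return 1
  # List the files that differ between the deployment and the tested tip.
  git -C "$dest" diff --name-only "$deploy_sha" HEAD
}
```

The check in the middle matters because --shallow-since filters by date, not by ancestry: if the stored date is later than the deployed commit's date, the clone succeeds but the diff has nothing to compare against.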

Robert Antonucci