Why does git write objects to remote before the remote resolves deltas

Question

The short question: I find that git writes blob objects to remote before the remote resolves deltas during git push --force, even if the same blob objects were written to the same remote repository a short time ago.

I want to ask:

Why does git write the static blob objects to the remote even though the remote already has them?
Is it possible to stop git from doing this (on the client side or the server side)?

The longer story:

I have a repository that contains both static files and code, and I manage them differently.

All the code files are in branch "history", and all the static files are in branch "static". The two branches share a common initial commit, and they merge to form branch "master", as illustrated below:

*   commit (HEAD -> master, origin/master)
|\  Merge:
| |     Merge branch 'static'
| * commit (static)
* | commit (origin/history, history)
* | commit
* | commit
|/
* Initial commit

Each time there is a code update, I commit the change in branch "master", then rebase the commit onto branch "history", then check out branch "history" and merge branch "static" again. During this process, branch "history" (fast-forward) and "master" (force update) are pushed to the remote:

git rebase --onto history origin/master master
commit=$(git rev-parse HEAD)
git checkout history
git reset --hard "$commit"

git push

git checkout master
git reset --hard history
git merge -m "Merge branch 'static'" static

git push --force

These commands run quickly, because they do not transfer the static files, which include large files, to the remote.

When there is a change in static files, I checkout the branch "static", commit the change using --amend flag, then checkout branch "history" and merge branch "static", force update branch "master" on remote at the end of the process:

git checkout static
git add .
git commit --amend -m 'Add static files'

# As torek pointed out, I made a mistake in this post
# The following "git push" command is not performed
# git push

git checkout master
git reset --hard history
git merge -m "Merge branch 'static'" static

# git push --force
git push --force origin static master

# torek's suggestion "What you can do about this, part 1"
# does not work out for me:
#
# $ git push --force origin static master
# Counting objects: 422, done.
# Compressing objects: 100% (407/407), done.
# Writing objects: 100% (422/422), 480.08 MiB | 1.05 MiB/s, done.
# Total 422 (delta 41), reused 0 (delta 0)
# remote: Resolving deltas: 100% (41/41), completed with 1 local object.
# To ...
#  + 3539524...6618427 master -> master (forced update)
#  + 6a1f0c0...ba60bb9 static -> static (forced update)

The last command, however, takes a long time to complete, and I find git writes all the static blob objects to remote before the remote resolves the deltas:

Counting objects: 422, done.
Compressing objects: 100% (407/407), done.
Writing objects: 100% (422/422), 480.08 MiB | 1.44 MiB/s, done.
Total 422 (delta 41), reused 0 (delta 0)
remote: Resolving deltas: 100% (41/41), completed with 1 local object.

This happens even if the commands are executed the second time, and no modification to the work tree is performed between the 1st and the 2nd execution.

I used the script in how-does-gits-transfer-protocol-work to list all the objects in the local repository before and after the 2nd execution. The results show only 2 new objects after the 2nd execution: the commit objects produced by git commit --amend -m 'Add static files' and git merge -m "Merge branch 'static'" static. This means no new blob objects are created.
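As a rough sketch of that before/after comparison (this is not the script from the linked post), the object list can be snapshotted with git cat-file and diffed with comm. The helper names snapshot_objects and new_objects are made up for this illustration, and the --batch-all-objects option is assumed to be available in the installed git:

```shell
#!/bin/bash
# Hypothetical helpers, named for illustration only.

# Print "<hash> <type>" for every object in the current repository
# (loose and packed), sorted so that comm can compare two snapshots.
snapshot_objects() {
    git cat-file --batch-all-objects \
        --batch-check='%(objectname) %(objecttype)' | LC_ALL=C sort
}

# Print the objects present in snapshot file $2 but not in snapshot file $1.
new_objects() {
    LC_ALL=C comm -13 "$1" "$2"
}
```

Running snapshot_objects > before.txt, then the workflow, then snapshot_objects > after.txt and new_objects before.txt after.txt should list only the objects created in between.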

Extra Information:

Here is the script that looks after the workflow:

#!/bin/bash

master_branch=master
master_origin=origin/master
history_branch=history
static_branch=static


make_master() {
    git checkout $master_branch
    git reset --hard $history_branch
    git merge -m "Merge branch 'static'" $static_branch
}


extend_history() {
    git rebase --onto $history_branch $master_origin $master_branch
    local commit=$(git rev-parse HEAD)
    git checkout $history_branch
    git reset --hard $commit
}


add_static() {
    git checkout $static_branch
    git add .
    git commit --amend -m 'Add static files'
}


case "$1" in
code)
    extend_history
    git push
    make_master
    git push --force
;;
asset)
    add_static
    make_master
    # git push --force
    git push --force origin $static_branch $master_branch
;;
*)
    echo "Unknown action \"$1\"" >&2
    exit 127
esac

client git version: 2.17.1
client os: 18.04.2 LTS (Bionic Beaver), x86_64, VM inside virtualbox 5.2.26
server git version: 2.11.0
server os: Debian GNU/Linux 9 (stretch), x86_64

The local repository was a directory on the client os's local disk, later moved to a directory inside a virtualbox shared folder, with the same result.

Edit: After all the trouble I have been through, I decided to take torek's second piece of advice and stop rewriting history. If the outdated static files take up too much space, I will still need to squash the commits, so I moved all the code files into a subtree and manage them from there:

git checkout static
git subtree add -P code history
git checkout master
git reset --hard static
# Remove branch static and history, their tracking branches,
# and their counterparts in remote repository

To squash static commits:

code_commit=$(git subtree split -P code)
git rm --quiet -r code
git checkout --orphan new
git commit --quiet -m 'Add static files'
git branch -M new master
git subtree add -P code $code_commit

torek · Accepted Answer · 2019-03-14T06:30:46.133

Git could, indeed, do the right thing: it could ask the server Do you have blob H? for some hash H, and if the server already has it, avoid sending it again.

Git doesn't actually do that for a good reason, though. Well, "good" by some measures, anyway. What Git does is ask the server if it has specific commits. It then makes some reasonable, but not necessarily 100% accurate, assumptions based on the results. This will sometimes mean sending an object unnecessarily. And, not entirely incidentally, your code that achieves the pushes does not do what you claim that it does in your explanation before that code. (This is, I think, the source of the problem, but I have not tested that.)

Still, there are some things you can do. Let's take a look at what Git is doing, first.

Details

When there is a change in static files, I checkout the branch "static", commit the change using --amend flag, then checkout branch "history" and merge branch "static", force update branch "master" on remote at the end of the process:
git checkout static
git add .
git commit --amend -m 'Add static files'

At this point, in your own repository, you have:

       R    [static@{1}]
      /
...--o--S   <-- static

(though actually the ... section is empty, and o is commit A below).

Commit R is the one that used to be at the tip of static; it's been shoved aside, with S as the new tip of static. Both commits do exist in your own repository.

git push

You're not doing this step. Thus, the server does not yet have commit S. (Look at the code for case asset, which runs add_static, then make_master, then git push --force. The make_master step sets the current branch to master, so git push --force pushes only master. That's why the git log --graph output does not show origin/static.) If you were doing it you would need to git push --force here.
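Whether the server has a given branch can also be checked directly, rather than inferred from the git log --graph output. A minimal sketch, assuming the remote is named origin; remote_has_branch is a made-up helper name, not a git command:

```shell
#!/bin/bash
# Hypothetical helper: succeeds if <remote> advertises refs/heads/<branch>.
# git ls-remote --exit-code exits with a non-zero status when no ref matches.
remote_has_branch() {
    git ls-remote --exit-code "$1" "refs/heads/$2" >/dev/null 2>&1
}
```

For example, remote_has_branch origin static || echo "origin has no static branch" would report the situation described here.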

We now proceed to:

git checkout master
git reset --hard history
git merge -m "Merge branch 'static'" static

git push --force

Let's draw this graph as well, including the shoved-aside previous master@{2} (it's @{2} because we have two intervening events: the reset, then the merge). This graph, reflecting what's in your repository, looks like this:

  R--------M   <-- origin/master, master@{2}
 /        /
A--o--o--L   <-- history, origin/history, master@{1}
 \        \
  S--------N   <-- master

(commit R has the static@{1} label, and S has static and origin/static; I am not including these labels in the drawing for space reasons).

The server, meanwhile, has this:

  R--------M   <-- master
 /        /
A--o--o--L   <-- history

This is where things get interesting. The client must now determine which objects to send. It does so by initiating a conversation with the server. It starts with: I'd like to send you N; do you have N? Of course, the server does not have commit N since you just made it.

Since the server says no, the client says: Then I need you to have N's parents L and S; do you have those? Of course, they do have L, but not S. The client now knows to send N and S, and that the server has all objects associated with L—and, since the history on the server isn't shallow, all objects that are in the chain reaching from L back to A.

The client now asks if the server has S's parent A, or assumes that it does because A is an ancestor of L; either way it winds up realizing that the server does have A.

The client now makes the assumption that the server has all objects that are in all the commits the server has mentioned. It makes no assumption that commit R exists on the server, as there was no mention of R in the have/want protocol exchanges. So it packages all the objects that are in S, and sends them. The server repacks this, discovers that most of the blobs are redundant, and effectively ignores the redundant blobs.
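The set of objects the client considers for such a push can be approximated locally. This is only a sketch: it assumes the remote-tracking refs under refs/remotes/<remote> mirror what the server advertises, and estimate_push_objects is a made-up helper name, not a git command:

```shell
#!/bin/bash
# Hypothetical helper: list the commits, trees and blobs reachable from
# the given tips, excluding history reachable from the remote-tracking
# refs of <remote> (an approximation of what the client would pack).
estimate_push_objects() {
    local remote=$1
    shift
    git rev-list --objects "$@" --not --remotes="$remote"
}
```

Something like estimate_push_objects origin master | wc -l gives a count to compare against the "Counting objects" line that git push prints; the two need not match exactly.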

What you can do about this, part 1

One way to deal with this is to go ahead and set a label on the server corresponding to commit R (at the earlier step). That is, add a git push --force origin static, so that origin has a label static pointing to R.

Then, when sending them a new commit for master, be sure to tell them to update both static and master:

git push --force origin static master

or:

git push origin +static:static +master:master

(these mean the same thing—the plus sign on a refspec sets the force flag for that particular refspec, and in cases like this one I like the explicitness, but you can use whatever syntax you prefer).

Now the server will have:

    ...........<-- static
   .
  R--------M   <-- master
 /        /
A--o--o--L   <-- history

and will advertise the fact that its refs/heads/static denotes commit R. The client needs this information for its pre-push hook (whether or not it actually runs any pre-push hook). So when the client goes to send new commits, it will offer to send S (for updating static and because it's in the history for the updated master) and N (for updating master), but, this time it can tell that the server has R. It should be able to send just the one new blob.

(I am not sure that it will do that, but it should be easy enough to test.)
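One way to run such a test without uploading anything is to build the pack locally and measure it. The sketch below assumes that the remote-tracking refs reflect the server's refs and that excluding them approximates what git push does; estimate_push_bytes is a made-up helper name, not a git command:

```shell
#!/bin/bash
# Hypothetical helper: print the size in bytes of a thin pack holding
# everything reachable from the given tips, minus the history reachable
# from the remote-tracking refs of <remote>.
estimate_push_bytes() {
    local remote=$1
    shift
    {
        # Tips to send, one hash per line.
        git rev-parse "$@"
        # Commits the remote is assumed to have, as "^<hash>" exclusions.
        git for-each-ref --format='^%(objectname)' "refs/remotes/$remote"
    } | git pack-objects --revs --thin --stdout -q | wc -c | tr -d ' '
}
```

Comparing estimate_push_bytes origin static master before and after pushing a label for the old static commit would show whether the extra ref shrinks the pack.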

Note that it's important that you do both of these pushes together, because as soon as the server accepts S as its static and N as its master, it will garbage-collect both M and R. (Servers normally do not have reflogs enabled, and all of these objects are in pack files and hence not subject to the 14-day grace period for loose objects.)

What you can do about this, part 2

Another option is to stop rewriting history at all. You might not like this option because your static-assets-objects will accumulate over time, inflating the repository size. But that would also completely remove the problem since now the client will understand the server's history properly.

In a sense, it's the history rewrite that is causing problems: the client makes the assumption that the server does not have any of the static-assets-objects because each new commit on that branch is totally unrelated to anything except root commit A. This assumption is "safe" in that it just results in sending extra objects. It saves a lot of time because enumerating all the tree and blob objects behind every commit is very slow—it's a lot faster to just say: Aha, the server has this commit, so—except for complications introduced by shallow grafts, which we'll ignore here—it has all the objects implied by having this commit and its history. The client hardly has to offer any hash IDs, as the server soon responds with Yes I have that one already, and that terminates the traversal of that portion of the graph. If the server has L, it has everything before L too. If it has R, it has everything before R.

Well, I should amend that a bit: it would save a lot of time, except for the fact that you're rewriting history so that the client never asks about R. A complete enumeration of all objects, while slow, might be faster than re-sending most objects from commit R. It would certainly save some bandwidth. But for most normal situations, and for Git histories that do not do a lot of rewrites, it's faster to do this the way Git enumerates commits and just assumes things about the trees and blobs behind those commits.

Apart from "What you can do about this, part 1", which did not work out for me, I totally agree with the rest: the problem is caused by overwriting history, git cannot be blamed for not being optimized for this, and, as you pointed out, **there is no "git push" when updating static files**. I've edited my post to fix this. — coinfaces, Mar 14 '19 at 11:36
@coinfaces Ah, it's unfortunate that pushing both commits together (under two different names) doesn't produce the desired effect. — torek, Mar 14 '19 at 15:12
