247

During the first clone of a repository, git first receives the objects, and then spends about the same amount of time "resolving deltas". What's actually happening during this phase of the clone?

Rob Bednark
Nik Reiman

3 Answers

152

The stages of git clone are:

  1. Receive a "pack" file of all the objects in the repo database
  2. Create an index file for the received pack
  3. Check out the head revision (for a non-bare repo, obviously)

"Resolving deltas" is the message shown for the second stage, indexing the pack file ("git index-pack").

Pack files do not have the actual object IDs in them, only the object content. So to determine what the object IDs are, git has to do a decompress+SHA1 of each object in the pack to produce the object ID, which is then written into the index file.
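To make that concrete, here is a minimal Python sketch of the naming step (the function is mine, not git's code): an object's ID is the SHA1 of a short header plus the decompressed content, which is why index-pack has to inflate every object before it can name it.

    import hashlib

    def object_id(obj_type: bytes, content: bytes) -> str:
        # Git hashes a header of the form "<type> <size>\0" followed by the raw,
        # already-decompressed content; the hex digest is the object ID.
        header = obj_type + b" " + str(len(content)).encode() + b"\0"
        return hashlib.sha1(header + content).hexdigest()

    # Prints the same ID that `git hash-object` would report for this content.
    print(object_id(b"blob", b"hello world\n"))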

An object in a pack file may be stored as a delta, i.e. a sequence of changes to apply to some other object. In this case, git needs to retrieve the base object, apply the delta commands and SHA1 the result. The base object itself might have to be derived by applying a sequence of delta commands. (Even though in a clone every base object will have been received already, there is a limit to how many reconstructed objects are cached in memory, so some bases may have to be rebuilt more than once.)
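As a deliberately simplified Python sketch of what "applying the commands" means (the tuple-based instruction format here is invented for illustration; git's real delta encoding packs the same two kinds of instruction, copy-from-base and insert-literal, into a compact binary form):

    def apply_delta(base: bytes, instructions) -> bytes:
        out = bytearray()
        for op in instructions:
            if op[0] == "copy":                   # copy a byte range out of the base object
                _, offset, size = op
                out += base[offset:offset + size]
            else:                                 # insert literal bytes carried in the delta itself
                _, data = op
                out += data
        return bytes(out)

    base = b"The quick brown fox jumps over the lazy dog\n"
    delta = [("copy", 0, 20), ("insert", b"leaps"), ("copy", 25, 19)]
    print(apply_delta(base, delta))               # b'The quick brown fox leaps over the lazy dog\n'

The SHA1 is then taken over the reconstructed bytes, exactly as for a non-delta object.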

In summary, the "resolving deltas" stage involves decompressing and checksumming the entire repo database, which not surprisingly takes quite a long time. Presumably decompressing and calculating SHA1s actually takes more time than applying the delta commands.

In the case of a subsequent fetch, the received pack file may contain references (as delta object bases) to other objects that the receiving git is expected to already have. In this case, the receiving git actually rewrites the received pack file to include any such referenced objects, so that any stored pack file is self-sufficient. This might be where the message "resolving deltas" originated.
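A rough sketch of that rewrite, with an invented in-memory layout purely for illustration (in real git the completion is done by `git index-pack --fix-thin`): any delta whose base is referenced by ID but missing from the pack gets the base appended from the local object store.

    def complete_thin_pack(pack, local_objects):
        # Append any delta base that the pack references by ID but does not
        # contain, so the stored pack file is self-sufficient.
        ids_in_pack = {obj["id"] for obj in pack}
        appended = []
        for obj in pack:
            base_id = obj.get("delta_base")              # deltas that name their base by object ID
            if base_id and base_id not in ids_in_pack:
                appended.append(local_objects[base_id])  # pull the base from the local store
                ids_in_pack.add(base_id)
        return pack + appended

    local_objects = {"df85b51": {"id": "df85b51", "data": b"base content"}}
    thin_pack = [{"id": "103fa49", "delta_base": "df85b51", "data": b"<delta instructions>"}]
    print([o["id"] for o in complete_thin_pack(thin_pack, local_objects)])   # ['103fa49', 'df85b51']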

araqnid
  • Is this delta compression more than storing multiple objects in one zlib data stream? – fuz Mar 04 '15 at 23:33
  • 1
    @FUZxxl yes, it's using an algorithm like diff or xdelta to compare two blobs and produce an edit script – araqnid Mar 05 '15 at 22:37
  • 1
    @brooksbp: Only with limitations. Because the object with id 103fa49 might need df85b51 to be decoded, but when you receive 103fa49, df85b51 is not there yet (pack files are strictly ordered by sha1 hashes). So, for everything that references only stuff that's already there, things are easy, but for everything else, you'll have to wait until it's received. And this delta compression can be nested, so 103fa49 may need 4e9ba42 which in turn needs 29ad945 which in turn needs c9e645a ... you get the picture. [yes, I noticed it's been >4 years ;)] – Bodo Thiesen Jun 24 '17 at 17:55
  • 2
    @brooksbp: Turns out, I was wrong, the pack file does NOT need to be sorted by sha1 hashes. Also, when writing, git writes needed objects prior to objects needing them. So, actually you should be able to parallelize it. Only disadvantage that remains: Because you don't know which objects you will need later, you'll have to recreate some over and over again. See here: https://www.kernel.org/pub/software/scm/git/docs/technical/pack-heuristics.txt – Bodo Thiesen Jun 24 '17 at 21:39
  • 2
    @BodoThiesen that is a rather entertaining read. Most of it went over my head, but I gained a new favorite spoonerism: "In one swell-foop..." – cambunctious Sep 24 '19 at 18:53
63

Git uses delta encoding to store some of the objects in packfiles. However, you don't want to have to play back every single change ever made to a given file in order to get the current version, so Git also stores occasional full snapshots of the file contents. "Resolving deltas" is the step that deals with making sure all of that stays consistent.

Here's a chapter from the "Git Internals" section of the Pro Git book, which is available online, that talks about this.

jthill
Amber
  • 109
    This answer is incorrect. It seems to describe how Mercurial works, not Git. It is coming up in Google searches for this issue so I feel the need to reply. Git does *not* store the differences between commits as deltas; Git is a "whole object" store. As such, Git does not need "snapshots" to show any given file because file history does not need to be reconstructed from deltas. That is how Mercurial works. – nexus Jan 15 '13 at 05:53
  • 13
    The only place where delta encoding comes into play is in the pack file which is strictly for compression and transfer -- it doesn't alter how Git "sees" the world. (http://kernel.org/pub/software/scm/git/docs/v1.6.2.3/technical/pack-heuristics.txt) Please see araqnid's answer below for an accurate response. – nexus Jan 15 '13 at 05:56
  • 6
    All "snapshot" means in this context is a full copy of a file state, rather than a delta-encoded version. As you mentioned, Git *does* use delta-encoding in packfiles. No one said that it "alters how Git sees the world"; please stop projecting your own assumptions. – Amber Jan 16 '13 at 17:04
  • 2
    Your answer is still inaccurate. "Git also has occasional snapshots of the file contents stored as well." -- that is not correct. "'Resolving deltas' is the step that deals with making sure all of that stays consistent." -- that is also not correct, araqnid's response below is correct. – nexus Jan 22 '13 at 21:25
  • 2
    As described in the chapter mentioned above, Git stores the full file content of the latest version always. Previous versions are stored as delta-coded files when they are "loose" files. Periodically (either by calling `git gc` or whenever Git determines it necessary) Git will compress all the "loose" files into a packfile to save space and an index file into that packfile will be created. So zlib will compress with its own delta algorithm but Git does use delta-encoding to store prior versions. Since the most common and frequent access is the latest version, that is stored as a snapshot. – BrionS Apr 24 '13 at 20:08
  • 1
    @nexussays You're simply wrong. Mercurial models and stores changes as diffs between successive versions, Git stores snapshots of each version, that much is true, but Git delta-encodes its snapshot packs. It's not locked into Mercurial's straitjacket. Git's packfile deltas are drawn from any and all likely-looking stored candidates, they've got nothing to do with inter-revision diffs, which are rebuilt on the fly from reconstituted full snapshots. – jthill Dec 02 '19 at 20:39
2

Amber seems to be describing the object model that Mercurial or similar uses. Git does not store the deltas between subsequent versions of an object, but rather full snapshots of the object, every time. It then compresses these snapshots using delta compression, trying to find good deltas to use, regardless of where in the history these exist.

Johan
  • 7
    Actually, whilst Git can store loose objects, they are not necessarily always stored as such, as the loose objects can be deleted and replaced with packed content. I don't think Amber's answer said anything anywhere about subsequent versions. – AlBlue Sep 19 '11 at 00:15