
Suppose you have a new Git repository, add a file README.MD, and write

foo

to the file and commit this file for the first time. As I understand it, Git creates three new objects when committing: a blob, a tree and a commit. The commit object references a tree, which in turn references other trees or blobs.

Suppose you do a second commit, adding bar to the README.MD, so that the file looks like this:

foo
bar

and commit this file. A new blob is created for that commit. Does that new blob hold

foo
bar

or only the last change:

bar

?

David
  • [knittl has the right answer](https://stackoverflow.com/a/61279157/1256452), but I wonder: why do you care? If it's because the blobs take up a lot of space, note that objects are eventually *packed* into a single pack file. At this point, Git employs delta compression. – torek Apr 17 '20 at 19:50

2 Answers


A blob contains the full content of the file. Blobs are stored zlib-compressed and, when extracted, consist of the literal characters "blob", a single space, the blob's length represented in ASCII, a single null byte, and finally the file's content.

You can try it out: git cat-file blob <blob-hash>

… or, if you don't trust git cat-file to only print the blob's content and nothing else, you can extract a blob's content directly from the command line, e.g.

$ printf 'A' > file
$ git add file
$ xxd .git/objects/8c/7e5a667f1b771847fe88c01c3de34413a1b220
00000000: 7801 4bca c94f 5230 6470 0400 0be4 0232   x.K..OR0dp.....2
$ pigz -d - < .git/objects/8c/7e5a667f1b771847fe88c01c3de34413a1b220 | xxd
00000000: 626c 6f62 2031 0041   blob␣1␀A
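
The header-plus-content layout can also be checked without Git at all, because an object's ID is the SHA-1 of exactly those bytes. A minimal sketch, assuming `sha1sum` (GNU coreutils) is available and the repository uses Git's default SHA-1 object format; the variable names are only for illustration:

```shell
# Object ID of a blob = SHA-1 over "blob <size>\0<content>".
# Assumption: sha1sum (GNU coreutils) and Git's default SHA-1 object format.

# Same one-byte content as in the transcript above.
hash_a=$(printf 'blob 1\000A' | sha1sum | cut -d' ' -f1)
echo "$hash_a"    # should match the object path .git/objects/8c/7e5a66...

# The two README.MD versions from the question: 4 and 8 bytes of content.
hash_v1=$(printf 'blob 4\000foo\n' | sha1sum | cut -d' ' -f1)
hash_v2=$(printf 'blob 8\000foo\nbar\n' | sha1sum | cut -d' ' -f1)
echo "$hash_v1"
echo "$hash_v2"
```

The second version hashes over all eight bytes (`foo`, newline, `bar`, newline), so it is a separate object that holds the whole file, not just the added line.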

Git also employs something called "pack files" which pack multiple objects (blobs, trees, commits) together and delta compresses them. There are heuristics involved to bring more similar objects closer together so that they can be delta-compressed more efficiently. This happens transparently at the storage level. Conceptually, a blob still contains the full content of a file.
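
To see that packing does not change what a blob contains, you can pack a throwaway repository and read the blobs back. This is only a sketch: it assumes `git` is on the PATH, and the temporary directory and the `demo` identity are placeholders. Files this small may well not be stored as deltas, so the `verify-pack` listing is there just to show where that information would appear.

```shell
# Sketch only: throwaway repository; the path and identity are placeholders.
repo=$(mktemp -d)
git init -q "$repo"

printf 'foo\n' > "$repo/README.MD"
git -C "$repo" add README.MD
git -C "$repo" -c user.name=demo -c user.email=demo@example.com commit -q -m first

printf 'foo\nbar\n' > "$repo/README.MD"
git -C "$repo" add README.MD
git -C "$repo" -c user.name=demo -c user.email=demo@example.com commit -q -m second

# Move the loose objects into a pack file, then list what the pack holds.
git -C "$repo" gc -q
git verify-pack -v "$repo"/.git/objects/pack/*.idx

# Packed or not, each blob reads back as the complete file.
first=$(git -C "$repo" cat-file blob HEAD~1:README.MD)
second=$(git -C "$repo" cat-file blob HEAD:README.MD)
printf '%s\n' "$second"
```

Whether a given blob ends up as a delta is a storage decision made by the packing heuristics; the content returned by `git cat-file` is the same either way.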

knittl
  • This is not quite true. It is only true some of the time. See [this answer](https://stackoverflow.com/a/61279676/8910547) – Inigo Apr 17 '20 at 19:58
  • @Inigo the question was about a "blob". A "blob" in Git always describes the full content of a file (blob header, space, content length, null byte, content). Git _pack files_ are a different level of abstraction and don't affect "blob"s. Pack files are a storage mechanism that compresses related Git objects to save space and provide efficient access. Somewhat comparable to TCP, TLS and HTTP. A TCP connection might send and receive multiple HTTP requests and responses, while TLS encrypts the transport layer. All of them can be used together. – knittl Apr 17 '20 at 20:23
  • Please see the second quote from the official *Pro Git* book in my answer, in particular the highlighted part. Also, it's clearly implied in the question that the interest is in the underlying storage of a blob, not the abstract definition of a blob (or, to use your analogy, the TCP layer, not the HTTP layer). – Inigo Apr 17 '20 at 20:49

The correct answer is (c): recent blobs contain the full (compressed) content of the file they represent, but older versions of the file can be moved into packfiles, where they may be stored as deltas. From the *Pro Git* book:

When Git packs objects, it looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next. You can look into the packfile and see what Git did to save space. The git verify-pack plumbing command allows you to see what was packed up:

A little later, explaining a detailed example:

Here, the 033b4 blob, which if you remember was the first version of your repo.rb file, is referencing the b042a blob, which was the second version of the file. The third column in the output is the size of the object in the pack, so you can see that b042a takes up 22K of the file, but that 033b4 only takes up 9 bytes. What is also interesting is that the second version of the file is the one that is stored intact, whereas the original version is stored as a delta — this is because you’re most likely to need faster access to the most recent version of the file.

There is no way Git would be able to handle long histories if it always stored each version of each file separately and in full. It would never have succeeded.

git cat-file, like all the other Git commands that operate on the contents of a file, transparently extracts packed objects, so you don't even have to know this is happening.

See also How does git store files?

Inigo