Can Git store file containers as trees and blobs?

Question

Git is a content-addressable file system, and it features three types of objects: blobs, trees and commits. In principle, a container file format like ZIP can be seen as a similar concept: a single file (or a link) that contains a tree, in Git terms. However, ZIP files and other types of containers get no special handling in Git; they are simply stored as blobs.

For example, let's say I have a ZIP file containing a few files with their timestamps (which Git does not track) and some empty directories, and that keeping this ZIP container in a Git repository is considered necessary (possibly precompiled JAR files, frequently edited OpenOffice documents, etc.). Now suppose the container is modified slightly. From Git's perspective this creates an entirely different blob, so the repository grows drastically as the container is modified again and again. I came across an interesting clean/smudge filter that does a similar thing, but on smudge it overwrites the original ZIP, erasing the original entry timestamps, possibly the ZIP comment, and whatever else ZIP containers can carry. As far as I understand, it also makes bare repositories hard to use, because they do not contain the ZIP containers themselves, which are only recreated on checkout. That filter is therefore of little interest to me.
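
To illustrate why a slightly modified container becomes a completely new blob, here is a minimal Python sketch of how Git derives a blob's ID. It assumes a repository using the default SHA-1 object format; the function name `git_blob_id` and the sample bytes are made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID Git assigns to a blob holding `content`."""
    # A blob is hashed as: "blob <size in bytes>\0" followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Placeholder bytes standing in for a real archive (an assumption for the demo).
original = b"PK\x03\x04 pretend this is a ZIP archive"
modified = original + b"!"

# A one-byte change yields an unrelated ID, so Git sees a brand-new blob.
print(git_blob_id(original))
print(git_blob_id(modified))
```

Because the ID covers the whole file, Git has no notion of which entries inside the container changed.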

I'm wondering: is it possible to tell Git to store (possibly ZIP) containers internally as first-class Git citizens, like trees and blobs? I guess it does not support such a case, though.

Update 1

I was wrong: as people say below, there are four object types in Git, and I missed tag objects. I thought tags were built on top of commits, like notes (probably) are.

Git does not support this out of the box, so no, there is no current way to just *tell* git to do this. — Lasse V. Karlsen, Oct 24 '19 at 07:05
Side note: there's a fourth object type, *tag* (or annotated tag). This is distinct from a `refs/tags/` name-space reference, which is also called a tag. — torek, Oct 24 '19 at 07:17
I think you may consider some git extension which will do something like the following: whenever you get a ZIP file, you 1) store it as a blob on the master branch; 2) create a branch named after the ZIP file; 3) extract the files from the ZIP to that branch; 4) update a meta file in the root of master which maps ZIP files to branches. So, it will be something like a `git zip ...` namespace of commands. — 0andriy, Oct 24 '19 at 07:44
@torek I believed even annotated tags are built on top of commits. Thanks! — terrorrussia-keeps-killing, Oct 25 '19 at 07:29
I'm not in-the-know for Java, but aren't `*.jar` files the result of "_compiling_"? and as such, shouldn't really be stored in the repository anyway? (i.e: use `.gitignore`)... I appreciate this doesn't help for documents, but then maybe it would be more useful to use a different format like [Markdown](https://en.wikipedia.org/wiki/Markdown)...? So, try not to store generated files, and generate HTML/PDFs from text-based markup sources... not necessarily an easy thing to put into action. — Attie, Oct 25 '19 at 07:37
@Attie I used `jar` files as an example container. There are, however, some cases when storing JAR files would be justified: `wsimport` can produce pre-compiled `class` files out of WSDL from remote hosts, and storing `class` files is reasonable in order not to make the build process depend on an external system (assuming that artifact repositories like Apache Maven are not an option; WSDL versioning can be broken; etc). Packing the classes into a JAR file would be a single file rather than dozens of generated binary files including original timestamps. — terrorrussia-keeps-killing, Oct 25 '19 at 08:10
@fluffy: a tag name, such as `refs/tags/v2.1`, will typically point either to a commit object, or to a tag object. (It *can* point to a tree or blob, unlike a branch name like `refs/heads/master`, which is constrained to point to *only* a commit object.) If a tag name points to a tag object, the tag is an "annotated tag" and carries extra data, potentially including a PGP signature. The tag object in turn points to some other object—usually a commit, but again any of the four object types is legal here. — torek, Oct 25 '19 at 12:52

LeGEC · Accepted Answer · 2019-10-24T08:33:58.390

Most Git commands expect to find one of the four words blob, tree, commit or tag at the beginning of each object, so adding a new object type would be close to impossible.

Here is a manual experiment:

# I created an object with a new type 'foo' :
$ cat .git/objects/70/c52a28ff2b01f46ccc0cdd03c61c569fd6fd54 | pigz -dz; echo
foo10.abcdefghij    # the '.' is actually '\0'

# all regular git commands fail with an "unable to parse header of [object]" error :
$ git show 70c52a28ff2b01f46ccc0cdd03c61c569fd6fd54
error: unable to parse 70c52a28ff2b01f46ccc0cdd03c61c569fd6fd54 header
error: unable to parse 70c52a28ff2b01f46ccc0cdd03c61c569fd6fd54 header
fatal: loose object 70c52a28ff2b01f46ccc0cdd03c61c569fd6fd54 (stored in .git/objects/70/c52a28ff2b01f46ccc0cdd03c61c569fd6fd54) is corrupt

$ git fsck
error: unable to parse header of .git/objects/70/c52a28ff2b01f46ccc0cdd03c61c569fd6fd54
error: 70c52a28ff2b01f46ccc0cdd03c61c569fd6fd54: object corrupt or missing: .git/objects/70/c52a28ff2b01f46ccc0cdd03c61c569fd6fd54
Checking object directories: 100% (256/256), done.

# etc ...
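
For readers who want to reproduce this kind of experiment, the following Python sketch builds the bytes of a loose object using the standard layout (`<type> <size>\0<payload>`, zlib-compressed) and then applies a type check similar in spirit to the one Git performs. The function names and the `KNOWN_TYPES` set are invented for this illustration; this is not Git's actual parsing code, and it assumes SHA-1 object IDs.

```python
import hashlib
import zlib
from typing import Tuple

# The four object types Git recognizes.
KNOWN_TYPES = {b"blob", b"tree", b"commit", b"tag"}


def encode_loose_object(obj_type: bytes, payload: bytes) -> Tuple[str, bytes]:
    """Return (object ID, zlib-compressed bytes) for a loose object."""
    raw = obj_type + b" " + str(len(payload)).encode("ascii") + b"\0" + payload
    return hashlib.sha1(raw).hexdigest(), zlib.compress(raw)


def read_object_type(compressed: bytes) -> bytes:
    """Decompress a loose object and return its type, rejecting unknown types."""
    raw = zlib.decompress(compressed)
    header, _, _ = raw.partition(b"\0")
    obj_type, _, _size = header.partition(b" ")
    if obj_type not in KNOWN_TYPES:
        raise ValueError(f"unable to parse header: unknown type {obj_type!r}")
    return obj_type


object_id, data = encode_loose_object(b"foo", b"abcdefghij")
# The compressed bytes would live at .git/objects/<first 2 hex chars>/<remaining 38>.
print(object_id[:2] + "/" + object_id[2:])
try:
    read_object_type(data)
except ValueError as error:
    print(error)
```

Any reader that hard-codes the set of types in this way will reject the `foo` object, which is roughly what the errors above show.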

A possibility would be to write a more complete smudge/clean filter, which would store not only the ZIP's actual content, but also all of the extra data (such as timestamps, comments, ...).

Here is a first idea:

If archive.zip contains dir\file.txt:

create a tree named dir
store the directory header in a blob with a known name (dheader, for example)
store the header and the content of file.txt in two distinct blobs (hfile.txt and _file.txt, for example)
and so on for the other ZIP metadata

Using distinct prefixes should give you a clear separation between each type of data you need to store.
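
A rough Python sketch of this first idea, using the standard `zipfile` module, could look like the following. The function name `explode_zip`, the JSON encoding of the headers, and the choice of metadata fields are all assumptions made for illustration; a real filter would have to capture every field needed to reproduce the archive.

```python
import io
import json
import zipfile
from typing import Dict


def explode_zip(data: bytes) -> Dict[str, bytes]:
    """Map an archive to pseudo-paths: 'h<name>' for headers, '_<name>' for content."""
    entries: Dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            parent, _, base = info.filename.rstrip("/").rpartition("/")
            prefix = parent + "/" if parent else ""
            # Only a subset of the per-entry metadata, serialized as JSON.
            header = json.dumps({
                "date_time": list(info.date_time),
                "comment": info.comment.hex(),
                "extra": info.extra.hex(),
                "compress_type": info.compress_type,
                "external_attr": info.external_attr,
            }, sort_keys=True).encode("utf-8")
            if info.is_dir():
                # Explicit directory entry: keep its header under a known name.
                entries[prefix + base + "/dheader"] = header
            else:
                entries[prefix + "h" + base] = header
                entries[prefix + "_" + base] = archive.read(info)
    return entries


# Build a small archive in memory and show the resulting layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("dir/", b"")
    demo.writestr("dir/file.txt", b"hello")
for path in sorted(explode_zip(buffer.getvalue())):
    print(path)
```

Note that ZIP entry names use forward slashes, and that a directory only gets a `dheader` here if the archive contains an explicit entry for it.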

A second one would be:

manage to pack all of the archive's metadata into one single blob

etc ...

The smudge filter (the one that runs on checkout) would then have enough data to rebuild the same archive.

Note that "rebuilding the zip file" would require the smudge filter to implement all possible features of a ZIP archive (e.g. being able to compress in all known formats, ...).
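
As a sketch of the second idea together with the rebuild step, the Python code below packs the metadata into one JSON manifest and later reassembles an archive from that manifest plus the uncompressed contents. The names `split_archive` and `rebuild_archive` and the manifest layout are hypothetical. The result preserves names, timestamps, comments and contents, but it is not guaranteed to be byte-identical to the original, and it only supports the compression methods that Python's `zipfile` can write.

```python
import io
import json
import zipfile
from typing import Dict, Tuple


def split_archive(data: bytes) -> Tuple[bytes, Dict[str, bytes]]:
    """Return (manifest blob, {entry name: uncompressed content})."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        infos = archive.infolist()
        manifest = {
            "comment": archive.comment.hex(),
            "entries": [
                {
                    "name": info.filename,
                    "date_time": list(info.date_time),
                    "comment": info.comment.hex(),
                    "compress_type": info.compress_type,
                    "external_attr": info.external_attr,
                }
                for info in infos
            ],
        }
        contents = {info.filename: archive.read(info) for info in infos}
    return json.dumps(manifest, sort_keys=True).encode("utf-8"), contents


def rebuild_archive(manifest_blob: bytes, contents: Dict[str, bytes]) -> bytes:
    """Reassemble an archive from the manifest and the stored contents."""
    manifest = json.loads(manifest_blob)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.comment = bytes.fromhex(manifest["comment"])
        for entry in manifest["entries"]:
            info = zipfile.ZipInfo(entry["name"], date_time=tuple(entry["date_time"]))
            info.comment = bytes.fromhex(entry["comment"])
            # Writing fails here for methods zipfile cannot compress with.
            info.compress_type = entry["compress_type"]
            info.external_attr = entry["external_attr"]
            archive.writestr(info, contents[entry["name"]])
    return buffer.getvalue()
```

Storing the contents uncompressed is a deliberate choice in this sketch: it lets Git see the actual file data, at the cost of having to recompress on checkout.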

Thanks, nice idea! I didn't know that Git is very strict about object types (how else, anyway?) -- therefore it doesn't seem to be extensible. The clean/smudge filter would turn a Git repository into a sort of disassembled raw assembly playground, possibly containing compressed blobs only in order not to have to support all possible encryption schemes (comments, timestamps and empty directories are easier to keep). So, if I understand correctly, adding a new "container" object type would require upgrading both local and remote Git versions, right? — terrorrussia-keeps-killing, Oct 25 '19 at 07:43
