GitHub recently introduced an extension to Git for storing large files in a different way. What exactly do they mean by "the extension replaces large files with text pointers inside Git"?

Tarun Chabarwal

1 Answer

You can see in the git-lfs sources how a "text pointer" is defined:

type Pointer struct {
    Version string
    Oid     string
    Size    int64
    OidType string
} 

The smudge and clean sources show that git-lfs uses a content filter driver in order to:

  • download the actual files on checkout
  • store them in their external source on commit.

See the pointer specs:

The core Git LFS idea is that instead of writing large blobs to a Git repository, only a pointer file is written.

version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
(ending \n)
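To make the pointer format concrete, here is a minimal Go sketch that reads the three lines shown above into the Pointer struct and writes them back out. ParsePointer and Encode are hypothetical helper names chosen for this illustration; they are not the functions git-lfs itself uses, and the sketch only handles the version, oid and size keys.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Pointer mirrors the fields quoted from the git-lfs sources.
type Pointer struct {
	Version string
	Oid     string
	Size    int64
	OidType string
}

// ParsePointer is a hypothetical helper (not the git-lfs implementation).
// It reads "key value" lines and fills a Pointer.
func ParsePointer(text string) (*Pointer, error) {
	p := &Pointer{}
	for _, line := range strings.Split(strings.TrimSuffix(text, "\n"), "\n") {
		parts := strings.SplitN(line, " ", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed pointer line: %q", line)
		}
		key, value := parts[0], parts[1]
		switch key {
		case "version":
			p.Version = value
		case "oid":
			// The oid value has the form "<type>:<hex digest>".
			oid := strings.SplitN(value, ":", 2)
			if len(oid) != 2 {
				return nil, fmt.Errorf("malformed oid: %q", value)
			}
			p.OidType, p.Oid = oid[0], oid[1]
		case "size":
			n, err := strconv.ParseInt(value, 10, 64)
			if err != nil {
				return nil, fmt.Errorf("malformed size: %w", err)
			}
			p.Size = n
		}
	}
	if p.Version == "" || p.Oid == "" {
		return nil, fmt.Errorf("not a pointer file")
	}
	return p, nil
}

// Encode writes the pointer back in the three-line layout, ending with \n.
func (p *Pointer) Encode() string {
	return fmt.Sprintf("version %s\noid %s:%s\nsize %d\n",
		p.Version, p.OidType, p.Oid, p.Size)
}
```

Calling ParsePointer on the example above and then Encode on the result should give back the same text, which is a quick way to check that nothing is lost in the round trip.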

Git LFS needs a URL endpoint to talk to a remote server.
A Git repository can have different Git LFS endpoints for different remotes.

The actual file is uploaded to or downloaded from a server that implements the Git LFS API.

This is confirmed by the git-lfs man page, which mentions:

The actual file gets pushed to a Git LFS API

You need a Git server that implements that API in order to support uploading and downloading binary content.


Regarding the content filter driver (which has existed in Git for a long time, well before LFS, and which LFS uses here to add this "large file management" feature), this is where the bulk of the work happens:

The smudge filter runs as files are being checked out from the Git repository to the working directory.
Git sends the content of the Git blob as STDIN, and expects the content to write to the working directory as STDOUT.

  • Read 100 bytes.
  • If the content is ASCII and matches the pointer file format:
    • Look for the file in .git/lfs/objects/{OID}.
    • If it's not there, download it from the server.
    • Read its contents to STDOUT.
  • Otherwise, simply pass the STDIN out through STDOUT.
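The smudge steps can be sketched in Go roughly as follows. This is an illustration under stated assumptions, not the git-lfs code: Smudge and SmudgeBytes are hypothetical names, the object store is a plain directory passed in as objectsDir, the ASCII check is reduced to a prefix test, and the download step is replaced by an error.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

const pointerPrefix = "version https://git-lfs.github.com/spec/"

// Smudge is a hypothetical sketch of the smudge steps.
// in plays the role of STDIN, out the role of STDOUT.
func Smudge(in io.Reader, out io.Writer, objectsDir string) error {
	br := bufio.NewReader(in)
	// Peek at the first 100 bytes without consuming them.
	head, _ := br.Peek(100)
	if !bytes.HasPrefix(head, []byte(pointerPrefix)) {
		// Not a pointer: pass the input through unchanged.
		_, err := io.Copy(out, br)
		return err
	}
	// Pointer files are tiny, so reading the rest into memory is cheap.
	raw, err := io.ReadAll(br)
	if err != nil {
		return err
	}
	oid := ""
	for _, line := range strings.Split(string(raw), "\n") {
		if strings.HasPrefix(line, "oid sha256:") {
			oid = strings.TrimPrefix(line, "oid sha256:")
		}
	}
	// Only accept a 64-character hex digest before using it as a file name.
	if _, err := hex.DecodeString(oid); err != nil || len(oid) != 64 {
		return fmt.Errorf("pointer file has no valid sha256 oid")
	}
	f, err := os.Open(filepath.Join(objectsDir, oid))
	if os.IsNotExist(err) {
		// A real filter would download the object from the server here.
		return fmt.Errorf("object %s is not in the local store", oid)
	}
	if err != nil {
		return err
	}
	defer f.Close()
	// Replace the pointer with the real file content.
	_, err = io.Copy(out, f)
	return err
}

// SmudgeBytes is a convenience wrapper for in-memory blobs.
func SmudgeBytes(blob []byte, objectsDir string) ([]byte, error) {
	var out bytes.Buffer
	err := Smudge(bytes.NewReader(blob), &out, objectsDir)
	return out.Bytes(), err
}
```

Peeking rather than reading keeps the pass-through branch simple: ordinary blobs are copied to the output untouched, and only blobs that look like pointers are parsed.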

The clean filter runs as files are added to repositories.
Git sends the content of the file being added as STDIN, and expects the content to write to Git as STDOUT.

  • Stream binary content from STDIN to a temp file, while calculating its SHA-256 signature.
  • Check for the file at .git/lfs/objects/{OID}.
  • If it does not exist:
    • Queue the OID to be uploaded.
    • Move the temp file to .git/lfs/objects/{OID}.
  • Delete the temp file.
  • Write the pointer file to STDOUT.
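A comparable Go sketch of the clean steps is shown below. Again this is only an illustration: Clean and CleanBytes are hypothetical names, objectsDir stands in for .git/lfs/objects, and queuing the OID for upload is left as a comment.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// Clean is a hypothetical sketch of the clean steps.
// It reads file content from in, stores it under objectsDir/{OID}
// and writes a pointer file to out.
func Clean(in io.Reader, out io.Writer, objectsDir string) error {
	if err := os.MkdirAll(objectsDir, 0o755); err != nil {
		return err
	}
	// Stream to a temp file while hashing, so large files never sit in memory.
	tmp, err := os.CreateTemp(objectsDir, "incoming-*")
	if err != nil {
		return err
	}
	// Delete the temp file on exit; this fails harmlessly after a rename.
	defer os.Remove(tmp.Name())

	h := sha256.New()
	size, err := io.Copy(io.MultiWriter(tmp, h), in)
	if cerr := tmp.Close(); err == nil {
		err = cerr
	}
	if err != nil {
		return err
	}
	oid := hex.EncodeToString(h.Sum(nil))

	dest := filepath.Join(objectsDir, oid)
	if _, statErr := os.Stat(dest); os.IsNotExist(statErr) {
		// First time this content is seen: keep it in the local store.
		// A real filter would also queue the OID for upload here.
		if err := os.Rename(tmp.Name(), dest); err != nil {
			return err
		}
	}
	// Git stores this small pointer instead of the file content.
	_, err = fmt.Fprintf(out,
		"version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %d\n",
		oid, size)
	return err
}

// CleanBytes is a convenience wrapper for in-memory content.
func CleanBytes(content []byte, objectsDir string) (string, error) {
	var out bytes.Buffer
	err := Clean(bytes.NewReader(content), &out, objectsDir)
	return out.String(), err
}
```

Because the object is stored under its own SHA-256 digest, cleaning the same content twice yields the same pointer and does not store a second copy.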

Git 2.11 (Nov. 2016) includes a commit that explains in more detail how this works: commit edcc858, helped by Martin-Louis Bright and signed off by Lars Schneider.

convert: add filter.<driver>.process option

Git's clean/smudge mechanism invokes an external filter process for every single blob that is affected by a filter. If Git filters a lot of blobs then the startup time of the external filter processes can become a significant part of the overall Git execution time.

In a preliminary performance test this developer used a clean/smudge filter written in golang to filter 12,000 files. This process took 364s with the existing filter mechanism and 5s with the new mechanism. See details here: git-lfs/git-lfs#1382

This patch adds the filter.<driver>.process string option which, if used, keeps the external filter process running and processes all blobs with the packet format (pkt-line) based protocol over standard input and standard output.
The full protocol is explained in detail in Documentation/gitattributes.txt.

A few key decisions:

  • The long running filter process is referred to as filter protocol version 2 because the existing single shot filter invocation is considered version 1.
  • Git sends a welcome message and expects a response right after the external filter process has started. This ensures that Git will not hang if a version 1 filter is incorrectly used with the filter.<driver>.process option for version 2 filters. In addition, Git can detect this kind of error and warn the user.
  • The status of a filter operation (e.g. "success" or "error") is set before the actual response and (if necessary!) re-set after the response. The advantage of this two step status response is that if the filter detects an error early, then the filter can communicate this and Git does not even need to create structures to read the response.
  • All status responses are pkt-line lists terminated with a flush packet. This allows us to send other status fields with the same protocol in the future.
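For readers unfamiliar with pkt-line framing, here is a small Go sketch of it. Each packet starts with four hex digits giving the total length (payload plus the four length bytes), and the flush packet is the literal "0000". EncodePkt, ReadPktList and DecodePktList are hypothetical helper names for this illustration; the welcome message used in the usage note is the one described in Documentation/gitattributes.txt.

```go
package main

import (
	"fmt"
	"io"
	"strconv"
	"strings"
)

// FlushPkt terminates a pkt-line list.
const FlushPkt = "0000"

// maxPayload is the largest payload one packet can carry (65520 minus
// the 4 length bytes).
const maxPayload = 65516

// EncodePkt frames one payload as a pkt-line.
func EncodePkt(payload string) (string, error) {
	if len(payload) > maxPayload {
		return "", fmt.Errorf("payload too large: %d bytes", len(payload))
	}
	// The length prefix counts itself, hence the +4.
	return fmt.Sprintf("%04x%s", len(payload)+4, payload), nil
}

// ReadPktList reads pkt-lines from r until it sees a flush packet.
func ReadPktList(r io.Reader) ([]string, error) {
	var lines []string
	for {
		header := make([]byte, 4)
		if _, err := io.ReadFull(r, header); err != nil {
			return nil, err
		}
		n, err := strconv.ParseUint(string(header), 16, 16)
		if err != nil {
			return nil, fmt.Errorf("bad pkt-line length %q: %w", header, err)
		}
		if n == 0 {
			return lines, nil // flush packet ends the list
		}
		if n < 4 {
			return nil, fmt.Errorf("invalid pkt-line length %d", n)
		}
		payload := make([]byte, n-4)
		if _, err := io.ReadFull(r, payload); err != nil {
			return nil, err
		}
		lines = append(lines, string(payload))
	}
}

// DecodePktList is a convenience wrapper for in-memory data.
func DecodePktList(data string) ([]string, error) {
	return ReadPktList(strings.NewReader(data))
}
```

As a usage example, the welcome message Git sends to a version 2 filter is the list "git-filter-client\n", "version=2\n", followed by a flush packet. EncodePkt turns the first line into "0016git-filter-client\n", since 18 payload bytes plus 4 length bytes is 22, or 0x16.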

As a consequence, a warning was added to the documentation in Git 2.12 (Q1 2017):

See commit 7eeda8b (18 Dec 2016), and commit c6b0831 (03 Dec 2016) by Lars Schneider (larsxschneider).
(Merged by Junio C Hamano -- gitster -- in commit 08721a0, 27 Dec 2016)

docs: warn about possible '=' in clean/smudge filter process values

A pathname value in a clean/smudge filter process "key=value" pair can contain the '=' character (introduced in edcc858).
Make the user aware of this issue in the docs, add a corresponding test case, and fix the issue in filter process value parser of the example implementation in contrib.
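The pitfall is easy to show in Go: a parser that splits a "key=value" line on every '=' breaks as soon as the pathname contains one, so the split has to stop at the first '='. SplitKeyValue below is a hypothetical helper written for this illustration, not the parser from Git's contrib example.

```go
package main

import (
	"fmt"
	"strings"
)

// SplitKeyValue splits a "key=value" line at the first '=' only,
// so the value may itself contain '=' characters.
func SplitKeyValue(line string) (key, value string, err error) {
	line = strings.TrimSuffix(line, "\n")
	i := strings.IndexByte(line, '=')
	if i < 0 {
		return "", "", fmt.Errorf("no '=' in %q", line)
	}
	return line[:i], line[i+1:], nil
}
```

For example, SplitKeyValue("pathname=path/with=equals.bin\n") returns the key "pathname" and the value "path/with=equals.bin".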

VonC
  • For people like me who are struggling to understand the basic ideas behind this comprehensive answer: a 'smudge' command runs when code is checked out from a repository, and a 'clean' command runs in the opposite direction, i.e. when code is committed. I think of them as being a bit like hooks that can be used to insert actions into the commit or checkout workflow. As @VonC explains, git-lfs uses them to store a pointer instead of a large file in the commit step. I found the link to git attribute filter drivers particularly useful: http://schacon.github.io/git/gitattributes.html – John Dec 06 '20 at 18:26
  • @John Exactly. You can see an example of smudge hook here: https://stackoverflow.com/a/64710964/6309. And clean: https://stackoverflow.com/a/30945624/6309 – VonC Dec 06 '20 at 18:56