How does GitHub reduce storage waste?

Question

I have read multiple times that GitHub reduces waste by NOT storing EVERY file in clones, but only the files that change. Source

How does it do this or how can I replicate this type of feature, as I was not able to find this feature in Git?

Note that I don't mind using other VCS that has this functionality.

They are co-hosting forks on the same machine and use a shared repository storage backend. — Thilo, Sep 12 '16 at 00:41
Related (but there was better discussion about it here, too, cannot find it right now): http://stackoverflow.com/questions/11974686/explanation-of-github-fork-and-how-they-store-files — Thilo, Sep 12 '16 at 00:44
Blog from GitHub team about their storage backend (which is just Git basically): http://githubengineering.com/introducing-dgit/ — Thilo, Sep 12 '16 at 00:47
@Thilo this is interesting but I don't understand how this explains how GitHub avoids storage wastage by storing only the files that changes in forks — Yahya Uddin, Sep 12 '16 at 01:12
Git already does that. In a branch, only changed files ("objects") need to be stored. A fork can be just a clone on the same server, with the same object repository attached. That way, a fork is essentially the same as a branch (as far as storage is concerned). — Thilo, Sep 12 '16 at 01:18
@Thilo So your saying that clones should be stored on the server internally on the server as a branch? — Yahya Uddin, Sep 12 '16 at 01:23
Not as a branch. But you can register a common object storage for all your git repositories on the same server. My understanding is that this is what Github does. http://stackoverflow.com/questions/23304374/what-are-the-differences-between-git-clone-shared-and-reference — Thilo, Sep 12 '16 at 02:41
@Thilo What do you mean by common object storage? Do you mean use `git clone --shared`? — Yahya Uddin, Sep 12 '16 at 11:41
All I am saying is that you can have more than one repository share the same disk space for their objects. This functionality is already provided by git core. — Thilo, Sep 12 '16 at 12:09

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

The Github team has put up a fairly detailed article about their storage layer.

An interesting aspect is that they do not suffer from NIH-syndrome, but based all their work on top of existing functionality in the core git system.

Perhaps it’s surprising that GitHub’s repository-storage tier, DGit, is built using the same technologies. Why not a SAN? A distributed file system? Some other magical cloud technology that abstracts away the problem of storing bits durably?

The answer is simple: it’s fast and it’s robust.

So, to come back to your question:

You can (in core git) configure a shared object storage location to be used together by more than one repository.

Github makes use of this by co-locating forks of repositories on the same server (or set of servers, for redundancy and availability). As a result, any duplicate objects (and there will be many in a fork) will need to be stored just once.

How does GitHub reduce storage waste?

1 Answers1