
I have read several times that GitHub reduces wasted storage by not storing every file in clones, only the files that change. Source

How does it do this, and how can I replicate it? I was not able to find such a feature in Git.

Note that I don't mind using another VCS that has this functionality.

bahrep
Yahya Uddin
  • They are co-hosting forks on the same machine and use a shared repository storage backend. – Thilo Sep 12 '16 at 00:41
  • Related (but there was better discussion about it here, too, cannot find it right now): http://stackoverflow.com/questions/11974686/explanation-of-github-fork-and-how-they-store-files – Thilo Sep 12 '16 at 00:44
  • Blog from GitHub team about their storage backend (which is just Git basically): http://githubengineering.com/introducing-dgit/ – Thilo Sep 12 '16 at 00:47
  • @Thilo this is interesting but I don't understand how this explains how GitHub avoids storage wastage by storing only the files that change in forks – Yahya Uddin Sep 12 '16 at 01:12
  • Git already does that. In a branch, only changed files ("objects") need to be stored. A fork can be just a clone on the same server, with the same object repository attached. That way, a fork is essentially the same as a branch (as far as storage is concerned). – Thilo Sep 12 '16 at 01:18
  • @Thilo So you're saying that clones should be stored internally on the server as a branch? – Yahya Uddin Sep 12 '16 at 01:23
  • Not as a branch. But you can register a common object storage for all your git repositories on the same server. My understanding is that this is what GitHub does. http://stackoverflow.com/questions/23304374/what-are-the-differences-between-git-clone-shared-and-reference – Thilo Sep 12 '16 at 02:41
  • @Thilo What do you mean by common object storage? Do you mean use `git clone --shared`? – Yahya Uddin Sep 12 '16 at 11:41
  • All I am saying is that you can have more than one repository share the same disk space for their objects. This functionality is already provided by git core. – Thilo Sep 12 '16 at 12:09
  • @Thilo Thanks. Please summarise your comments as an answer. – Yahya Uddin Sep 12 '16 at 15:44

1 Answer

The GitHub team has published a fairly detailed article about their storage layer.

An interesting aspect is that they do not suffer from NIH syndrome: they built all their work on top of existing functionality in core Git.

Perhaps it’s surprising that GitHub’s repository-storage tier, DGit, is built using the same technologies. Why not a SAN? A distributed file system? Some other magical cloud technology that abstracts away the problem of storing bits durably?

The answer is simple: it’s fast and it’s robust.

So, to come back to your question:

You can (in core Git) configure a shared object storage location that more than one repository uses. This is the alternates mechanism that `git clone --shared` and `git clone --reference` set up.

GitHub makes use of this by co-locating forks of a repository on the same server (or set of servers, for redundancy and availability). As a result, any duplicate objects (and there will be many in a fork) need to be stored only once.
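To try the shared object storage idea locally, here is a minimal sketch using `git clone --shared`. It only illustrates the core Git mechanism and is not a description of GitHub's actual setup. The directory names `upstream` and `fork` and the demo identity are placeholders chosen for this example.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so nothing existing is touched.
work=$(mktemp -d)
cd "$work"

# A stand-in for the original repository (the name is hypothetical).
git init -q upstream
echo "hello" > upstream/README
git -C upstream add README
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial commit"

# A "fork" that borrows objects from upstream instead of copying them.
git clone -q --shared upstream fork

# The fork records where to look for objects it does not hold itself.
cat fork/.git/objects/info/alternates

# Reports the fork's own object counts plus an "alternate:" line.
git -C fork count-objects -v
```

New commits made in `fork` are written to the fork's own object directory, so only what differs from `upstream` takes extra space. One caveat from the Git documentation: if `upstream` later prunes objects that the fork still relies on, the fork can become corrupted, so a setup like this needs care around garbage collection.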

Community
Thilo