Possibility for git "overlays" (storing only differences to extern repositories in a local repository)?

Question

I would like to do something that is best described in this mailing list post I found:

git Archives: GIT overlay repositories (unsw.edu.au)

Start with two repositories, let's call them Repo-A and Repo-B. Repo-A is hosted on some server somewhere and contains lots of code (let's say it's a kernel source repository). Repo-B only adds a small number of changes (for argument's sake, let's say the IPW2100 and IPW2200 projects) on top of what is already provided by Repo-A.

For several reasons, we would like users to be able to get just the differences between Repo-A and Repo-B from me.

For example, the user gets the full Repo-A: [...]
and then overlays just the delta, which they obtain from me: [...]

The problem is, I just cannot find any other references to this concept (another thing frustrating the search efforts is that Gentoo has something called "git overlays" in its package manager; and TortoiseGIT has "overlay" icons). The thread itself seems to have only one reply, is from 2005, and it suggests the introduction of "ancestors file stored on the overlay repository", which was probably never implemented in git proper. While that posting actually includes bash scripts to demonstrate the concept, they are based on rsync-ing .git internals directly, which I don't really feel confident about testing.

My question is - is there a standard way (e.g. using mostly git commands, or shell scripts that would be called in context of git) in which this kind of operation can be achieved? Alternatively, are there some "filesystem overlay" tricks I could use under Linux, to achieve something to that effect?

I thought git submodules could be used, but apparently they can't; I prepared a small bash script to test that:

#!/usr/bin/env bash
set -x

rm -rf repoM-git

mkdir repoM-git
cd repoM-git
git init
git config user.name "me"
git config user.email "my@self.com"

git submodule add https://github.com/defunkt/github-gem.git repo1
git submodule add https://gist.github.com/6462971.git repo2

git status
git commit -m "initial checkin"

cd repo1
git config user.name "me"
git config user.email "my@self.com"
SOMETAG=$(git tag --list | awk 'NR==4{print $0;}')
{ echo "Checking out $SOMETAG in repo1"; } 2>/dev/null
git checkout $SOMETAG
{ echo "Creating myhack branch"; } 2>/dev/null
git checkout -b myhack
{ echo "Attempting to change"; } 2>/dev/null
echo "AHOOOOOY" >> README
git add -u
git status
{ echo "Commiting in submodule repo1..."; } 2>/dev/null
git commit -m "first change"
git status

{ echo "Going back to main repoM"; } 2>/dev/null
cd ..
git add -u
git status
git diff --cached

Running this script prints the following at the end:

HEAD is now at b6df531... Bump the version to 0.1.3
Creating myhack branch
+ git checkout -b myhack
Switched to a new branch 'myhack'
Attempting to change
+ echo AHOOOOOY
+ git add -u
+ git status
# On branch myhack
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#   modified:   README
#
Commiting in submodule repo1...
+ git commit -m 'first change'
[myhack 0e01195] first change
 1 file changed, 1 insertion(+)
+ git status
# On branch myhack
nothing to commit (working directory clean)
Going back to main repoM
+ cd ..
+ git add -u
+ git status
# On branch master
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#   modified:   repo1
#
+ git diff --cached
diff --git a/repo1 b/repo1
index 8ef0c30..0e01195 160000
--- a/repo1
+++ b/repo1
@@ -1 +1 @@
-Subproject commit 8ef0c3087d2e5d1f6fe328c06974d787b47df423
+Subproject commit 0e01195675f2e1585cdbdffb9fffb3cca2e5f547

This basically confirms that the submodule is its own repo/work area, with its own .git directory. What I'd want instead is for my "master" repository to record the changes to any "child" repositories that may be included. For instance, in the example above, I'd want repoM to track not just that I've made a change in repo1 (which originally comes from elsewhere) with respect to its tag 'v0.1.3' (i.e., its underlying SHA-1 commit hash), but also to record the changes (or the diff) themselves. Is this possible, with submodules or otherwise?

I don't understand why you can't just do this with your own published fork. If you keep your fork updated with the upstream project, other developers could either clone your repository directly, or start with the upstream one and fetch from one of your branches (since it contains all the same Git objects as the upstream one). You have to keep your code in sync with the upstream repository anyway, or else your "overlays" aren't guaranteed to apply cleanly. — ChrisGPT was on strike, Feb 01 '15 at 15:57
You may also want to note that the mailing list message that you refer to is from July, 2005 (just three months after Git's initial release). It is now almost ten years later, and *lots* has changed. — ChrisGPT was on strike, Feb 01 '15 at 15:59
Thanks, @Chris - basically, I'd want to "host" only my own changes, so users would do just do `git clone http://myrepo`, followed by `git submodule update` (for each sub) to retrieve the base code from their originating repo servers. As it is now, I'd have to host each project's _entire_ codebase _plus_ also my branch for the submodules, so that the users could retrieve everything using the same commands; but strictly from "my" server. This with "fetch from one of your branches" sounds promising - but I'd still have to host the entire, (say, kernel) git codebase, right? — sdaau, Feb 01 '15 at 17:40
( Btw, did note that 2005 is "ancient", but wasn't aware that it is just three months after initial release `:)` ) — sdaau, Feb 01 '15 at 17:41
"As it is now, I'd have to host each project's *entire* codebase *plus* also my branch for the submodules..." Are you worried about bandwidth or something? — ChrisGPT was on strike, Feb 01 '15 at 18:17
@Chris - well, yes (although all is a bit hypothetical at this point); that, and the fact that I'd like a clear separation between what is the original repo, and what are my changes. — sdaau, Feb 01 '15 at 18:22
Just dropping in to note that an "overlay" branches facility would be frigging amazing for keeping a bunch of local changes never meant for upstream. Ie you could keep local config in an overlay branch and merge it into master/whatever but those specific commits would never make it upstream because they belong to the "overlay". Can do it with stashes, but its kinda awkward and sometimes hard to reason about. — Shayne, Feb 03 '23 at 02:22

score 6 · Answer 1 · answered Feb 02 '15 at 18:13

Git is already well-suited to what you want to do, even without any extensions.

Here is one way that I might maintain my own fork of an upstream repository, using GitHub's hub repo as an example:

Clone the upstream repository and rename its remote.
```
git clone git@github.com:github/hub.git
cd hub    # the clone lands in ./hub; the next command must run inside it
git remote rename origin upstream
```
At this point my repository will look something like this:
```
          D---E  [master][upstream/master]
         /
A---B---C  [tag:v1.12.4]
```
Note that I have included the most recent tag, v1.12.4, in my diagram. It's always a good idea to start working from a known state.

Get to a known state.

I'll work from one of hub's releases, so I need to move my master branch to the v1.12.4 tag before I start:
```
git reset --hard v1.12.4
```

Make some changes.

After a while, my repository may look something like this:

          D---E  [upstream/master]
         /
A---B---C  [tag:v1.12.4]
         \
          1---2---3 [master]

Publish.

Whenever you are ready, users can use your master branch, or any new tags you create, to retrieve your source code. Because the commits A, B and C exist both in your repository and in the upstream repository, somebody who has previously cloned the upstream repository can simply fetch your changes, perhaps into a local branch called sdaau-master.
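
The publish step can be tried out without any server. The following is only a sketch under assumptions: local directories stand in for the upstream project and for the fork, and the branch name `main`, the remote name `sdaau`, and the file names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: local directories play the role of the upstream server and the fork.
set -e

work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# "upstream": a single commit on a branch called main (hypothetical stand-in).
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/main
echo base > "$work/upstream/README"
git -C "$work/upstream" add README
git -C "$work/upstream" commit -q -m "A"

# "fork": cloned from upstream, plus one commit of its own.
git clone -q "$work/upstream" "$work/fork"
echo change >> "$work/fork/README"
git -C "$work/fork" commit -q -am "1"

# A user who already cloned upstream adds the fork as a second remote and
# fetches its main branch into a local branch named sdaau-master.
git clone -q "$work/upstream" "$work/user"
git -C "$work/user" remote add sdaau "$work/fork"
git -C "$work/user" fetch -q sdaau main:sdaau-master

git -C "$work/user" log --oneline sdaau-master
```

Because commit A is already present in the user's clone, the fetch only needs to transfer the objects that are new in the fork. That is about as close to "just the delta" as plain Git gets from the user's side, although the fork itself still hosts the full history.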

Update.

Your changes are relative to tag v1.12.4, but what happens when the upstream repository changes? Let's say they've released a new version v1.13 and you want to support that as well.

Easy: Just `git fetch upstream` to get the new changes...

                            I---J---K  [upstream/master]
                           /
          D---E---F---G---H  [tag:v1.13]
         /
A---B---C  [tag:v1.12.4]
         \
          1---2---3  [master]

...and merge them into your master branch with `git merge v1.13`:

                            I---J---K  [upstream/master]
                           /
          D---E---F---G---H  [tag:v1.13]
         /                 \
A---B---C  [tag:v1.12.4]    \
         \                   \
          1---2---3-----------4  [master]

Rinse and repeat.

                                              N---O [upstream/master]
                                             /
                            I---J---K---L---M  [tag:v1.13.1]
                           /                 \
          D---E---F---G---H  [tag:v1.13]      \
         /                 \                   \
A---B---C  [tag:v1.12.4]    \                   \
         \                   \                   \
          1---2---3-----------4---5---6---7-------8---9  [master]
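
One round of the update cycle might look like the sketch below. It is an illustration under assumptions, not the real hub repository: a local directory stands in for upstream, the tag names are borrowed from the diagrams, and the file names `core.txt` and `local.txt` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: a local directory stands in for the upstream project.
set -e

work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Upstream with a tagged release.
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/master
echo one > "$work/upstream/core.txt"
git -C "$work/upstream" add core.txt
git -C "$work/upstream" commit -q -m "C"
git -C "$work/upstream" tag v1.12.4

# My clone: rename the remote and commit a change of my own.
git clone -q "$work/upstream" "$work/mine"
cd "$work/mine"
git remote rename origin upstream
echo mine > local.txt
git add local.txt
git commit -q -m "1"

# Meanwhile upstream moves on and tags a new release.
echo two >> "$work/upstream/core.txt"
git -C "$work/upstream" commit -q -am "H"
git -C "$work/upstream" tag v1.13

# The update step: fetch the new commits and tags, then merge the release tag.
git fetch -q upstream
git merge -q --no-edit v1.13

git log --oneline --graph
```

The two sides touch different files here, so the merge is clean. In a real fork, conflicts at this step are where most of the maintenance effort goes.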

Some benefits of this approach are listed below:

Throughout all of this, your changes remain in your own branch. Of course, you can create as many of your own branches and tag as many releases as you want. Depending on the complexity of the project, this is probably a good idea.
Your work remains linked to the upstream repository. You can update your code when the upstream project gets updated, and it becomes very easy for other users to pull in your changes.
You can contribute upstream. This configuration also lets you submit patches to the upstream project quite easily. You may do this through a GitHub "fork", with their proprietary pull requests, or using standard Git commands like bundle, format-patch, apply and am.
Explicit relationship. Looking at the network graphs, it becomes very clear that your work is your own, and that it is based upon the upstream project.

The only real drawback is bandwidth, which can be mitigated by hosting your repository on a service like GitHub, GitLab or Bitbucket.

Many thanks for this @Chris - at first I was like "Why didn't I think of this", but the problem is that this requires I work in the root of the checked out repository: I cannot have the checked out repository as a subdirectory. If I want a subdirectory, then I need a submodule - and then I have the same problem as in the OP. Basically I am looking for an approach like this, but where the main root of the repo is my code, and subdirectories could be extern repos... — sdaau, May 16 '16 at 07:33
@sdaau, perhaps I still don't understand what you're asking. But it sounds like [subtrees](http://blogs.atlassian.com/2013/05/alternatives-to-git-submodule-git-subtree/) might suit your needs better than submodules. — ChrisGPT was on strike, May 16 '16 at 18:33
*"You can contribute upstream."* I'm not clear how this works. Let's say commits `5`, `6`, and `7` fix a bug in `master` and I want to contribute them to `upstream/master`. Since they're built on commit `4` (and not a commit from upstream, like `H` or `M`), it looks impossible to contribute them directly without creating a separate branch off of `H` or `M` and cherry-picking `5`/`6`/`7` (or perhaps rebasing off of `H` or `M`). Is there an easier way? — Roy Tinker, May 18 '20 at 17:44
Yes, you'd probably want to cherry-pick or rebase those commits onto `upstream/master`. That's pretty straightforward; I'm not sure how much easier you expect it to be. — ChrisGPT was on strike, May 18 '20 at 19:41
Git stash is probably better suited for this than a concrete branch. See my answer. — tonywl, Mar 30 '22 at 03:33
@tonywl, if the change is very small and never changes, maybe. But for anything more substantial a real branch is likely better than a stash. Consider that the last graph above shows seven commits that I've made and two merges back from upstream across three tagged releases. — ChrisGPT was on strike, Apr 20 '22 at 11:48

IljaBek · Answer 2 · 2017-08-18T13:59:51.590

I was in the same situation and found a solution elsewhere on SE. I'll try to describe it.

Given two repositories with these directory structures:

https://github.com/company/full_project.git

/full_project/subfolder/a.txt

https://github.com/me/project_delta.git

/a.txt

One needs to redirect the .git entry of the subfolder to the delta repository:

CurrentDir=$PWD
git clone https://github.com/company/full_project.git full_project
git clone https://github.com/me/project_delta.git project_delta
echo "gitdir: ${CurrentDir}/project_delta/.git" > "${CurrentDir}/full_project/subfolder/.git"

Running git status in ${PWD}/full_project/subfolder now reports a.txt as an 'uncommitted' change relative to project_delta.

cd full_project/subfolder
git checkout .

That should do it: the "changes" are now reset to the state of me/project_delta.git.
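
The same redirection can be reproduced offline. This is a sketch under assumptions: two throwaway local repositories stand in for company/full_project and me/project_delta, and the file contents are invented.

```shell
#!/usr/bin/env bash
# Sketch only: local repositories stand in for the two GitHub projects.
set -e

work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# full_project tracks subfolder/a.txt.
git init -q "$work/full_project"
mkdir "$work/full_project/subfolder"
echo "upstream version" > "$work/full_project/subfolder/a.txt"
git -C "$work/full_project" add subfolder/a.txt
git -C "$work/full_project" commit -q -m "full project"

# project_delta tracks only its own a.txt.
git init -q "$work/project_delta"
echo "my version" > "$work/project_delta/a.txt"
git -C "$work/project_delta" add a.txt
git -C "$work/project_delta" commit -q -m "delta"

# A .git *file* inside subfolder makes Git use project_delta's repository
# whenever a command is run from that directory.
echo "gitdir: $work/project_delta/.git" > "$work/full_project/subfolder/.git"

cd "$work/full_project/subfolder"
git status --short      # a.txt is compared against project_delta's history
git checkout -q -- .    # overwrite a.txt with the version from project_delta
cat a.txt
```

One side effect worth keeping in mind: the outer full_project repository now sees a .git entry inside subfolder, so commands run from the project root may treat that directory as a nested repository.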

tonywl · Answer 3 · 2022-03-29T03:17:39.000

There are two use cases that I can think of:

your own changes modify existing code from upstream; or
your own changes add code (utility tools to set up the environment, run the project, etc.)

In the first case, the easiest way I have found is to keep personal changes in git stash on a semi-permanent basis and reapply them whenever I have pulled the latest changes from upstream.
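
A minimal sketch of that stash workflow, assuming a throwaway local upstream and hypothetical file names (`app.conf` for the personal tweak, `main.c` for upstream's work):

```shell
#!/usr/bin/env bash
# Sketch only: a local directory stands in for the upstream repository.
set -e

work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q "$work/upstream"
echo "line1" > "$work/upstream/app.conf"
echo "code"  > "$work/upstream/main.c"
git -C "$work/upstream" add .
git -C "$work/upstream" commit -q -m "initial"

git clone -q "$work/upstream" "$work/clone"
cd "$work/clone"
echo "debug=true" >> app.conf    # personal change that should never go upstream

# Upstream publishes new work in a different file.
echo "more" >> "$work/upstream/main.c"
git -C "$work/upstream" commit -q -am "upstream work"

git stash                # park the personal change
git pull -q --ff-only    # bring in the latest upstream commits
git stash pop            # reapply the personal change on top

git diff --stat
```

If upstream edits the same lines as the stashed change, `git stash pop` stops with a conflict and leaves the stash entry in place, which is one reason this approach fits small, stable tweaks best.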

In the second case, the additional files can be checked in and versioned in a separate git repository. They can then be applied and updated by setting GIT_DIR= and GIT_WORK_TREE=, something like

GIT_DIR="$external_git_repo_dir" GIT_WORK_TREE="$project_dir" git status

Maybe use a shell alias/function/script to facilitate the git operations.
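
Such a wrapper could look like the sketch below. The function name `overlay_git`, the bare repository, and the file `setup-env.sh` are assumptions made for illustration; only the two environment variables come from the command above.

```shell
#!/usr/bin/env bash
# Sketch only: overlay_git is a hypothetical helper name.
set -e

work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

external_git_repo_dir="$work/overlay.git"   # where the overlay history lives
project_dir="$work/project"                 # stand-in for the upstream checkout

# Run any git command against the overlay repository, with the project
# directory acting as the work tree.
overlay_git() {
    GIT_DIR="$external_git_repo_dir" GIT_WORK_TREE="$project_dir" git "$@"
}

mkdir "$project_dir"
echo '#!/bin/sh' > "$project_dir/setup-env.sh"   # an "additional" utility file

git init -q --bare "$external_git_repo_dir"

# Run from inside the work tree so relative paths resolve as expected.
cd "$project_dir"
overlay_git add setup-env.sh
overlay_git commit -q -m "Add setup script"
overlay_git log --oneline
```

Nothing is written into the project directory apart from the tracked files themselves, so the upstream checkout and the overlay history stay separate. `overlay_git status` will list every upstream file as untracked unless that is tuned, for example with the `status.showUntrackedFiles` setting.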
