How to combine two git repositories such that some folders are present only in one of them?

Question

I am working on a little web app for organizing lecture notes. The app and some dummy preview content are hosted on Gitlab and accessible via Gitlab Pages. It looks like this:

project-name/web <- the actual code
project-name/tex <- dummy content

On my local machine there is proper content, as well as further content folders, all of which is untracked and thus not present in the Gitlab repo, because those are lecture notes which shouldn't be publicly accessible. It looks like this:

project-name/web
project-name/tex <- dummy and proper content
project-name/folder1 <- further content
project-name/folder2 <- further content

Now I would like to host the app with proper content on my Raspi (using nginx). I created a (bare) git repo on the raspi, added the complete project files including the proper content (all the folders) to that repo and set up a git hook to deploy it to the nginx server, that is, copy the files into /var/www/html and run some PHP script which is also necessary.

But now I have two repos, Gitlab and Raspi, and would need to make every code change twice. I researched how to combine the two repos and got the hint that it might be possible to add the "web" folder, which is common to both repos, as a submodule of the Raspi repo, then make the code changes in the Gitlab repo and pull them into the Raspi repo's submodule. But it doesn't quite work because "web" is a subfolder of the Gitlab repo rather than the whole repo. So people pointed me to sparse commits to pick out only one subfolder, but this preserves the folder structure and thus also doesn't work properly.

I'm not very experienced with git and only know the very basic commands. Those submodule and sparse commit things seem rather involved to me, and I cannot judge whether they are suited to solving the problem.

I am pretty sure that my scenario is not uncommon but I still failed to find a fitting solution, so any hint to some reading is greatly appreciated!

score 3 · Answer 1 · answered Apr 11 '20 at 18:34

Git doesn't store folders.

In one sense, Git doesn't even store files. What Git stores—at the level you'll use it, anyway—is a big database of commits,¹ plus a smaller database of names. It's the commits that store the files. This might seem like a niggling difference, but it's really the difference, and the key to the whole thing.

Combining two Git repositories consists of taking all the commits in both original repositories and putting them into one big combined pile. Constructing the desired set of names for the resulting enlarged database is usually the main problem, but you're skipping right past that to a second problem of your own invention. As we'll see towards the end, this might not be what you want after all.

Anyway, the first thing you'll need to know here is what a commit is and does, since that's the level at which you can actually use Git itself. Let's start with the simple but annoying fact that each commit has a unique hash ID, a big ugly string of letters-and-numbers like 9fadedd637b312089337d73c3ed8447e9f0aa775. This is, in effect, the true name of the commit: it's how Git finds the object in its big database.

Each commit stores some set of files: no folders, just files. The files stored with a commit (the commit's main data, as it were) are in a special, read-only, Git-only, compressed format.² The commit and its files are frozen for all time, so in order to use them or change them, Git has to extract them (which we'll get to in a moment). These form a snapshot in time, as it were: Your files looked like this, at the time you made this commit.

Besides this snapshot, each commit also holds some metadata, such as who made it, when, and why. Most of this metadata is for human consumption, but one part is for Git itself: every commit stores a list of the raw hash IDs of its immediate parent commits. Most commits have exactly one parent. When we have single-parent commits like this, they form a backwards-looking chain of commits:

... <-F <-G <-H

This chain eventually ends (on the right, here) with whatever the last (most recent) commit was. It has some big ugly hash ID, but I've just used the letter H to stand in for that hash ID. The commit is in Git's big database, retrievable by that hash ID. Inside the commit is the hash ID of its parent G, so given commit H, Git can find and retrieve G. G of course has parent F, so now Git can get F, which has a parent, and so on. This goes back through time, eventually to the very first commit, which—being first—simply has no parent.
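If you want to see the hash ID and the parent link for yourself, here is a minimal sketch. It builds a throwaway repository; the directory name, file name, commit messages and the "Demo" identity are all made up for illustration.

```shell
set -e
cd "$(mktemp -d)"                          # scratch directory, safe to delete
git init -q demo && cd demo
git config user.name "Demo"                # placeholder identity for this sketch
git config user.email "demo@example.com"

echo "one" > notes.txt
git add notes.txt && git commit -q -m "first"
echo "two" >> notes.txt
git commit -q -am "second"

tip=$(git rev-parse HEAD)                  # full hash ID of the last commit
parent=$(git rev-parse HEAD~1)             # hash ID of the commit before it

# Raw commit object: a tree line (the snapshot), a parent line, then metadata.
git cat-file -p "$tip"

# The parent hash recorded inside the tip commit:
stored_parent=$(git cat-file -p "$tip" | sed -n 's/^parent //p')
# The very first commit records no parent line at all:
root_parent_lines=$(git cat-file -p "$parent" | grep -c '^parent ' || true)
```

The hash IDs you get will differ from run to run, since they depend on the timestamps stored in each commit.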

A branch name simply holds the (single) hash ID of the last commit. So if there are just the eight commits A through H in this repository, and only one branch name master, we have:

A--...--G--H   <-- master

as the entire repository. Each of these eight commits has its snapshot of all files. Git will show you what changed, between any pair of commits, by extracting, into a temporary area (in memory), the commit and its parent and seeing which files are the same—about which Git will say nothing at all—and which are different. For those that are different, Git will give you a recipe by which you can modify the earlier commit to turn it into the later one.
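As a rough illustration of that comparison, the sketch below makes two commits in a scratch repository (file names and contents are invented) and asks Git which files differ between the parent and the child:

```shell
set -e
cd "$(mktemp -d)"                          # scratch directory, safe to delete
git init -q demo && cd demo
git config user.name "Demo"                # placeholder identity for this sketch
git config user.email "demo@example.com"

echo "unchanged" > same.txt
echo "old line" > changed.txt
git add . && git commit -q -m "G"
echo "a newer, longer line" > changed.txt
git commit -q -am "H"

# Only files that differ between the two snapshots are mentioned.
changed=$(git diff --name-only HEAD~1 HEAD)
echo "$changed"

# The full recipe for turning the parent's snapshot into the child's.
git diff HEAD~1 HEAD
```

Note that same.txt is stored in both snapshots, yet the diff says nothing about it.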

To add a new commit, you will:

1. Have Git extract the last commit of the branch, into a working area: this is your working tree or work-tree. Git also puts copies of the frozen-format, compressed-and-Git-ified files into Git's index at this point.³ This last commit is now the current commit, and the branch name you used (the master in git checkout master, for instance) is the current branch.
2. Fuss with the work-tree copies however you like.
3. Use git add to copy updated work-tree files back into Git's index.
4. Run git commit. This collects some metadata from you and your settings, the current date and time, and so on; uses the current commit as the parent for the new commit; uses whatever is in Git's index right now as the new frozen-for-all-time files, and writes out a new commit. The writing of the new commit gives it its new unique hash ID.

Git now stores the new commit's hash ID into the current branch name. So where master used to point to H, it now points to a new commit we'll call I, which points back to H:

...--G--H--I   <-- master

This is how branches grow.
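The edit, add, commit cycle and the way the branch name moves can be sketched like this. It is only an illustration in a scratch repository: file.txt and the one-letter commit messages are assumptions, not anything from your project.

```shell
set -e
cd "$(mktemp -d)"                          # scratch directory, safe to delete
git init -q demo && cd demo
git symbolic-ref HEAD refs/heads/master    # make sure the first branch is named master
git config user.name "Demo"                # placeholder identity for this sketch
git config user.email "demo@example.com"

echo "version 1" > file.txt
git add file.txt && git commit -q -m "H"
before=$(git rev-parse master)             # master points to H

echo "version 2, edited" > file.txt        # fuss with the work-tree copy
git add file.txt                           # copy the updated file into the index
git commit -q -m "I"                       # freeze the index as a new commit

after=$(git rev-parse master)              # master now points to I
parent_of_after=$(git rev-parse master~1)  # and I points back to H
```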

Note that I has a full snapshot of every file, just like H did. These are the files you'll get in your work-tree later, if you check out commit I.

¹Technically, this is Git's object database and you may also interact directly with tag objects sometimes, if you use annotated tags.

²Technically, what Git is storing in the commit is the hash ID of a tree object. Tree objects have entries, with each entry giving a file's name or part of it, its mode, and the hash ID of the blob object holding the file's content. Tree objects could allow Git to store folders, but Git builds and uses these tree objects through Git's index, which only allows file entries, so that Git winds up never storing a folder.

³The index, mentioned in footnote 2, is how Git builds the next commit. It has some additional uses and we won't go into detail here. It doesn't literally store copies of files: it stores mode, file-name (a full path such as path/to/file), and Git blob-object hash IDs. At this level, though, you can just think of the index as holding a copy of the file in the frozen format, ready to go into the next commit.
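To peek at the index entries described in these footnotes, something like the following works. The folder and file names are invented; the point is only that an empty folder leaves no trace in the index or in the commit.

```shell
set -e
cd "$(mktemp -d)"                          # scratch directory, safe to delete
git init -q demo && cd demo
git config user.name "Demo"                # placeholder identity for this sketch
git config user.email "demo@example.com"

mkdir -p web empty-folder
echo "<html></html>" > web/index.html
git add .                                  # empty-folder contributes nothing
git commit -q -m "snapshot"

# Index entries: mode, blob hash ID, stage number, and the full path.
git ls-files -s

tracked=$(git ls-files)                    # paths in the index
in_commit=$(git ls-tree -r --name-only HEAD)   # file paths in the commit, listed recursively
```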

Combining repositories

If you want to combine two repositories into one big one, you:

1. Probably, start by cloning one of the two repositories, so that you're working with a copy in case you mess up. This gets you a copy of all the commits. Being a clone, this copy has its own branch names: the original's branch names have all been renamed and are now origin/master, origin/dev, etc., instead of master and dev and so on.

   The cloning process takes a name (git clone -b branch) as the name it should create for you. If you don't give it one, it asks the origin Git which branch it recommends. Usually it recommends master. So your clone usually ends up with a master branch, which your Git sets up to point to the same commit that your Git set your origin/master to, based on their master.

   (Look back at the drawings above, and see how this makes your master equal to their master.)

2. Have Git add all the commits from the second repository into this copy. As before, have your Git rename all their branches. We'll see how this works in a moment.

Branch names, and all Git's other name-to-hash-ID mapping entries, make up the other database in a Git repository. We saw above how a branch name selects the last commit in a chain of commits, and how cloning renames the other Git's branch names. These origin/* names are remote-tracking names,⁴ which simply remember where the other Git's branch names pointed, the last time your Git talked to that other Git and got a list of the commits to which its branch names pointed.

To get commits from another Git, you need a URL (or sometimes, a path name on your computer, but we'll just pretend that's a URL here). When you clone a Git repository, you give Git a URL: git clone ssh://git@github.com/user/repo for instance. Your Git:

1. makes a new, empty directory (usually; you can point it to an existing empty directory) and enters that directory for the rest of these steps;
2. git init: makes a new, empty Git repository here;
3. git remote add ...: adds a remote name, by default origin, storing the URL;
4. does any extra configuration you ask for;
5. runs git fetch on the new remote; and
6. last, runs git checkout to create and check out master or whatever name you chose.
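These six steps can be approximated by hand. The sketch below is not what git clone literally runs internally; it just performs the same sequence with ordinary commands. A local directory named upstream (made up here) stands in for the URL.

```shell
set -e
work=$(mktemp -d) && cd "$work"            # scratch directory, safe to delete

# A stand-in for the repository being cloned (a local path, not a real URL).
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/master
git -C upstream config user.name "Demo"
git -C upstream config user.email "demo@example.com"
echo "hello" > upstream/README
git -C upstream add README
git -C upstream commit -q -m "initial"

mkdir copy && cd copy                      # 1. new, empty directory
git init -q                                # 2. new, empty repository
git remote add origin "$work/upstream"     # 3. store the URL under the name origin
                                           # 4. (no extra configuration in this sketch)
git fetch -q origin                        # 5. copy their commits, create origin/* names
git checkout -q master                     # 6. create master from origin/master, check it out

mine=$(git rev-parse master)
theirs=$(git rev-parse origin/master)
```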

Step 5 has your Git call up the other Git, using the stored URL. The other Git hands over any commits it has that your Git doesn't have—which is all of their commits—after listing out all their branch names and the tip commit hash IDs (and tag names and other names but we'll ignore this complication here).

It's this step that copies over all their commits and creates or updates your remote-tracking names. So if we want to add all the commits from another Git, we just need to run:

git remote add <name> <url>

You pick some name—second, another, whatever you like—and the URL. Your Git adds a new remote, storing this URL. Then you can run:

git fetch <name>

This has your Git call up the other Git. They list out their branch names (and other names that we're ignoring) and last commit hashes, and your Git asks for those commits and every other commit those commits have as parents, recursively, all the way back to the very first commit in that repository.

Let's say you used the name two for this second Git. You now have remote-tracking names of the form two/*, such as two/master and two/develop and so on, to find the last commits in each of the various branch names from that Git.
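Here is a small sketch of that, using two unrelated scratch repositories. The helper function make_repo, the directory names first and second, and the file names are all invented for the example; in your case the second argument to git remote add would be the real URL of the other repository.

```shell
set -e
work=$(mktemp -d) && cd "$work"            # scratch directory, safe to delete

# Helper invented for this sketch: a repository with one commit containing one file.
make_repo() {
    git init -q "$1"
    git -C "$1" symbolic-ref HEAD refs/heads/master
    git -C "$1" config user.name "Demo"
    git -C "$1" config user.email "demo@example.com"
    echo "content from $1" > "$1/$2"
    git -C "$1" add "$2"
    git -C "$1" commit -q -m "initial commit of $1"
}
make_repo first  web.txt
make_repo second tex.txt

cd first
git remote add two "$work/second"          # "two" is just the name chosen for this remote
git fetch -q two                           # bring in every commit reachable from their branches

names=$(git branch -r)                     # the remote-tracking names we now have
fetched=$(git rev-parse two/master)
original=$(git -C "$work/second" rev-parse master)
```

After the fetch, first contains the commits of both repositories, but its own master has not moved at all.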

It is now up to you to make new commits that combine whatever files you like from each of these two repositories.

⁴Git calls these remote-tracking branch names, which people often shorten to remote-tracking branches. However, they're not branch names at all, in that if you give them to git checkout or git switch, you end up in what Git calls detached HEAD mode: not on a branch. I find it's less confusing to just call them remote-tracking names: they track the remote's branch names for you, so they're names, and they do the remote-tracking thing, so that's what we should call them.

Interlude

Note that the commits in a repository are the history. There is no file history because there aren't really any files. There are just commits, which store snapshots and have linkage. Later commits point back to earlier commits. The history exists because later commits point back to earlier commits. Git can start at the ends, and work backwards, and that's the history.

Names find commits. Each name finds one specific commit. If you work backwards from there, you get history. If you just stay there, well, then you have a commit, and the commit has files, and you can extract the files and work with them.

Making a combining-commit

Given two branch tips like this:

...--o--J   <-- branch1

...--o--L   <-- branch2

you can pick one of these two commits, such as J, by its branch name—git checkout branch1—and run git merge branch2.

Ideally, these two branches actually start from a common starting point: a shared commit, that's on both branches. That is, this really looks like:

          I--J   <-- branch1 (HEAD)
         /
...--G--H
         \
          K--L   <-- branch2

where commit H is the obvious best-common-shared-commit on both branches.

The HEAD I drew in here is how Git remembers which branch name you did a git checkout on: Git attaches the special name HEAD to just one branch. That's the one that Git extracted to Git's index and your work-tree, too, i.e., those are the files you can actually see and work with right now, from commit J. That one name, HEAD, provides both the current branch name and—indirectly, by the branch name pointing to a commit—the current commit.

You now run:

git merge branch2

and Git locates commit L, which branch2 points to. The merge code now works backwards from both of these commits, J and L, to find commit H on its own. This commit H is the merge base of the two branches.

To accomplish the merging action—the merge as a verb, as I like to call it—Git now runs two comparisons, starting with the snapshot in commit H both times. The git diff command lets us run the same comparison and hence think about what Git sees:

git diff --find-renames hash-of-H hash-of-J finds what we changed on branch1;
git diff --find-renames hash-of-H hash-of-L finds what they changed on branch2.
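You can run the same lookups yourself. In this sketch (scratch repository, invented file names), git merge-base finds H and the two diffs list what each side changed since H:

```shell
set -e
cd "$(mktemp -d)"                          # scratch directory, safe to delete
git init -q demo && cd demo
git symbolic-ref HEAD refs/heads/branch1   # name the first branch branch1
git config user.name "Demo"                # placeholder identity for this sketch
git config user.email "demo@example.com"

printf 'a\nb\nc\n' > shared.txt
git add shared.txt && git commit -q -m "H"
h=$(git rev-parse HEAD)
git branch branch2                         # both names point to H for now

echo "ours" > ours.txt                     # our side adds one file
git add ours.txt && git commit -q -m "J"

git checkout -q branch2
echo "theirs" > theirs.txt                 # their side adds a different file
git add theirs.txt && git commit -q -m "L"
git checkout -q branch1

base=$(git merge-base branch1 branch2)     # Git works backwards from J and L to H
ours_changes=$(git diff --find-renames --name-only "$base" branch1)
theirs_changes=$(git diff --find-renames --name-only "$base" branch2)
```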

The merge action now combines the two sets of changes. Whatever we did to a file in H, Git can do that again, and also add to it whatever they did to the same file in H. Doing that for every file, and doing any whole-file changes—like adding an entirely-new file, if we or they did that—modifies the snapshot in H into a new snapshot, ready to go.

If all of that goes well, Git will now make a new merge commit, which we can draw as commit M:

          I--J
         /    \
...--G--H      M   <-- branch1 (HEAD)
         \    /
          K--L   <-- branch2

Git adjusts the name branch1 as usual, to point to the new merge commit M, which has a snapshot as usual. The only thing that's not "as usual" is that new commit M has two parents, J and L.

This means that if we try to look at M to see what changed, the usual trick—compare M vs its parent—doesn't work. There is not a parent; there are parents, plural. What Git does for this depends on what command you use to look at M, but a lot of the time, it just gives up and doesn't show any diff at all! It's often hard to see past a merge. Technically, a merge can have more than two parents, too.
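One way to look at a merge commit is to name the parent you want to compare against explicitly. The sketch below rebuilds the small two-branch history in a scratch repository (file names are invented), merges, and then inspects M:

```shell
set -e
cd "$(mktemp -d)"                          # scratch directory, safe to delete
git init -q demo && cd demo
git symbolic-ref HEAD refs/heads/branch1
git config user.name "Demo"                # placeholder identity for this sketch
git config user.email "demo@example.com"

echo "base" > shared.txt
git add shared.txt && git commit -q -m "H"
git branch branch2
echo "ours" > ours.txt
git add ours.txt && git commit -q -m "J"
j=$(git rev-parse HEAD)
git checkout -q branch2
echo "theirs" > theirs.txt
git add theirs.txt && git commit -q -m "L"
l=$(git rev-parse HEAD)
git checkout -q branch1

git merge -q --no-edit branch2             # makes merge commit M on branch1

# rev-list --parents prints the commit followed by each of its parents.
parent_count=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))
first_parent=$(git rev-parse HEAD^1)       # J, the commit we were on
second_parent=$(git rev-parse HEAD^2)      # L, the commit we merged in

vs_first=$(git diff --name-only HEAD^1 HEAD)    # what the merge added relative to J
vs_second=$(git diff --name-only HEAD^2 HEAD)   # what the merge added relative to L
```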

When traversing history, Git will generally either go down one "leg" or "side" of the merge, or down all of them. Again, we won't get into all the details here: it gets kind of complicated, very fast. A simple git log, though, will go down both legs, in some order, one commit at a time.

Anyway, the real point here is that merge commit M ties two histories back into one. From branch1, we visit commit M; then commits J and L and I and K, in some order. Usually we hit all those before we go back to commit H, where things simplify, and then we go on to visit commit G, F, etc., as usual. So all these commits are now on branch1. We don't even need the name branch2 any more: it identifies commit L, but M reaches L if we go down its second leg. We can delete the branch2 name if we want, now.⁵

⁵If we don't delete branch2, we can make more commits on branch2, and those won't be on branch1. Later, we can then git checkout branch1 and git merge branch2 again. This time the best shared commit will turn out to be commit L. This is how long-running, repeated-merge operations work: merges change the set of reachable commits on one branch, which makes future merges into that branch work better. At least, we hope it's better: sometimes it's just different.

Your case is a little bit different

You might at this point want to use:

git checkout master
git merge two/master

for instance, to make a combining commit. But in modern Git, you'll get an error:

fatal: refusing to merge unrelated histories

The problem here is that there is no shared commit. Versions of Git before 2.9 do, or at least try, the merge anyway, using a fake commit with no files in it: Git's empty tree.

You can enable this yourself, as if you had an old Git:

git merge --allow-unrelated-histories two/master

Git will now use the fake empty commit as the common starting point. Every file in both branch-tip commits will be "newly added". If all the file names are different, the merge will succeed on its own, by putting all the files into the new commit.

If this isn't what you want (and it isn't), you might want to make sure that Git doesn't make the commit on its own, by using:

git merge --allow-unrelated-histories --no-commit two/master

This ensures that Git stops, with the merge incomplete, as it would if anything went wrong with Git combining the two commits on its own.
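Put together, the refusal and the stop-before-committing merge look roughly like this. The sketch assumes Git 2.9 or later and uses two scratch repositories; make_repo, the directory names and the file names are invented for the example.

```shell
set -e
work=$(mktemp -d) && cd "$work"            # scratch directory, safe to delete

# Helper invented for this sketch: a repository with one commit containing one file.
make_repo() {
    git init -q "$1"
    git -C "$1" symbolic-ref HEAD refs/heads/master
    git -C "$1" config user.name "Demo"
    git -C "$1" config user.email "demo@example.com"
    echo "content from $1" > "$1/$2"
    git -C "$1" add "$2"
    git -C "$1" commit -q -m "initial commit of $1"
}
make_repo first  web.txt
make_repo second tex.txt

cd first
git remote add two "$work/second"
git fetch -q two

# A plain merge is refused: the two histories share no commit.
if git merge --no-edit two/master 2>/dev/null; then refused=no; else refused=yes; fi

# Allow it, but stop before committing so the result can be inspected first.
git merge -q --allow-unrelated-histories --no-commit two/master
staged=$(git diff --cached --name-only)    # what the merge would add to our snapshot
git status --short

git commit -q -m "Merge history of second repository"
parent_count=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))
```

Between the merge and the final git commit is where you would add, remove, or edit files until the index holds the snapshot you actually want.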

If any file names collide, though, you'll get an "add/add conflict" anyway, and Git will stop. The problem here is that Git doesn't know which file to use. Should it use the one from your current commit as selected via HEAD / master? Or should it use the one from the other commit as selected via two/master?

Your job is now to provide the correct set of files for the merge snapshot. You do this in both your work-tree, where you can see and work with files, and in Git's index (which you can't really see very well: git status tells you what's different in Git's index, rather than what's in Git's index, so it's comparing the index copies of files to other copies).

You may want to git rm or git rm --cached some specific files from Git's index (we won't worry about this here), but mostly you'll want to fix up the work-tree copies, and then just git add the work-tree copies to have Git copy the correct files into its index. As you do, Git will mark each conflicted file as resolved: git status will move them out of the special (merge-only) conflicted section.

You should know that git status tells you what will be committed ("staged for commit") by:

comparing the current (HEAD) commit's frozen files to the frozen-format files in the index
for each file that is the same, say nothing at all
for each file that is different, mention the file's name

So if HEAD is master, which is also origin/master, you can find out which files those are by looking at the other clone you have (the one that is just of your first original repository) and seeing which files are checked out there.

Once all merge conflicts are resolved, git status also tells you what's in your work-tree that is different from what's in Git's index. These are the changes not staged for commit.
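The two comparisons that git status summarizes can also be run directly with git diff. A minimal sketch, with invented file names, in a scratch repository:

```shell
set -e
cd "$(mktemp -d)"                          # scratch directory, safe to delete
git init -q demo && cd demo
git config user.name "Demo"                # placeholder identity for this sketch
git config user.email "demo@example.com"

echo "v1" > a.txt
echo "v1" > b.txt
git add . && git commit -q -m "base"

echo "a, second version" > a.txt
git add a.txt                              # index now differs from HEAD for a.txt
echo "b, second version" > b.txt           # work-tree differs from index for b.txt

staged=$(git diff --cached --name-only)    # HEAD vs index: staged for commit
unstaged=$(git diff --name-only)           # index vs work-tree: not staged for commit
git status --short
```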

To finish the merge and make a new merge commit that ties the two histories together, you now need only run:

git merge --continue

or:

git commit

(the merge --continue just checks that there is a merge to finish, then runs git commit, so these do the same thing in this case).
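The whole conflict, resolve, finish cycle can be sketched as follows. Both scratch repositories deliberately contain a file with the same name; make_repo and all names here are invented, and keeping "our" version is just one possible resolution.

```shell
set -e
work=$(mktemp -d) && cd "$work"            # scratch directory, safe to delete

# Helper invented for this sketch: a repository with one commit containing one file.
make_repo() {
    git init -q "$1"
    git -C "$1" symbolic-ref HEAD refs/heads/master
    git -C "$1" config user.name "Demo"
    git -C "$1" config user.email "demo@example.com"
    echo "content from $1" > "$1/$2"
    git -C "$1" add "$2"
    git -C "$1" commit -q -m "initial commit of $1"
}
make_repo first  README
make_repo second README                    # same path, different content

cd first
git remote add two "$work/second"
git fetch -q two

# The merge stops: both sides added README, so Git cannot choose.
git merge --allow-unrelated-histories --no-edit two/master || true
conflict=$(git status --porcelain -- README)   # "AA" means added by both sides

git checkout --ours -- README              # keep our version (or edit the file by hand)
git add README                             # marks the conflict as resolved
git commit -q -m "Merge second repository, keeping our README"

content=$(cat README)
parent_count=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))
```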

The files that go in the new merge commit snapshot are those in Git's index at this time. So all this work is just to put the right files into the index. That's what this is all about. Git stores commits, not files; commits contain files, as a snapshot, made from whatever is in Git's index; the commands you use manipulate the index, and make new commits.

You don't have to combine repositories if you don't want to

If all you want is to get a bunch of files from somewhere, and add them to a new commit in some existing or new clone, just do whatever it takes to get the files. Clone a repository if desired, or switch over to the existing clone. Use any commands you like to copy files into place. Use git add to copy those files into Git's index, where they have path names like folder1/file, because in your work-tree, you have a folder1 containing a file named file.

Once the index has the right set of files in it, run git commit to make a new commit on the current branch. Git will collect up the metadata, write out the new commit with the new snapshot, and store the new commit's hash ID in the current branch name. The new commit will point back to the previous commit. That's what Git is all about: adding new commits. We find them by branch names; we compare them by git diff-ing; we do other fancier Git commands that do other things with them. But it's the commits that matter.
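For a layout like the one in the question, that could look roughly like this. The directories public and private and the files in them are stand-ins invented for the sketch; a plain cp is used only because it is the simplest way to get files into place.

```shell
set -e
work=$(mktemp -d) && cd "$work"            # scratch directory, safe to delete

# Stand-in for a checkout of the public project (illustrative names only).
mkdir -p public/web
echo "app code" > public/web/index.php

# Stand-in for the private repository that holds the real content.
git init -q private && cd private
git config user.name "Demo"                # placeholder identity for this sketch
git config user.email "demo@example.com"
mkdir tex
echo "real notes" > tex/lecture1.tex
git add . && git commit -q -m "Proper content"

cp -R "$work/public/web" .                 # get the files by any means you like
git add web                                # the index now has the path web/index.php
git commit -q -m "Update web code from the public project"

paths=$(git ls-tree -r --name-only HEAD)   # every file path in the new snapshot
```

This keeps the two histories completely separate: the private repository simply gains an ordinary commit each time you copy the code over.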

It's the commits that matter

Note that because it is the commits that matter, you can, if you like, use git merge to tie two histories together, without worrying about the snapshot in the merge. You can then make a second commit that fixes things that were wrong with the merge.

For instance, if Git can merge two otherwise unrelated histories on its own (perhaps with --allow-unrelated-histories) but this saves too many files, so what? You can let Git do that, then remove the unwanted files and make a second commit.
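A sketch of that merge-first, clean-up-later approach, again with scratch repositories and invented names (make_repo, first, second, web.txt, tex.txt), assuming Git 2.9 or later:

```shell
set -e
work=$(mktemp -d) && cd "$work"            # scratch directory, safe to delete

# Helper invented for this sketch: a repository with one commit containing one file.
make_repo() {
    git init -q "$1"
    git -C "$1" symbolic-ref HEAD refs/heads/master
    git -C "$1" config user.name "Demo"
    git -C "$1" config user.email "demo@example.com"
    echo "content from $1" > "$1/$2"
    git -C "$1" add "$2"
    git -C "$1" commit -q -m "initial commit of $1"
}
make_repo first  web.txt
make_repo second tex.txt

cd first
git remote add two "$work/second"
git fetch -q two

# Let Git make the merge on its own; its snapshot has every file from both sides.
git merge -q --allow-unrelated-histories --no-edit two/master

git rm -q tex.txt                          # drop what should not be in the snapshot
git commit -q -m "Remove files that should not be here"

merge_files=$(git ls-tree -r --name-only HEAD~1)   # snapshot of the merge commit
tip_files=$(git ls-tree -r --name-only HEAD)       # snapshot of the clean-up commit
```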

Git commits share their files. Every commit is totally read-only, frozen for all time. You either have the commit or you don't, and if you do have the commit, it has all its files. If its files match those of a previous commit, Git knows that it's safe to share the files across both commits. There's only one actual frozen copy.

So, if you take two different repositories and combine their commits into one repository, you have all of the commits and all of the files already. Making a merge commit that, if you check it out, gets you too many files, takes no extra space—well, just a tiny bit of space for the merge commit itself. A subsequent commit where you remove a bunch of files takes a tiny bit of space, to record the new commit that says to re-use only some subset of files.
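You can check the sharing directly: an unchanged file has the same blob hash ID in both commits, which means there is only one stored copy. A small sketch with invented file names:

```shell
set -e
cd "$(mktemp -d)"                          # scratch directory, safe to delete
git init -q demo && cd demo
git config user.name "Demo"                # placeholder identity for this sketch
git config user.email "demo@example.com"

echo "content worth keeping" > keep.txt
echo "unwanted" > drop.txt
git add . && git commit -q -m "too many files"

git rm -q drop.txt
git commit -q -m "remove the extra file"

# commit:path names the blob object that a commit uses for that path.
blob_before=$(git rev-parse HEAD~1:keep.txt)
blob_after=$(git rev-parse HEAD:keep.txt)
```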

Checking out the commit that comes after the merge extracts, into your work area, only those files that are in that commit—so you won't see the extra files anyway. They will be in your history, but they will be there whether or not they're in your merge.

The choice is yours: Git will store whatever you tell it to. You'll have the commits that you have, whatever those are, and you cannot change any existing commit, but you can choose which one is your last commit. You can even make a new history that consists of one commit with just the right files:

...--W--X--Y   <-- master

 Z   <-- new-history (HEAD)

where Z has no parent. If you now delete all the names that find all the other commits, such as master:

git branch -D master

giving:

...--W--X--Y   ??? [can't find Y any more!]

 Z   <-- new-history (HEAD)

Git will eventually drop all the other commits.

To make this go faster, git clone this repository; your clone won't have an origin/master, just an origin/new-history. You can call that master now in the new clone, which consists of just one commit with the right files. Its history cannot be related to the original repository's history, though.

To achieve this state, if you want it, see git checkout --orphan. You can run:

git checkout master
git checkout --orphan new-history
git commit

and you will get this new Z commit with no parent, with the same snapshot Git has as the current tip commit of master. The index didn't change: git checkout master filled it in, but git checkout --orphan new-history doesn't empty it.
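To convince yourself that Z really has no parent but the same snapshot, you could try something like this in a scratch repository (file name and commit messages are invented):

```shell
set -e
cd "$(mktemp -d)"                          # scratch directory, safe to delete
git init -q demo && cd demo
git symbolic-ref HEAD refs/heads/master    # make sure the first branch is named master
git config user.name "Demo"                # placeholder identity for this sketch
git config user.email "demo@example.com"

echo "v1" > file.txt
git add file.txt && git commit -q -m "X"
echo "v2, edited" > file.txt
git commit -q -am "Y"

git checkout -q --orphan new-history       # new branch name, index left as it was
git commit -q -m "Z"

# rev-list --parents prints the commit followed by its parents: here, none.
parent_count=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))
history_length=$(git rev-list --count new-history)
z_tree=$(git rev-parse "HEAD^{tree}")      # snapshot stored by Z
y_tree=$(git rev-parse "master^{tree}")    # snapshot stored by the tip of master
```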

This usually isn't the right thing to do, but if you understand how and why this works, you now get a lot of what Git is about.

I am seriously overwhelmed by this answer... Started reading it a few times but didn't finish. :) I think, I need more time for this. But looks like I will learn a lot by reading it, thanks a ton! — Photon, Apr 13 '20 at 20:15