
I'm attempting to manage an open source project on GitHub with a hybrid public/private workflow similar to the one described here: https://stackoverflow.com/a/30352360/204023

Essentially this describes a process where there are two repositories which mirror each other, without GitHub's fork relationship. This allows you to use standard git remote repositories to push/pull changes between branches, and public GitHub pull requests to merge private changes into the master branch. Exactly what I'm trying to accomplish.

I have one extra requirement where I would like to truncate the PUBLIC commit history in case it contains sensitive data, while maintaining the PRIVATE commit history.

Initializing the new project with --depth 1 turns out not to work. You can't populate a new repository by pushing from a shallow clone, because the remote rejects the push: ! [remote rejected] master -> master (shallow update not allowed)

The solutions I've found for truncating the commit history involve creating a brand new repository, but with a new copy of the repo I can no longer push/pull between public/private copies.

Winder

2 Answers


The history in a Git repository is the commits. The commits contain both the files, and the linkage: each commit has a complete snapshot of all files, plus the hash ID(s) of its parent(s). Each branch name stores the hash ID (singular) of the latest commit, and Git finds the history by starting from the end and working backwards, one parent at a time.

Since the hash ID of each commit is a cryptographic checksum of the contents of that commit—including the parent hash—the hash ID of the last commit depends on the hash IDs of every commit in the history formed by walking backwards from that commit through every other reachable commit. (In technical terms this is a form of Merkle tree.)
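To see the dependence on the parent hash directly, here is a minimal sketch using Git's plumbing commands in a throwaway repository. The file name, commit messages, identity, and fixed date are made up for the demo. The two child commits are identical in every field except the parent, yet they should get different hash IDs.

```shell
#!/bin/sh
set -eu

# Throwaway repository; every name below is illustrative.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Pin identity and dates so the parent is the only field that differs.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_AUTHOR_DATE='1558440000 +0000' GIT_COMMITTER_DATE='1558440000 +0000'

# One snapshot (tree) holding a single file.
blob=$(printf 'hello\n' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tREADME\n' "$blob" | git mktree)

# Two different root commits (no parents).
root_a=$(git commit-tree "$tree" -m 'root A')
root_b=$(git commit-tree "$tree" -m 'root B')

# Same tree, message, author, and date; only the parent differs.
child_a=$(git commit-tree "$tree" -p "$root_a" -m 'child')
child_b=$(git commit-tree "$tree" -p "$root_b" -m 'child')

echo "child of A: $child_a"
echo "child of B: $child_b"
```

Because the parent hash is part of what gets hashed, rewriting or dropping any ancestor changes the ID of every commit that descends from it.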

The implication of all of this is that it's possible to keep a shorter version of the repository DAG public, and a longer version (shorter plus added commits) as the private one, but it's not possible to have a public version that omits some historical commits while keeping others. You can also, or instead, use parallel graphs, i.e., independent DAGs: one that contains the public history, and one that contains the private history. If you do this through Git submodules, you can be reasonably sure of not releasing the private information, but that does impose a strong structural constraint: the public stuff must all be a subdirectory.
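As a rough illustration of the parallel-graph option (not the submodule layout), the sketch below builds a private history, then creates a public branch whose single root commit reuses the latest private snapshot but has no parent. The branch names, file names, and messages are assumptions for the demo.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/private   # name the initial branch explicitly
git config user.name demo
git config user.email demo@example.com

# Private history: two commits, the first of which contains something sensitive.
echo 'password=hunter2' > secrets.txt
echo 'v1' > app.txt
git add .
git commit -q -m 'private: initial import'
git rm -q secrets.txt
echo 'v2' > app.txt
git add .
git commit -q -m 'private: remove secrets'

# Public history: a new root commit with the current snapshot and no parent.
private_tree=$(git rev-parse 'private^{tree}')
public_root=$(git commit-tree "$private_tree" -m 'public: initial snapshot')
git branch public "$public_root"

public_tree=$(git rev-parse 'public^{tree}')
public_count=$(git rev-list --count public)

# The two branches hold the same files but share no commits.
if git merge-base private public >/dev/null 2>&1; then shared=yes; else shared=no; fi
echo "commits on public: $public_count, shared ancestor: $shared"
```

Pushing only the public branch to the public remote sends only the objects reachable from it, so the older private commits stay out. The cost is the one described above: with no common ancestor, the two histories cannot be synchronized by ordinary fast-forward pushes and pulls.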

torek
  • Thank you for the thorough explanation. If I were to implement a solution with parallel DAGs, would it be possible to merge changes between them? I'm guessing that would require diffs and patches, which seems pretty unconventional and probably has other drawbacks. It looks like I have to choose between archiving the history and re-creating public/private mirrors, or having a complicated workflow to synchronize the two. – Winder May 21 '19 at 17:49
  • You're correct, you end up stuck with diff-and-patch or similar. The submodule approach works pretty well with some special cases, e.g., some core software that's public along with plugins that are in a separate (private) repo. – torek May 21 '19 at 18:59

I have one extra requirement where I would like to truncate the PUBLIC commit history in case it contains sensitive data, while maintaining the PRIVATE commit history.

What if you work the other way around? Do most of your work in PUBLIC and merge its commits into PRIVATE, where you can process them further as you please. When you want a commit from PRIVATE to go to PUBLIC, use git cherry-pick.
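A minimal sketch of that cherry-pick step, run in a throwaway repository with made-up branch names, file names, and commit messages. It copies one commit from the private branch onto the public branch and leaves the other private commit behind.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/public   # name the initial branch explicitly
git config user.name demo
git config user.email demo@example.com

# Public work happens first.
echo 'v1' > app.txt
git add app.txt
git commit -q -m 'public: add app'

# The private branch starts from public and adds two commits.
git checkout -q -b private
echo 'internal notes' > notes.txt
git add notes.txt
git commit -q -m 'private: internal notes'
echo 'bug fix' > fix.txt
git add fix.txt
git commit -q -m 'fix: shareable bug fix'
fix_commit=$(git rev-parse HEAD)

# Copy only the shareable commit back onto public.
git checkout -q public
git cherry-pick "$fix_commit" >/dev/null
picked_commit=$(git rev-parse HEAD)

echo "files on public: $(git ls-tree --name-only HEAD | tr '\n' ' ')"
```

The cherry-picked commit carries the same change but gets a new hash ID, since its parent differs from the original's. Git will therefore treat the public and private copies of that fix as two distinct commits.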

_

To have a single local Git repository (whose branches can be merged with each other) point to two separate remotes, use a private branch and set its upstream to the private remote.

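# Clone the public repo; its remote is named "origin" by default
git clone git@github.com:USER_NAME/PUBLIC_REPO.git
cd PUBLIC_REPO
# Add the private repo as a second remote
git remote add private-remote git@github.com:USER_NAME/PRIVATE_REPO.git
# Create a branch for private work and set its upstream to the private remote
git checkout -b private-branch
git push -u private-remote private-branch:private-branch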

See my Medium post for a longer explanation.

Muhammad Yasirroni