Creating a GitHub repository with only a subset of a local repository's history

Question

The background: I'm moving closer to open sourcing a personal research code I've been working on for more than two years. It started life as an SVN repository, but I moved to Git about a year ago, and I'd like to share the code on GitHub. It has accumulated a lot of cruft over the years, though, and I'd prefer that the public version begin its life at its current state. I'd still like to keep contributing to it and to incorporate other people's potential contributions.

The question: is there a way to "fork" a Git repository such that no history is retained on the fork (which lives on GitHub), but that my local repository still has a complete history, and I can pull/push to GitHub?

I don't have any experience in the administrating end of large repositories, so detail is very much appreciated.

Ok, I think I've got a good title now. I'm looking forward to the answer to this. If it's feasible, I'm bound to learn some git wizardry. — R. Martinho Fernandes, Apr 20 '11 at 01:35
@Martinho I think [grafts](https://git.wiki.kernel.org/index.php/GraftPoint) are the wizardry you're looking to learn! — Brian Campbell, Apr 20 '11 at 02:06
Actually, looks like I had to learn some Git wizardry to provide the simplest answer to this question. I've learned two new features just in answering it! — Brian Campbell, Apr 20 '11 at 05:52
How does my answer work for you? I'm happy to clarify if I managed to confuse you anywhere. I wrote under the assumption that you know how to create a repo on GitHub and push to it already, but if you don't, I can add that to my answer. — Brian Campbell, Apr 20 '11 at 19:20
@Brian: I just saw your answer: it looks great, but I haven't had time to test it yet and will do so this evening. (I'll probably just clone my repository locally to check it since I'm not quite ready to push the whole thing to GitHub.) Thanks! — Seth Johnson, Apr 20 '11 at 19:34

score 69 · Accepted Answer · edited Apr 16 '20 at 11:52

You can create a new, fresh history quite easily in Git. Let’s say you want your master branch to be the one that you will push to GitHub, and your full history to be stored in old-master. You can just move your master branch to old-master, and then start a fresh new branch with no history using git checkout --orphan:

git branch -m master old-master
git checkout --orphan master
git commit -m "Import clean version of my code"

Now you have a new master branch with no history, which you can push to GitHub. But, as you say, you would like to be able to see all of the old history in your local repository, and would probably prefer that it not be disconnected from the new history.

You can do this using git replace. A replacement ref is a way of specifying an alternate commit any time Git looks at a given commit. So you can tell Git to look at the last commit of your old branch, instead of the first commit of your new branch, when looking at history. For this to work, both the old and the new history must be present in the same repository, which is already the case here because old-master was simply renamed from master.

git replace master old-master

Now you have your new branch, in which you can see all of your history, but the actual commit objects are disconnected from the old history, and so you can push the new commits to GitHub without the old commits coming along. Push your master branch to GitHub, and only the new commits will go to GitHub. But take a look at the history in gitk or git log, and you'll see the full history.

git push github master:master
gitk --all
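
If you'd rather rehearse this before touching your real repository, here is a minimal sketch that replays the steps above in a throwaway repository and compares what you see locally with what is really reachable from master. The scratch directory, the file name code.txt, the commit messages and the demo identity are all made up for illustration; it also assumes a Git new enough to have `rev-list --count` and the global `--no-replace-objects` option.

```shell
#!/bin/sh
# Rehearsal in a scratch repository; every name here is hypothetical.
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

scratch=$(mktemp -d)
cd "$scratch"
git init -q .
git symbolic-ref HEAD refs/heads/master   # make sure the unborn branch is called master

# Fake "old" history: three commits.
for i in 1 2 3; do
    echo "version $i" > code.txt
    git add code.txt
    git commit -q -m "old commit $i"
done

# The steps from the answer.
git branch -m master old-master
git checkout -q --orphan master
git commit -q -m "Import clean version of my code"
git replace master old-master

# History as git log / gitk show it locally (replace refs honoured).
visible=$(git rev-list --count master)
# History that is really reachable from master, ignoring replace refs.
real=$(git --no-replace-objects rev-list --count master)
echo "visible locally: $visible, really reachable from master: $real"
```

The second number is the one that matters for the push: only commits that are really reachable from master get sent.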

Gotchas

If you ever base any new branches on the old commits, you will have to be careful to keep the history separate; otherwise, new commits on those branches will really have the old commits in their history, and so you'll pull the whole history along if you push it up to GitHub. As long as you keep all of your new commits based on your new master, though, you'll be fine.

If you ever run git push --tags github, that will push all of your tags, including old ones, which will cause all of your old history to be pulled along with it. You could deal with this by deleting all of your old tags (git tag -d $(git tag -l)), or by never using git push --tags but only ever pushing tags manually, or by using two repositories as described below.
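
One way to see which tags would drag old history along is to list the tags that are not reachable from the new master once replace refs are ignored. The following is only a sketch in a throwaway repository: the tag names v0.1 and v1.0, the file name and the demo identity are invented, and it assumes a Git recent enough that `git tag` accepts `--no-merged` (2.7 or so, as far as I know).

```shell
#!/bin/sh
# Scratch-repository sketch; tag names and file names are hypothetical.
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

scratch=$(mktemp -d)
cd "$scratch"
git init -q .
git symbolic-ref HEAD refs/heads/master

# Old history with an old tag on its first commit.
echo "version 1" > code.txt
git add code.txt
git commit -q -m "old commit 1"
git tag v0.1
echo "version 2" > code.txt
git commit -q -am "old commit 2"

# Fresh history, stitched to the old one with a replace ref.
git branch -m master old-master
git checkout -q --orphan master
git commit -q -m "Import clean version of my code"
git replace master old-master
git tag v1.0                              # a tag on the clean history

# Tags whose commits are NOT really part of master's history.
risky_tags=$(git --no-replace-objects tag --no-merged master)
echo "tags that would pull old history along: $risky_tags"
```

Anything this prints is a tag you should delete or avoid pushing; tags that sit on the new history do not show up.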

The basic problem underlying both of these gotchas is that if you ever push any ref which connects to any of the old history (other than via the replaced commits), you will push up all of the old history. Probably the best way of avoiding this is by using two repositories, one which contains only the new commits, and one which contains both the old and new history, for the purpose of inspecting the full history. You do all of your work, your committing, your pushing and pulling from GitHub, in the repository with just the new commits; that way, you can't possibly accidentally push your old commits up.

You then pull all of your new commits into your repository that has the full history, whenever you need to look at the entire thing. You can either pull from GitHub or your other local repository, whichever is more convenient. It will be your archive, but to avoid accidentally publishing your old history, you don't ever push to GitHub from it. Here's how you can set it up:

~$ mkdir newrepo
~$ cd newrepo
newrepo$ git init
newrepo$ git pull ~/oldrepo master
# Now newrepo has just the new history; we can set up oldrepo to pull from it
newrepo$ cd ~/oldrepo
oldrepo$ git remote add newrepo ~/newrepo
oldrepo$ git remote update
oldrepo$ git branch --set-upstream-to=newrepo/master master
# ... do work in newrepo, commit, push to GitHub, etc.
# Now if we want to look at the full history in oldrepo:
oldrepo$ git pull
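
To convince yourself that the new repository really ends up without the old commits, you can count commits on both sides. This is a sketch in a scratch directory rather than in ~/oldrepo and ~/newrepo; the directory layout, file name and demo identity are assumptions made for the example.

```shell
#!/bin/sh
# Scratch sketch of the two-repository setup; paths and names are hypothetical.
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

scratch=$(mktemp -d)
mkdir "$scratch/oldrepo" "$scratch/newrepo"

# oldrepo: three old commits, then the orphaned master plus the replace ref.
cd "$scratch/oldrepo"
git init -q .
git symbolic-ref HEAD refs/heads/master
for i in 1 2 3; do
    echo "version $i" > code.txt
    git add code.txt
    git commit -q -m "old commit $i"
done
git branch -m master old-master
git checkout -q --orphan master
git commit -q -m "Import clean version of my code"
git replace master old-master

# newrepo: pull only master, exactly as in the commands above.
cd "$scratch/newrepo"
git init -q .
git pull -q "$scratch/oldrepo" master

new_count=$(git rev-list --all --count)
old_count=$(git -C "$scratch/oldrepo" --no-replace-objects rev-list --all --count)
echo "commits in newrepo: $new_count, commits in oldrepo: $old_count"
```

Replace refs are not used when Git transfers objects, which is why the pull brings over only the clean commit.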

If you're on Git older than 1.7.2

You don't have git checkout --orphan, so you'll have to do it manually by creating a fresh repository from the current revision of your existing repository, and then pulling in your old disconnected history. You can do this with, for example:

oldrepo$ mkdir ~/newrepo
# Export the files of the current revision, keeping the directory layout
oldrepo$ git archive HEAD | tar -xf - -C ~/newrepo
oldrepo$ cd ~/newrepo
newrepo$ git init
newrepo$ git add .
newrepo$ git commit -m "Import clean version of my code"
newrepo$ git fetch ~/oldrepo master:old-master

If you're on Git older than 1.6.5

git replace and replace refs were added in 1.6.5, so you'll have to use an older, somewhat less flexible mechanism known as grafts, which allow you to specify alternate parents for a given commit. Instead of the git replace command, run:

echo $(git rev-parse master) $(git rev-parse old-master) >> .git/info/grafts

This will make it look, locally, as if the master commit has the old-master commit as its parent, so you will see one more commit than you would with git replace.
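
For comparison, newer Git versions (2.1 or so, if I remember correctly) can produce the same "one extra commit" view without the grafts file, via `git replace --graft`. The sketch below tries it in a throwaway repository; it is not part of the original recipe, and the file name, commit messages and demo identity are invented.

```shell
#!/bin/sh
# Scratch sketch of "git replace --graft"; all names are hypothetical.
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

scratch=$(mktemp -d)
cd "$scratch"
git init -q .
git symbolic-ref HEAD refs/heads/master

for i in 1 2 3; do
    echo "version $i" > code.txt
    git add code.txt
    git commit -q -m "old commit $i"
done

git branch -m master old-master
git checkout -q --orphan master
git commit -q -m "Import clean version of my code"

# Pretend, locally only, that the new root commit has old-master as its parent.
git replace --graft master old-master

grafted=$(git rev-list --count master)                     # new commit + old history
real=$(git --no-replace-objects rev-list --count master)   # just the new commit
echo "with graft: $grafted, really reachable: $real"
```

As with the grafts file, the new root commit stays visible in the log with the old history underneath it, while the real commit object is still parentless.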

You should probably use `git replace` in preference to grafts; I believe the latter has been deprecated. http://progit.org/2010/03/17/replace.html http://www.kernel.org/pub/software/scm/git/docs/git-replace.html — Emil Sit, Apr 20 '11 at 03:27
@Emil Ah, I'd missed the introduction of `replace`. Thanks for the heads up, that does look a bit easier to deal with. — Brian Campbell, Apr 20 '11 at 04:26
`git checkout --orphan` (Git 1.7.2 and later) obviates the additional repository (and copying the files, etc.): `git branch -m master old-master && git checkout --orphan master && git commit -m 'initial public release'` — Chris Johnsen, Apr 20 '11 at 05:23
@Chris Thanks for pointing that out! Wow, having switched to a new job using Perforce for the past year has really kept me out of the loop on new Git features. — Brian Campbell, Apr 20 '11 at 05:29
Whoa, there's a gotcha: tags. (I fortunately tested this out locally before sending it to github.) If you have old tags, and you `git push --tags origin master`, all of your old tags and their history will be sent as well! — Seth Johnson, Apr 23 '11 at 23:37
Just make sure to run `git tag -d $(git tag -l)` before `git push --tags origin` — Seth Johnson, Apr 23 '11 at 23:40
@Seth Ah, yes, that is a gotcha. I'll edit to mention tags so it doesn't trip anyone else up who comes across this post. — Brian Campbell, Apr 24 '11 at 02:32
Wow, that's pretty cool! I seem to be missing something though: This seems kind of hackish to me, why doesn't GitHub just let you do shallow pushes in the same way it lets you do shallow fetches? — Zaz, Jun 04 '15 at 04:37
@Zaz Basically, in Git having commits without their parents is considered to be a broken state. To make it convenient to clone a large repository without having to bring all history with it, you can do shallow fetches, but such a repository can only do a subset of the things that you can do with the full history. There's no support in Git for pushing shallow history, as once you are sharing history with other people rather than just grabbing the latest version for your own use, you really do need the full history or else lots of things won't work. — Brian Campbell, Jun 05 '15 at 16:13
Fair enough. I guess what I'm saying is that I don't see the logic in Git treating initial commit objects as fundamentally different to normal commit objects. Yes, the state of being an initial commit is fundamentally different from the state of being a normal commit, but the commit objects themselves are not fundamentally different. Removing the `parent ` line will turn a normal commit into an initial commit, but has the undesirable side-effect of changing the commit, and thus, its hash, forcing you to rebase all of its descendants. — Zaz, Jun 05 '15 at 23:02
The two separate repositories, does that require the oldrepo already being set up with the old-master and orphaned master? — user2375667, Jun 21 '15 at 13:03
@user2375667 Yes, in this example, setting up `newrepo` is a step that happens after you set up the `old-master` and new orphaned `master` branch in `oldrepo`. This way, `oldrepo` has the full history (and can pull in new commits from `newrepo`), while `newrepo` is where you do new work and push to GitHub, in order to avoid accidentally pushing some of the old history. — Brian Campbell, Jun 21 '15 at 16:37

score 2 · Answer 2 · edited Apr 16 '20 at 11:55

Brian's answer seems to be complete and knowledgeable, yet a bit complex.

The easier solution would be to keep two repositories.

A private GitHub repository which you work on. You do all of the full-history pushes to that repository.

The second repository is a public GitHub repository to which you publish only when you want to "release" a new version to the public. You publish to it by using a simple diff + patch, and then commit + push.
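
A rough sketch of what one diff + patch "release" could look like, using two throwaway repositories. The directory names private and public, the `released` tag used to remember the last published state, the file name and the demo identity are all assumptions made for the example, not something the answer prescribes.

```shell
#!/bin/sh
# Scratch sketch of publishing via diff + patch; all names are hypothetical.
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

scratch=$(mktemp -d)
mkdir "$scratch/private" "$scratch/public"

# Private repository with the full history; "released" marks the last published state.
cd "$scratch/private"
git init -q .
echo "version 1" > code.txt
git add code.txt
git commit -q -m "private work 1"
git tag released

# Public repository starts from that same published state.
cd "$scratch/public"
git init -q .
cp "$scratch/private/code.txt" .
git add code.txt
git commit -q -m "Initial public release"

# More private work.
cd "$scratch/private"
echo "version 2" > code.txt
git commit -q -am "private work 2"

# Publish: diff since the last release, apply it in the public repo, commit.
git diff --binary released HEAD > "$scratch/release.patch"
cd "$scratch/public"
git apply --index "$scratch/release.patch"
git commit -q -m "Release version 2"
git -C "$scratch/private" tag -f released > /dev/null   # remember the new release point

public_commits=$(git rev-list --count HEAD)
echo "public repo now has $public_commits commits"
```

Each release becomes a single squashed commit in the public repository, and these manual steps have to be repeated for every release.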

answered Apr 20 '11 at 11:48 by Guy · edited Apr 16 '20 at 11:55 by Peter Mortensen

While my solution is slightly more complex to set up (though not much; I gave a fairly detailed explanation with a few alternatives for older versions of Git which makes it look complex, but in reality it's fairly simple), it is much simpler once you've set it up. In my solution, you just commit and push as normal, you don't have to do extra work to apply every change to one repo and then again as a diff + patch to your new repo. – Brian Campbell Apr 20 '11 at 12:21
@Brian Thanks. I guess it wasn't clear from your post. I take your word for it and stand corrected. – Guy Apr 20 '11 at 19:06

score 1 · Answer 3 · edited Apr 16 '20 at 11:59

A simple way of doing this is as follows.

Say you have commits C1 to C10 in REPO-A, where C1 is the initial commit and C10 is the latest HEAD. And you want to create a new REPO-B such that it has commits C4 to C8 (a subset).

NOTE: Using this method would change the commit SHAs (for example, C4' to C8' in this case), but the changes each commit holds will remain the same, and your first commit now will begin with all the changes of your earlier commits till that point combined.

What should I do?

Recursively copy everything over on your local machine

cp -R REPO-A REPO-B

Optionally remove all remotes from your REPO-B, since most probably you want to use this as a separate repository.

cd REPO-B
git remote -v
git remote remove REMOTE_NAME

Force-move the branch pointer to the later end of your subset. For the subset C4 to C8 that would be C8. Most likely, though, you want a subset that runs up to HEAD (for example, C4 to C10 or C6 to C10), in which case the step below is not required.

git checkout -b temp
git branch -f master C8
git checkout master
git branch -D temp

Enter the commit SHA of the earlier end of your subset in the file .git/info/grafts. In this case it is the SHA of commit C4.

git rev-parse --verify C4 >> .git/info/grafts

Do a Git branch filtering without any arguments:

git filter-branch

Or if that does not work:

git filter-branch --all

You can now push this to a separate/new remote if you want to:

git remote add origin NEWREMOTE
git push -u origin master

How does it work?

This link explains how it works: http://git.661346.n2.nabble.com/how-to-delete-the-entire-history-before-a-certain-commit-td5000540.html

You can read about grafts on the git-filter-branch(1) manpage, in gitrepository-layout(5) Git repository layout description, and in gitglossary(7), a Git glossary.

In short, each line in .git/info/grafts consists of the SHA-1 id of an object, followed by a space-separated list of its effective (grafted) parents. So to cut off the history before a commit, e.g. a3eb250f996bf5e, you put a line containing only that SHA-1 into the .git/info/grafts file, which makes the commit look like a parentless root commit, e.g.:

$ git rev-parse --verify a3eb250f996bf5e >> .git/info/grafts
