I'm looking for a way to set up Git repositories that include subsets of files from a larger repository, and inherit the history from that main repository. My primary motivation is to be able to share subsets of the code via GitHub.

I currently manage my research-related (mostly Matlab) code via a single git repository. The code itself is loosely organized into a handful of folders, with code dependencies that often cross over folders. I don't want to upload a remote copy of the whole repository, because it includes a lot of mixed projects that no one else would want in its entirety.

My mental picture of this involves a separate repository for each project that tracks only the relevant files for that project, but inherits all the commits from the main repository. Ideally, I'd like to be able to tag versions within these sub-repositories separate from the main one, but that's not a necessity. I've looked into git submodules, subtrees, and gitslave, but all of these seem to assume that the subprojects are isolated collections of files, while in my case many subprojects share files with other subprojects. I also attempted to create a project-specific branch, git rm-ing irrelevant files, but that fell apart as soon as I needed to merge changes from the main branch into the project branch (a mess of conflicts due to changes in project-deleted files).

The stats:

  • 8096 files in main repository
  • 14 subprojects I want to share
  • 394 total files in those subprojects
  • 276 files belong to only 1 project, 57 to 2, 60 to 3, and 1 to 6.

I currently share code by simply copying the relevant files to a new folder periodically for each project. But this means that the new copies have no commit history attached. Is there a more robust method of sharing these various subsets of code, and keeping them up to date with changes I make?

Kelly Kearney
  • The terms you're looking for are `submodule` and `subtree`. They're two different solutions to similar problems such as you described. Research them both and choose for yourself. – Adam Aug 01 '18 at 01:26
  • This will not work well (my gut feeling). You can, however, take your current repository, make a copy and rework all its commits so only the files you want to be in there are present. You can then share that. Or move your work over there. – Thorbjørn Ravn Andersen Dec 26 '18 at 11:32
  • some info on `git subtree`: https://stackoverflow.com/a/61273621/8910547 – Inigo Apr 28 '20 at 02:22

3 Answers

As I understand your question:

  • you have one big repo containing multiple subprojects
  • you want to extract and share each subproject as its own repository, still containing the history/commits for (only) that subproject
  • the subprojects share some files, so the files used by one subproject are not strictly contained in a single subdirectory; this is why you can't simply use git subtree or git submodules

One way to extract the history of just a subset of the files into a dedicated branch (which you then can push into a dedicated repository) is using git filter-branch:

# regex to match the files included in this subproject, used below
file_list_regex='^subproject1/|^shared_file1$|^lib/shared_lib2$'

git checkout -b subproject1 # create new branch from current HEAD

git filter-branch --prune-empty \
  --index-filter "git ls-files --cached | grep -v -E '$file_list_regex' | xargs -r git rm --cached" \
  HEAD

This will

  • first create a new branch subproject1 based on the current HEAD (git checkout -b subproject1)
  • traverse its whole history (git filter-branch [...] HEAD)
  • remove from the index of each commit (xargs -r git rm --cached) all files that are not part of the subproject (git ls-files --cached | grep -v -E '$file_list_regex')
  • drop from that branch all commits that did not touch any of the subproject files (--prune-empty)
  • operate only on the index, without checking out each revision (--index-filter/--cached)
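To see which files the filter would remove without touching a repository, the selection step can be tried on a plain list of paths. This is only a sketch: paths_to_remove is a hypothetical helper name, and the sample listing stands in for the output of git ls-files --cached with made-up paths.

```shell
#!/bin/sh
# Same regex as in the answer: paths that belong to the subproject
file_list_regex='^subproject1/|^shared_file1$|^lib/shared_lib2$'

# Hypothetical helper: prints every path from stdin that does NOT match
# the subproject regex, i.e. the paths the index filter would remove.
paths_to_remove() {
  # -v inverts the match, -E enables extended regexes (needed for '|')
  grep -v -E "$file_list_regex"
}

# Made-up listing standing in for `git ls-files --cached` output
sample_listing='subproject1/main.m
subproject2/other.m
shared_file1
lib/shared_lib2
lib/unrelated_lib'

removed="$(printf '%s\n' "$sample_listing" | paths_to_remove)"
printf '%s\n' "$removed"
```

With this listing, only subproject2/other.m and lib/unrelated_lib are printed; everything matched by the regex stays in the rewritten commits.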

This is a one-time operation, but as I understand your question you want to continuously update the extracted subproject repositories/branches with new commits. The good news is that you can simply repeat this command, since git filter-branch will always produce the same commits/history for your subproject branches, provided that you don't manually alter them or rewrite your master branch.

The drawback is that this filters the complete history again and again, each time and for each subproject. If you only want to add the last 5 commits of the master branch to the tip of your existing subproject1 branch, you can adapt the commands like this:

# get the full commit ids for the commits we consider
# to be equivalent in master and subproject1 branch
# (master~5 is the parent of the oldest of the last 5 commits)
common_base_commit="$(git rev-parse master~5)"
subproject_tip="$(git rev-parse subproject1)"

# checkout a detached HEAD so we don't change the master branch
git checkout --detach master

git filter-branch --prune-empty \
  --index-filter "git ls-files --cached | grep -v -E '$file_list_regex' | xargs -r git rm --cached" \
  --parent-filter "sed s/${common_base_commit}/${subproject_tip}/g" \
  ${common_base_commit}..HEAD

# force reset subproject1 branch to current HEAD
git branch -f subproject1

Explanation:

  • This will only rewrite the last 5 commits (git filter-branch [...] ${common_base_commit}..HEAD), i.e. everything after master~5, which we consider to be the commit equivalent to subproject1's current tip.
  • For (the first of) those commits it will rewrite its parent from master~5 to subproject1 (--parent-filter "sed s/${common_base_commit}/${subproject_tip}/g"), effectively rebasing the 5 rewritten commits on top of subproject1.
  • Finally we only need to update subproject1 to include the new commits on top of it.
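The effect of the parent filter can be checked in isolation. git filter-branch passes each commit's parent list to the filter on stdin in the form "-p <parent>" and reads the new list from stdout. The sketch below feeds such strings through the same sed expression; the 40-character ids are made up, and rewrite_parents is a hypothetical helper name.

```shell
#!/bin/sh
# Made-up ids standing in for real commit hashes
common_base_commit='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
subproject_tip='bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb'
other_parent='cccccccccccccccccccccccccccccccccccccccc'

# Hypothetical helper wrapping the sed expression used as --parent-filter
rewrite_parents() {
  sed "s/${common_base_commit}/${subproject_tip}/g"
}

# The oldest rewritten commit has the common base as its parent: re-parented
first_commit_parents="$(printf '%s\n' "-p ${common_base_commit}" | rewrite_parents)"

# Any other parent id does not match the pattern: passes through unchanged
later_commit_parents="$(printf '%s\n' "-p ${other_parent}" | rewrite_parents)"

printf '%s\n%s\n' "$first_commit_parents" "$later_commit_parents"
```

Only the parent string that names the common base changes, which is why the rewritten commits end up attached to the subproject branch tip.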

Further optimization/automation:

  • implement better logic to list the files you want to include ($file_list_regex) or actually to exclude (git ls-files --cached | grep -v -E '$file_list_regex') from a given subproject
  • make the list of files to include depend on the current commit ($GIT_COMMIT), or check the list in to the repository itself, in case the files to include per subproject may change over time
  • find an automated way to find the 'equivalent' commit of a subproject branch's tip in the current master
  • combine all of it in a nice git alias so you can simply use git update-project subproject1
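Two of these points could be approached roughly as follows. This is a sketch under assumptions, not a tested workflow: regex_from_manifest and find_equivalent_commit are hypothetical helper names, the manifest format (one path per line, a trailing / marking a directory) is invented for illustration, and matching commits by author timestamp plus subject is a heuristic that relies on filter-branch leaving both unchanged by default. Both helpers work on plain text so they can be tried without a repository.

```shell
#!/bin/sh
# Hypothetical helper: turn a manifest (one path per line, directories end
# in '/') into an anchored extended regex usable as $file_list_regex.
regex_from_manifest() {
  sed -e 's/[][\.^$*+?(){}|]/\\&/g' \
      -e 's/^/^/' \
      -e '/\/$/!s/$/$/' | paste -s -d '|' -
  # 1st expression: escape regex metacharacters in the path
  # 2nd expression: anchor every entry at the start of the path
  # 3rd expression: files (no trailing '/') must match up to the end
}

# Hypothetical helper: read "hash<TAB>author-timestamp<TAB>subject" lines and
# print the first hash whose timestamp ($1) and subject ($2) both match.
find_equivalent_commit() {
  awk -F '\t' -v at="$1" -v subj="$2" \
    '$2 == at && $3 == subj { print $1; exit }'
}

# Made-up manifest and log data for illustration
manifest='subproject1/
shared_file1
tools/plot.m'
file_list_regex="$(printf '%s\n' "$manifest" | regex_from_manifest)"

sample_log="$(printf '%s\t%s\t%s\n' \
  1111111111111111111111111111111111111111 1700000300 'Fix plotting bug' \
  2222222222222222222222222222222222222222 1700000200 'Add shared helper' \
  3333333333333333333333333333333333333333 1700000100 'Initial import')"
equivalent="$(printf '%s\n' "$sample_log" |
  find_equivalent_commit 1700000200 'Add shared helper')"

printf '%s\n%s\n' "$file_list_regex" "$equivalent"

# In a real repository the inputs would come from git, for example:
#   at="$(git log -1 --format=%at subproject1)"
#   subj="$(git log -1 --format=%s subproject1)"
#   git log --format='%H%x09%at%x09%s' master | find_equivalent_commit "$at" "$subj"
```

The heuristic can pick the wrong commit when two commits share the same author timestamp and subject, so the result is worth checking before it is used as the common base.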
acran

You are looking for git submodules:

It often happens that while working on one project, you need to use another project from within it. Perhaps it’s a library that a third party developed or that you’re developing separately and using in multiple parent projects. A common issue arises in these scenarios: you want to be able to treat the two projects as separate yet still be able to use one from within the other.

The TL;DR on submodules is that they are repos contained within other repos.

The only thing the parent repo knows about the child is the SHA of the last commit that the child told it about, so each repo is managed independently of the other, but the parent's reference to the child lets you compose them together.

Here's a well-written blog post from GitHub on the topic.

Choylton B. Higginbottom

Let me first summarize your question:

  • You have a big repository
  • You want to split it into sub-repositories
  • You want to keep the integrity of your history

From your stats I can see that you have 14 sub-projects stored in one master repository. This is usually a poor setup, because every time someone clones the repository, they also get the full history of all the sub-projects. For instance, if I want to contribute to one of your sub-projects, I do not want to carry all of your 8096 files.

If the projects are unrelated to each other, just split them into separate repositories. With GitHub you can create organizations. Do not hesitate to create your own organization and put all your sub-projects into it. The main advantage is that each sub-project will have:

  • Its own wiki
  • Its own issue tracker
  • Its own front page

If you have related projects, each of which needs to be pinned to a particular commit, I recommend using git submodules. For example, if you look at the ext/ folder of the TortoiseGit project, you will notice links to other repositories.

Another solution would be to use git subtree, which does not seem to be the best solution for your problem.

If your master repository falls into any of these categories, you should review your way of using Git:

  • The repository is larger than 100 MB
  • The repository stores artifacts (.exe, .tmp, binaries, generated files, .pdf...)

Is your repository public on GitHub?

nowox