Merging Git-Repos after migration from RTC-Jazz

Question

We have been working with Jazz-RTC for around 15 years and now have to migrate to Git within a short time frame.

Our workflow was to create a stream for each release. Each stream contains components that represent the different folders of the project, i.e. server, gui, doc, db, etc.

Over time we added new components to newer streams so that the code-base now looks something like this:

V1.0
 |_server
 |_gui
 |_db
V2.0
 |_server
 |_gui
 |_db
 |_doc
V2.5
 |_server
 |_gui
 |_db
 |_doc
V3.0
 |_server
 |_gui
 |_db
 |_doc
 |_reports
....

Our migration script works as follows: for each stream (V2.0, V3.0, ...) it takes each component (server, gui, ...) and creates a separate Git repository from it.

The change sets are applied as commits in the respective repository, so we have retained the history of every component. This also means that no repository has branches, just a linear commit history on a single (master) branch.

Obviously, there is duplicated code across the Git repos. E.g. the V2.0 server repo has mostly the same files as the V3.0 server repo, with only minor changes in some files.

What we'd like to do now is combine these different Git repositories into one, so that the structure looks something like this:

Combined_Project
  |_server
  |_gui
  |_db
  |_doc
  |_reports

Of course, we need the history of file changes (i.e. commits) to be in the right order (ordered by date).

To achieve this we would prefer a Git-internal solution, but we would also accept third-party tools.

I have researched this topic for days now, but the more info I find about it, the more confused I get.

Doing a simple `git remote add -f V2.0gui <gui-from-other-repo>` followed by `git merge V2.0gui/master` creates a merge commit and merges the repositories, but in the log I see that the commits are not in the right order (e.g. we have commits from March 2022 that come before commits from January 2022).

I have tried to rebase the "remote" repositories into a common repo but this also messes up the commit history.

The question is, how would this task be tackled in the best way? What tools or strategies would you use?

Update: As the whole code base has been worked on in a linear fashion, it would suffice to end up with one Git repository with no branches. This means that the commits of the different repositories should all be on the master branch of the resulting repo, ordered by their date of check-in/commit.

VonC · Answer 1 · 2022-08-21T00:19:09.713

2

> I had something more like this (as a result) in my mind: One Repository, no branches, just master

Having done a lot of RTC-to-Git migrations lately, I can attest that I never follow that approach.

Instead:

- one repository per RTC component
- generally only one stream is imported, as main in the new Git repository
- only a few baselines from the RTC stream are imported, as shown here.

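# Run from inside the new Git repository
cd /path/to/git/repo
# --work-tree is an option of git itself, so it goes before the "add" subcommand
git --work-tree=/path/to/local/RTC/sandbox/aComponent add .
git commit -m "release x"
# change baseline in local workspace, then repeat the add/commit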

edited Aug 21 '22 at 00:19

answered Aug 20 '22 at 18:14


The problem is that the components need to be in the same repository. The team does not want to do a `git pull` ten times for ten components. Also, they are logically part of the same application, so keeping them in different repositories is not an option. We have used streams as a kind of release "branches" in RTC. I have migrated the latest stream (with its 10 components) into 10 Git repos and was able to preserve all the change sets as commits in their respective Git repo. Now I want to merge those repos into one repo and keep the combined history of all the components. – m4110c Aug 22 '22 at 08:49
@m4110c No need for "`git pull` ten times": you simply make a parent repository in which you add your "components" (individual Git repositories) as [submodules](https://git-scm.com/book/en/v2/Git-Tools-Submodules). That allows for common development while keeping each repository independent. It is the equivalent of adding RTC components to a stream. – VonC Aug 22 '22 at 08:51
That sounds like a possible solution to me. Another one I just discovered is this answer about merging Git repositories while keeping their history intact (keep in mind that I have already migrated all the components into separate Git repos): https://stackoverflow.com/questions/13040958/merge-two-git-repositories-without-breaking-file-history What is your opinion about the approach in the accepted answer? – m4110c Aug 22 '22 at 08:56
@m4110c That would be an approach (called "monorepo") I would not recommend. It leads to a Git repository too big to be easily manageable, and with tags or branches applied to *all* components instead of individual Git repositories, for targeted development. – VonC Aug 22 '22 at 08:59
Ah now I get your point! Sure, I would keep separate repositories for logically separate parts of the application anyway... The submodules idea sounds very good though. I'll look into it. Thanks for your help! – m4110c Aug 22 '22 at 09:12
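
The submodule setup suggested in the comments above could look roughly like the following. This is a minimal sketch, not taken from the thread: `make_parent` and its `name=url` argument format are hypothetical, and the component repositories are assumed to exist already and to contain at least one commit.

```shell
# Hypothetical helper: create a parent repository and register each component
# repository as a submodule, in a directory named after the component.
# Usage: make_parent <parent-dir> <name>=<url>...
make_parent() {
    parent=$1; shift
    git init -q "$parent"
    for spec in "$@"; do
        name=${spec%%=*}   # text before the first "="
        url=${spec#*=}     # text after the first "="
        # Clones the component into <parent-dir>/<name> and records it in .gitmodules
        git -C "$parent" submodule --quiet add "$url" "$name"
    done
    # The parent commit pins each component to the commit that was just cloned
    git -C "$parent" commit -q -m "Add components as submodules"
}
```

For example, `make_parent app server=<server-repo-url> gui=<gui-repo-url>`. Team members can then fetch everything in one step with `git clone --recurse-submodules <parent-repo-url>`.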

score 1 · Answer 2 · answered Aug 20 '22 at 10:14

Commits connect directly to previous commits, by hash ID. This forms the Directed Acyclic Graph (DAG) of commits. Commits are also immutable, so to combine two separate Git repositories with two separate graphs:

A--B--C   <-- master   [in repo 1]

D--E--F--G--H   <-- master   [in repo 2]

into a single combined repository with a graph such as:

      C'  <-- v1-branch
     /
AD-BE
     \
      F'-G'-H'  <-- v2-branch

where AD is either A or D (because they're essentially identical) and BE is either B or E (for the same reason), you can literally copy A-and-B, or D-and-E (but not A-and-E for instance since E always points back to D) into a fresh, new, empty repository, but then when you go to copy C, you may have to replace it with a new C' that's like C (snapshot) but different from C (different parent hash ID). If you took A-B as is, you can take C as-is, but now you have to replace F with F' so that it points back to B instead of E, and then you have to replace G with G' so that it points back to F', and so on.
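
The "replace C with C'" step can be sketched with Git plumbing. This is only an illustration under assumptions, not the tooling discussed in this answer: `reparent_commit` is a hypothetical helper, and both commits are assumed to be present in the same repository already (for example after a `git fetch` from the other repository). It reuses the snapshot and metadata of the original commit and records a different parent, which is why the copy gets a new hash ID.

```shell
# Hypothetical helper: create a copy of commit $1 whose only parent is $2.
# Prints the hash ID of the new commit; the original commit stays untouched.
reparent_commit() {
    orig=$1
    new_parent=$2
    # Same tree (snapshot), same author/committer data, same message:
    # only the parent link differs from the original commit.
    GIT_AUTHOR_NAME=$(git log -1 --format=%an "$orig") \
    GIT_AUTHOR_EMAIL=$(git log -1 --format=%ae "$orig") \
    GIT_AUTHOR_DATE=$(git log -1 --format=%aI "$orig") \
    GIT_COMMITTER_NAME=$(git log -1 --format=%cn "$orig") \
    GIT_COMMITTER_EMAIL=$(git log -1 --format=%ce "$orig") \
    GIT_COMMITTER_DATE=$(git log -1 --format=%cI "$orig") \
    git commit-tree "$orig^{tree}" -p "$new_parent" \
        -m "$(git log -1 --format=%B "$orig")"
}
```

Copying F, G and H onto B would then mean calling this once per commit, each time passing the hash ID printed by the previous call as the new parent.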

The two tools that Git comes with—well, one tool that it comes with, one that you can get for it—that do this sort of thing are git filter-branch and git filter-repo. Filter-branch is hard to use correctly. Filter-repo generally requires a little more code-writing as it's a Python script that will evaluate your own Python code.

In this particular case, you might want to just take the existing filter-repo code and rework it to read multiple input repositories and figure out on its own which commits to join with which previous commits. This won't be easy, no matter how you go about doing it.

I had something more like this (as a result) in my mind: One Repository, no branches, just master: `A-D-B-E-C-F-G-H` (Depending on the date, when they were committed). Wouldn't that make things easier? For those small parts where we deviated from the linear progression I could create branches (not many needed) and commit the changes manually onto them. — m4110c, Aug 20 '22 at 11:50
If there are never any branches, it does get easier in that there's only one thing to pick as the "next commit". But if there are any branches—if the history ever diverges—you're back to the general case, which will be tricky. — torek, Aug 20 '22 at 11:52
Yes, that's what I thought. We diverged very seldom (mostly for specific bug fixes or testing/logging stuff) and created special streams each time. As those streams mostly contain minor code changes, I imagined that I could recreate them by hand. The main line is the most important and the most cumbersome part, because we have many streams (~20) with up to 10 components each. — m4110c, Aug 20 '22 at 11:55
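
For the purely linear case discussed in these comments, a date-ordered replay could be sketched as follows. This uses plain Git plumbing rather than filter-repo, and it rests on several assumptions: `combine_linear` is a hypothetical helper, every input repository has a strictly linear history, timestamps never decrease within one repository, the committer date (`%ct`) reflects the original check-in time (swap in `%at` if the migration stored it as the author date), and repository directory names contain no whitespace.

```shell
# Hypothetical helper: replay several linear component repositories into one
# new repository with a single branch, ordered by committer date.
# Usage: combine_linear <new-repo> <component-repo>...
# Each component lands in a subdirectory named after its repository directory.
combine_linear() {
    combined=$1; shift
    git init -q "$combined"
    list=$(mktemp)
    for repo in "$@"; do
        name=$(basename "$repo")
        # Copy the component's objects into the combined repository
        git -C "$combined" fetch -q "$(cd "$repo" && pwd)" "HEAD:refs/import/$name"
        # Record "<timestamp> <component> <hash>" per commit, oldest first
        git -C "$repo" log --reverse --format="%ct $name %H" >>"$list"
    done
    # Stable numeric sort: ties keep each component's own commit order
    sort -s -n -k1,1 "$list" >"$list.sorted"
    parent=
    while read -r ts name sha; do
        # Replace only this component's subdirectory in the index
        git -C "$combined" rm -r -q -f --cached --ignore-unmatch -- "$name"
        git -C "$combined" read-tree --prefix="$name/" "$sha"
        tree=$(git -C "$combined" write-tree)
        # New commit with the original metadata, chained onto the previous one
        parent=$(
            cd "$combined" &&
            GIT_AUTHOR_NAME=$(git log -1 --format=%an "$sha") \
            GIT_AUTHOR_EMAIL=$(git log -1 --format=%ae "$sha") \
            GIT_AUTHOR_DATE=$(git log -1 --format=%aI "$sha") \
            GIT_COMMITTER_NAME=$(git log -1 --format=%cn "$sha") \
            GIT_COMMITTER_EMAIL=$(git log -1 --format=%ce "$sha") \
            GIT_COMMITTER_DATE=$(git log -1 --format=%cI "$sha") \
            git commit-tree "$tree" ${parent:+-p "$parent"} \
                -m "$(git log -1 --format=%B "$sha")"
        )
    done <"$list.sorted"
    # Point the current (still unborn) branch at the last commit and check it out
    git -C "$combined" update-ref HEAD "$parent"
    git -C "$combined" reset -q --hard
    rm -f "$list" "$list.sorted"
}
```

Called as `combine_linear Combined_Project server gui db`, this would give a history of the `A-D-B-E-C-F-G-H` kind. The imported originals stay reachable under `refs/import/` and can be deleted once the result has been checked. Histories that diverge are not handled.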
