
Our company has decided to migrate our source code from ClearCase to Git, which is great :-)

I know that ClearCase and Git are completely different source code management systems. But we developers would like to end up with a single SCM that contains the complete history.

My colleague found the following tool, which imports ClearCase history into Git: https://github.com/charleso/git-cc

Unfortunately, our code base has more than 46,000 source files, and the history to import spans more than 10 years.

I analyzed this tool, and in my opinion there are two bottlenecks. The first is fetching the files from the ClearCase server, which should be easy to solve by fetching in multiple threads. The second is the workflow of git-cc itself:

  1. Get the history of the master branch via cleartool lshistory
  2. Build changesets from the file versions and group them into commits
  3. Get the specified version of the file(s) from the ClearCase server and copy them to the working directory
  4. git add .
  5. git commit
  6. Pick the next group and start again at step 3

I think I could improve this by using low-level Git commands and multiple threads.

Each commit group fetches its changes from the server and creates blob objects in the Git object database, so this could run for multiple groups in multiple threads. In addition, I would have one dedicated thread that builds the history in Git from the blob objects that have just been created.
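To illustrate why the blob step can run in parallel, here is a minimal sketch that writes loose blob objects from several threads. It assumes a SHA-1 repository and the standard loose-object layout (a zlib-compressed `blob <size>\0<content>` stored under `objects/<first 2 hex digits>/<remaining 38>`). The names `write_loose_blob` and `import_blobs` and the version keys are hypothetical; in a real import, `git hash-object -w` does the same job. Because each object's file name is derived from its content, two threads never write different data to the same path.

```python
import hashlib
import os
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor


def write_loose_blob(objects_dir: str, content: bytes) -> str:
    """Store content as a loose Git blob and return its SHA-1 object id."""
    store = b"blob " + str(len(content)).encode() + b"\0" + content
    oid = hashlib.sha1(store).hexdigest()
    target_dir = os.path.join(objects_dir, oid[:2])
    target = os.path.join(target_dir, oid[2:])
    if os.path.exists(target):
        return oid  # identical content is already stored
    os.makedirs(target_dir, exist_ok=True)
    # Write to a temporary file and rename it into place, so that a
    # reader never sees a partially written object.
    fd, tmp = tempfile.mkstemp(dir=target_dir)
    with os.fdopen(fd, "wb") as handle:
        handle.write(zlib.compress(store))
    os.replace(tmp, target)
    return oid


def import_blobs(objects_dir: str, versions: dict, workers: int = 4) -> dict:
    """Write all file versions in parallel; map each version name to its blob id."""
    names = list(versions)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        oids = list(pool.map(lambda n: write_loose_blob(objects_dir, versions[n]), names))
    return dict(zip(names, oids))
```

The single history thread would then only need the returned ids to build trees and commits in order, which is the part that has to stay sequential because each commit references its parent.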

My question is: does this make sense to you, or do you think I'm being naive?

Have I forgotten about any Git locking mechanism?

Do you have any other ideas?

David Jones
jungnick
  • Usually for importing you use [git-fast-import](https://git-scm.com/docs/git-fast-import). I don't know if it utilizes multiple CPUs (maybe not, as it gets all data as a single binary stream), but at least it does not do any extra I/O. – max630 Sep 16 '17 at 14:53
  • @max630 Hi max630, thanks for your reply. Avoiding any extra I/O is a good point. I will evaluate it. – jungnick Sep 17 '17 at 05:44
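
The git-fast-import suggestion from the comments can be sketched as follows. This is only an illustration under stated assumptions: the `Changeset` type, its fields, and `fast_import_stream` are hypothetical names, all files get mode 100644, timestamps are treated as UTC, and paths are assumed not to need quoting. The stream uses the `commit`, `committer`, `data`, and `M ... inline` commands from the git-fast-import documentation.

```python
from typing import NamedTuple


class Changeset(NamedTuple):
    author: str
    email: str
    timestamp: int   # seconds since the Unix epoch, assumed UTC
    message: str
    files: dict      # path -> file content as bytes


def data_block(payload: bytes) -> bytes:
    """Format an exact-byte-count 'data' command followed by its payload."""
    return b"data %d\n" % len(payload) + payload + b"\n"


def fast_import_stream(changesets, ref: str = "refs/heads/master") -> bytes:
    """Serialize changesets, oldest first, as one git fast-import stream."""
    out = []
    for cs in changesets:
        out.append(f"commit {ref}\n".encode())
        out.append(f"committer {cs.author} <{cs.email}> {cs.timestamp} +0000\n".encode())
        out.append(data_block(cs.message.encode()))
        for path, content in sorted(cs.files.items()):
            # 'inline' means the file content follows directly as a data block.
            out.append(f"M 100644 inline {path}\n".encode())
            out.append(data_block(content))
        out.append(b"\n")
    return b"".join(out)
```

The resulting bytes would be piped into `git fast-import` running inside the target repository. Within one stream, each later commit on the same ref takes the previous one as its parent, so no working directory, `git add`, or `git commit` is involved.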

1 Answer


Using multiple threads to import commits into the same branch of a Git repo is risky (unless, as you put it, you only create "blob objects" in parallel, that is, content that you can replay into commits afterwards).

But using multiple threads for commits on different branches is possible: you create a separate repo for each branch import, then fetch those repos into one common repo and reattach the branches with git replace or grafts.
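
A rough sketch of that reattachment step, assuming one already imported repo per branch: the helper `plan_reattach`, the repo paths, and the commit placeholders are hypothetical, and the function only builds the command lines (`git fetch` with a refspec and `git replace --graft <commit> <parent>`) without running them.

```python
def plan_reattach(main_repo: str, branch_repos: dict, grafts: list) -> list:
    """Return git command lines (argv lists); nothing is executed here.

    branch_repos maps a branch name to the path of its per-branch repo.
    grafts lists (child_commit, parent_commit) pairs: the root commit of
    an imported branch and the commit it originally branched off from.
    """
    commands = []
    for branch, repo_path in sorted(branch_repos.items()):
        # Bring the independently imported branch into the common repo.
        refspec = f"refs/heads/{branch}:refs/heads/{branch}"
        commands.append(["git", "-C", main_repo, "fetch", repo_path, refspec])
    for child, parent in grafts:
        # Record a replacement commit that gives 'child' the new parent.
        commands.append(["git", "-C", main_repo, "replace", "--graft", child, parent])
    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` once the real commit ids are known. Replacement refs are local by default, so they would still have to be pushed or made permanent by rewriting history.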

But remember: each Git repo is a component, so if your giant ClearCase VOB includes several components (groups of files), it would be best to split them into multiple Git repos rather than attempting to create one giant Git repo.
I detail that in "ClearCase to Git migration".

VonC
  • Hi VonC, thanks for your reply. I am not sure why it would be risky to create blob objects in parallel. Importing commits in parallel is clearly dangerous; for that I want to use one dedicated thread, which builds my history from the blob objects. In the end I have n threads that read my ClearCase history and create dangling blob objects, and one dedicated thread that builds my history. Another team is evaluating how we want to migrate our VOB structure into multiple smaller repos. Currently we are evaluating submodules and subtrees. – jungnick Sep 17 '17 at 05:42
  • @jungnick I describe the difference between submodule and subtree here: https://stackoverflow.com/a/31770147/6309. And give an example of subtree there: https://stackoverflow.com/a/24709789/6309. I prefer submodules, as I can reference a fixed point in history, allowing the parent repo to be cloned in a known state. – VonC Sep 17 '17 at 07:59