How to handle a Git repository with >100k test files?

Question

I am migrating a fairly big project to Git. The project consists of ~30k source code files and ~100k unit test files.

I see two options for the migration:

(1) Put all files into one repository. The huge number of files will make git operations slow (see here). Slow operations will annoy my developers (especially because they work on Windows where Git is slower in general). BTW: File size is no issue for this project.

(2) Put the test files into their own repository and include it as a Git submodule. This will annoy my developers because they always have to make 2 commits when they fix a bug.

How do you deal with this kind of situation? Is there a third way that I am not seeing?

Thanks!

Isn't there a more logical split you can make that keeps code and tests together? But splits across functional or technical layers? — jessehouwing, Jul 21 '15 at 09:19
@jessehouwing That would be ideal. Unfortunately it is unrealistic to refactor the code base during the migration. I guess these tests are more "integration tests" than "unit tests", too. — Lars Schneider, Jul 21 '15 at 10:04

score 3 · Answer 1 · answered Jan 27 '20 at 15:43

This answer is too late for the original question, but might be of interest for others in similar situations:

Given the number of files, I'd also suggest a submodule-based approach. However, working with submodules in Git can be cumbersome. One way to mitigate this would be using a DataLad dataset.

DataLad builds on Git (and git-annex, though having annexed content or an annex at all is completely optional), and a dataset is always a Git repository. Compared to working with Git alone, DataLad datasets and commands offer recursive operations through dataset hierarchies. Transitioning to a DataLad dataset would thus make working with submodules much easier, while everything remains a Git repository and Git-based workflows stay valid and functional.

score 0 · Answer 2 · edited May 23 '17 at 12:00

I would still recommend solution 2, as submodules are made for this kind of repo structure.
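To make the "2 commits" concern from the question concrete, here is a minimal sketch of the submodule workflow. It assumes a hypothetical layout (a `project` repository with the tests mounted as a submodule at `tests/`), builds throwaway repositories in a temporary directory, and passes `protocol.file.allow=always` only because the demo clones from a local path, which recent Git versions refuse for submodules by default. None of the names come from the original project.

```shell
#!/bin/sh
set -eu

demo=$(mktemp -d)

# Throwaway identity so the demo commits work on any machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical repository holding only the test files.
git init -q "$demo/tests-origin"
echo 'placeholder test' > "$demo/tests-origin/test_main.txt"
git -C "$demo/tests-origin" add .
git -C "$demo/tests-origin" commit -q -m "import tests"

# Hypothetical repository holding the source code.
git init -q "$demo/project"
echo 'int main(void) { return 0; }' > "$demo/project/main.c"
git -C "$demo/project" add .
git -C "$demo/project" commit -q -m "import sources"

# Mount the tests repository as a submodule at tests/.
# protocol.file.allow=always is needed only for this local-path demo.
git -C "$demo/project" -c protocol.file.allow=always \
    submodule --quiet add "$demo/tests-origin" tests
git -C "$demo/project" commit -q -m "add tests as submodule"

# A bug fix that touches both code and tests needs two commits:
# 1) commit the changed test inside the submodule ...
echo 'updated test' > "$demo/project/tests/test_main.txt"
git -C "$demo/project/tests" commit -q -am "adjust test for bug fix"

# 2) ... then commit the code change plus the new submodule pointer.
echo '/* bug fix */' >> "$demo/project/main.c"
git -C "$demo/project" add main.c tests
git -C "$demo/project" commit -q -m "fix bug and bump tests submodule"
```

In a real setup both repositories would also have to be pushed, the submodule first, so that the commit recorded in the parent repository exists on the server.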

The other approach would be for each developer to specify the folders they need and do a sparse checkout, possibly combined with a shallow clone, in order to minimize the size of the local repo.
That way, you are still dealing with only one Git repo, but only using the part of it you actually need.
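As a rough sketch of that second approach, the script below combines a shallow clone with a sparse checkout. It assumes Git 2.25 or newer (which introduced the `git sparse-checkout` command and `git clone --sparse`) and a hypothetical layout with top-level `src/` and `tests/` directories; the repository is a throwaway created in a temporary directory, not the real project.

```shell
#!/bin/sh
set -eu

demo=$(mktemp -d)

# Build a tiny stand-in repository (hypothetical layout: src/ and tests/).
git init -q "$demo/origin"
mkdir -p "$demo/origin/src" "$demo/origin/tests"
echo 'int main(void) { return 0; }' > "$demo/origin/src/main.c"
echo 'placeholder test' > "$demo/origin/tests/test_main.txt"
git -C "$demo/origin" add .
git -C "$demo/origin" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial import"
echo '/* second revision */' >> "$demo/origin/src/main.c"
git -C "$demo/origin" -c user.name=demo -c user.email=demo@example.com \
    commit -q -am "second commit"

# Shallow clone (only the latest commit) with sparse checkout enabled.
# The file:// URL matters: --depth is ignored for plain local paths.
git clone -q --depth 1 --sparse "file://$demo/origin" "$demo/work"

# Restrict the working tree to the directories this developer needs.
git -C "$demo/work" sparse-checkout set src

# Show what actually ended up on disk.
ls "$demo/work"
```

A developer who later needs the tests can widen the working tree with `git sparse-checkout set src tests`; the repository itself stays a single Git repo throughout.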

answered Jul 21 '15 at 09:20 · VonC

Thanks for your reply. I would rather avoid sparse checkout and shallow clone, as I consider them "advanced" Git features. Most of our devs are fairly new to Git. I hope you are OK if I keep the question open a little longer before I accept your answer. – Lars Schneider Jul 21 '15 at 10:07
@LarsSchneider I understand. Note that those features only work well with recent versions of Git. On Windows, you would need to use the latest release from https://github.com/git-for-windows/git/releases/ – VonC Jul 21 '15 at 10:14