How to manage a large Git repo?

Question

I have a Git repository that is very large (larger than 1 GB), and there is always an issue when we have to set up the repository on a new local instance. Is there a proven approach to solve this?

Yes, you can remove the large binary files from the repo which are causing it to bloat to 1GB (check SO for how to do this). If you really don't have any such files, and all 1GB is source code, then you must be sitting on a really large codebase. — Tim Biegeleisen, Jan 03 '18 at 06:15
What about partial cloning with subversion (rather than `git`)? — iBug, Jan 03 '18 at 06:18
Possible duplicate of [How to handle a large git repository?](https://stackoverflow.com/questions/12855926/how-to-handle-a-large-git-repository) — Rumid, Jan 03 '18 at 21:35

score 8 · Answer 1 · edited Jun 08 '22 at 20:29

If you don't need the full history right away, you can do a shallow clone. The `--depth` option works with any fairly recent version of Git (1.9 or later), while `--shallow-since` and `--shallow-exclude` need Git 2.11 or later:

- `git clone --depth 5 user@host:repo.git` truncates the repo history to the 5 most recent commits. `--depth` implies `--single-branch`, so add `--no-single-branch` if you want the recent history of every branch.
- `git clone --shallow-since=2017-12-01 user@host:repo.git` truncates the repo history to everything since 1 December 2017.
- `git clone --shallow-exclude=abc1234 user@host:repo.git` clones every revision except the specified one and the ones reachable from it. You can use `--shallow-exclude` several times to exclude several revisions.
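
To see the effect of `--depth` without touching a real server, here is a minimal sketch that builds a throwaway repository and clones it shallowly. The directory names, the commit count and the `file://` URL are assumptions made for the demonstration; they stand in for `user@host:repo.git`.

```shell
#!/bin/sh
# Sketch: compare a full history with a --depth 5 clone of a throwaway repository.
set -eu

work=$(mktemp -d)

# Hypothetical stand-in for user@host:repo.git, with 10 commits on one branch.
git init -q "$work/origin"
cd "$work/origin"
git config user.name "Example User"
git config user.email "user@example.invalid"
i=1
while [ "$i" -le 10 ]; do
    echo "revision $i" > file.txt
    git add file.txt
    git commit -q -m "commit $i"
    i=$((i + 1))
done

# A file:// URL is used because Git ignores --depth for plain local paths.
cd "$work"
git clone -q --depth 5 "file://$work/origin" shallow

full_count=$(git -C origin rev-list --count HEAD)
shallow_count=$(git -C shallow rev-list --count HEAD)
echo "full history: $full_count commits"
echo "shallow clone: $shallow_count commits"
```

If more history is needed later, `git fetch --unshallow` in the shallow clone retrieves the rest.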

You can also clone single branches with something like `git clone --branch master --single-branch user@host:repo.git`, which will only pull down the history of the `master` branch on the specified repo.
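
As a rough illustration of the single-branch case, the sketch below creates a throwaway repository with two branches and checks which of them end up in the clone. The `feature` branch, the directory names and the `has_branch` helper are hypothetical and exist only for this example.

```shell
#!/bin/sh
# Sketch: a --single-branch clone only receives the requested branch.
set -eu

work=$(mktemp -d)

# Hypothetical origin with a master branch and a feature branch.
git init -q "$work/origin"
cd "$work/origin"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.invalid"
echo "base" > file.txt
git add file.txt
git commit -q -m "base commit on master"
git checkout -q -b feature
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "feature work"
git checkout -q master

cd "$work"
git clone -q --branch master --single-branch "file://$work/origin" single

# Helper (hypothetical): report whether a remote-tracking branch exists in the clone.
has_branch() {
    if git -C single show-ref --verify --quiet "refs/remotes/origin/$1"; then
        echo yes
    else
        echo no
    fi
}

has_master=$(has_branch master)
has_feature=$(has_branch feature)
echo "origin/master fetched: $has_master"
echo "origin/feature fetched: $has_feature"
```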

There's a bit more detail at https://www.atlassian.com/blog/git/handle-big-repositories-git which may be helpful - especially if you're dealing with a repo that has large binary assets.

score 3 · Accepted Answer · answered Jan 03 '18 at 06:37

Set up a "depot" clone repository on a shared filesystem, containing the old history that won't change. Make all your further clones with --reference pointing at that repo, and its contents won't be duplicated in the new clones. Read the clone docs for usage advice, e.g. what to do before losing (or if you might lose) access to your reference depot.

jthill


Will you please elaborate on setting up "depot" clone repository? – prajwal_stha Jan 03 '18 at 07:53
The simplest way is to do a clone to a widely-shared filesystem location and never push, pull or fetch there again; use it only as a reference. You can optionally delete references to more-recent code, so that all clones referencing it get the recent content in their own object db rather than reading it from the shared depot. If access to the shared fs is slow, that will make a difference. – jthill Jan 03 '18 at 15:29
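
The depot idea discussed in this thread can be sketched roughly as follows. The `upstream`, `depot.git` and `dev1` names are hypothetical, and a temporary directory stands in for the shared filesystem; this is an illustration of `git clone --reference`, not necessarily the exact setup the answer has in mind.

```shell
#!/bin/sh
# Sketch: clone once into a "depot", then let later clones borrow its objects.
set -eu

work=$(mktemp -d)

# Hypothetical upstream repository with some old history.
git init -q "$work/upstream"
cd "$work/upstream"
git config user.name "Example User"
git config user.email "user@example.invalid"
echo "old history" > file.txt
git add file.txt
git commit -q -m "old history"

cd "$work"
# The depot: cloned once, then left alone and used only as a reference.
git clone -q --bare "file://$work/upstream" depot.git

# New working clones borrow objects from the depot instead of copying them.
git clone -q --reference "$work/depot.git" "file://$work/upstream" dev1

# The borrowing is recorded in the clone's alternates file.
alternates_file="$work/dev1/.git/objects/info/alternates"
echo "dev1 borrows objects from:"
cat "$alternates_file"
```

If the depot might disappear, the clone documentation describes `--dissociate`, which uses the reference only to speed up the clone and then stops borrowing from it.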

VonC · Answer 3 · 2023-09-03T07:18:16.107

There is now microsoft/scalar. It started three years ago as GVFS, then became VFS for Git, which moved into its own repository.
Since August 2019, it has been Scalar.

Scalar: A set of tools and extensions for Git to allow very large monorepos to run on Git without a virtualization layer

If your repo is hosted on a service that supports the GVFS Protocol, such as Azure Repos, then scalar clone <url> will create a local enlistment with abilities like on-demand object retrieval, background maintenance tasks, and automatically sets Git config values and hooks that enable performance enhancements.
Scalar also assists in setting up sparse enlistments.

It is integrated with Git for Windows 2.38 (Oct. 2022)

It is documented in Git 2.35 (Q1 2022):

scalar: start documenting the command

^{Signed-off-by: Johannes Schindelin}

Scalar is an opinionated repository management tool.

By creating new repositories or registering existing repositories with Scalar, your Git experience will speed up.
Scalar sets advanced Git config settings, maintains your repositories in the background, and helps reduce data sent across the network.

An important Scalar concept is the enlistment: this is the top-level directory of the project.
It usually contains the subdirectory src/ which is a Git worktree. This encourages the separation between tracked files (inside src/) and untracked files, such as build artifacts (outside src/).

When registering an existing Git worktree with Scalar whose name is not src, the enlistment will be identical to the worktree.

The scalar command implements various subcommands, and different options depending on the subcommand.

Again: It is integrated with Git for Windows 2.38 (Oct. 2022)

With Git 2.27 (Q2 2020), "git fetch" offers better support for scalar clone.

It also explains how scalar clone differs from a regular git clone and will handle larger repositories.

See commit b739d97 (13 Mar 2020) by Derrick Stolee (derrickstolee).
^{(Merged by Junio C Hamano -- gitster -- in commit 4cd9bb4, 25 Mar 2020)}

connected.c: reprepare packs for corner cases

^{Helped-by: Jeff King}
^{Helped-by: Junio Hamano}
^{Signed-off-by: Derrick Stolee}

While updating the microsoft/git fork on top of v2.26.0-rc0 and consuming that build into Scalar, I noticed a corner case bug around partial clone.

The "scalar clone" command can create a Git repository with the proper config for using partial clone with the "blob:none" filter.
Instead of calling "git clone", it runs "git init" then sets a few more config values before running "git fetch".

In our builds on v2.26.0-rc0, we noticed that our "git fetch" command was failing with
error: https://github.com/microsoft/scalar did not send all necessary objects
This does not happen if you copy the config file from a repository created by "git clone --filter=blob:none <url>", but it does happen when adding the config option "core.logAllRefUpdates = true".

By debugging, I was able to see that the loop inside check_connected() that checks if all refs are contained in promisor packs actually did not have any packfiles in the packed_git list.

I'm not sure what corner-case issues caused this config option to prevent the reprepare_packed_git() from being called at the proper spot during the fetch operation. This approach requires a situation where we use the remote helper process, which makes it difficult to test.

It is possible to place a reprepare_packed_git() call in the fetch code closer to where we receive a pack, but that leaves an opening for a later change to re-introduce this problem.
Further, a concurrent repack operation could replace the pack-file list we already loaded into memory, causing this issue in an even harder to reproduce scenario.

It is really the responsibility of anyone looping through the list of pack-files for a certain object to fall back to reprepare_packed_git() on a fail-to-find. The loop in check_connected() does not have this fallback, leading to this bug.

We _could_ try looping through the packs and only reprepare the packs after a miss, but that change is more involved and has little value.
Since this case is isolated to the case when opt->check_refs_are_promisor_objects_only is true, we are confident that we are verifying the refs after downloading new data. This implies that calling reprepare_packed_git() in advance is not a huge cost compared to the rest of the operations already made.

With Git 2.35 (Q1 2022), add pieces from "scalar" to contrib/.

See commit ddc35d8, commit 4582676, commit cb59d55, commit 4368e40, commit 546f822, commit f5f0842, commit 9187659, commit 829fe56, commit 0a43fb2, commit cd5a9ac (03 Dec 2021) by Johannes Schindelin (dscho).
See commit d85ada7 (03 Dec 2021) by Matthew John Cheetham (mjcheetham).
See commit 7020c88, commit 2b71045, commit c76a53e, commit d0feac4 (03 Dec 2021) by Derrick Stolee (derrickstolee).
^{(Merged by Junio C Hamano -- gitster -- in commit 62e83d4, 21 Dec 2021)}

scalar: implement the clone subcommand

^{Signed-off-by: Johannes Schindelin}

This implements Scalar's opinionated clone command: it tries to use a partial clone and sets up a sparse checkout by default.
In contrast to git clone, scalar clone sets up the worktree in the src/ subdirectory, to encourage a separation between the source files and the build output (which helps Git tremendously because it avoids untracked files that have to be specifically ignored when refreshing the index).

Also, it registers the repository for regular, scheduled maintenance, and configures a flurry of configuration settings based on the experience and experiments of the Microsoft Windows and the Microsoft Office development teams.

Note: since the scalar clone command is by far the most commonly called scalar subcommand, we document it at the top of the manual page.
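
For readers without Scalar installed, the two ideas named in this commit message (partial clone and sparse checkout) can be approximated with plain `git clone`. This is only a sketch under assumptions: it uses `--filter=blob:none` and `--sparse` (Git 2.25 or later) against a throwaway local repository, the file names are hypothetical, and it does not reproduce the maintenance registration or the extra config that `scalar clone` sets.

```shell
#!/bin/sh
# Sketch: plain-Git approximation of a partial clone with a sparse checkout.
set -eu

work=$(mktemp -d)

# Hypothetical origin with one top-level file and one nested file.
git init -q "$work/origin"
cd "$work/origin"
git config user.name "Example User"
git config user.email "user@example.invalid"
# Allow this local "server" to answer filtered (partial clone) requests.
git config uploadpack.allowFilter true
git config uploadpack.allowAnySHA1InWant true
mkdir sub
echo "top-level file" > top.txt
echo "nested file" > sub/deep.txt
git add top.txt sub/deep.txt
git commit -q -m "initial content"

cd "$work"
# blob:none skips file contents until they are needed; --sparse checks out
# only the files in the top-level directory.
git clone -q --filter=blob:none --sparse "file://$work/origin" partial

filter=$(git -C partial config remote.origin.partialclonefilter)
if [ -e partial/sub/deep.txt ]; then nested=present; else nested=absent; fi
echo "recorded filter: $filter"
echo "sub/deep.txt in worktree: $nested"
```

Running `git sparse-checkout add sub` inside the clone would then bring the nested directory into the worktree on demand.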

Git 2.36 (Q2 2022) includes new options for git scalar:

See commit 2ae8eb5 (28 Jan 2022) by Johannes Schindelin (dscho).
^{(Merged by Junio C Hamano -- gitster -- in commit ff6f169, 17 Feb 2022)}

scalar: accept -C and -c options before the subcommand

^{Signed-off-by: Johannes Schindelin}

The git executable has these two very useful options:

-C <directory>:
switch to the specified directory before performing any actions

-c <key>=<value>:
temporarily configure this setting for the duration of the specified scalar subcommand

With this commit, we teach the scalar executable the same trick.
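
The behaviour of these two options is easy to check with `git` itself; the sketch below uses a made-up config key, `example.greeting`, in a throwaway repository. According to the commit message, `scalar` accepts the same form in front of its subcommand (`scalar -C <directory> -c <key>=<value> <subcommand>`), which is not run here.

```shell
#!/bin/sh
# Sketch: -C switches directory, -c sets a config value for one invocation only.
set -eu

work=$(mktemp -d)
git init -q "$work/repo"

# example.greeting is a hypothetical key used only for this demonstration.
during=$(git -C "$work/repo" -c example.greeting=hello config example.greeting)

# -c never writes to a config file, so the value is gone on the next call.
after=$(git -C "$work/repo" config example.greeting || echo "unset")

echo "during the call: $during"
echo "after the call: $after"
```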

Git 2.37 (Q3 2022) comes with the implementation of "scalar diagnose" subcommand.

See commit 15d8adc, commit 93e804b (28 May 2022) by Matthew John Cheetham (mjcheetham).
See commit 0ed5b13, commit aa5c79a, commit b448557, commit de1f68a, commit 237a1d1 (28 May 2022) by Johannes Schindelin (dscho).
See commit 23f2356 (28 May 2022) by Junio C Hamano (gitster).
^{(Merged by Junio C Hamano -- gitster -- in commit 08baf19, 07 Jun 2022)}

scalar: implement scalar diagnose

^{Signed-off-by: Johannes Schindelin}

Over the course of Scalar's development, it became obvious that there is a need for a command that can gather all kinds of useful information that can help identify the most typical problems with large worktrees/repositories.

The diagnose command is the culmination of this hard-won knowledge: it gathers the installed hooks, the config, a couple statistics describing the data shape, among other pieces of information, and then wraps everything up in a tidy, neat .zip archive.

Note: originally, Scalar was implemented in C# using the .NET API, where we had the luxury of a comprehensive standard library that includes basic functionality such as writing a .zip file.
In the C version, we lack such a commodity.
Rather than introducing a dependency on, say, libzip, we slightly abuse Git's archive machinery: we write out a .zip of the empty tree, augmented by a couple files that are added via the --add-file* options.
We are careful trying not to modify the current repository in any way lest the very circumstances that required scalar diagnose to be run are changed by the diagnose run itself.

With Git 2.38 (Q3 2022), the stated goal of scalar is rephrased:

See commit 72d3a5d, commit f22c95d (12 Jul 2022) by Victoria Dye (vdye).
^{(Merged by Junio C Hamano -- gitster -- in commit 3a03633, 27 Jul 2022)}

scalar: reword command documentation to clarify purpose

^{Signed-off-by: Victoria Dye}
^{Acked-by: Derrick Stolee}

Rephrase documentation to describe scalar as a "large repo management tool" rather than an "opinionated management tool".
The new description is intended to more directly reflect the utility of scalar to better guide users in preparation for scalar being built and installed as part of Git.

New description:

scalar - A tool for managing large Git repositories

Scalar is a repository management tool that optimizes Git for use in large repositories.

Scalar improves performance by configuring advanced Git settings, maintaining repositories in the background, and helping to reduce data sent across the network.

With Git 2.38 (Q3 2022), the "diagnose" feature to create a zip archive for diagnostic material has been lifted from "scalar" and made into a feature of "git bugreport".

See commit 43370b1, commit 672196a, commit aac0e8f, commit 7ecf193, commit 6783fd3, commit 33cba72, commit bb2c349, commit 435a253, commit ba307a5, commit 91be401, commit 81ad551 (12 Aug 2022) by Victoria Dye (vdye).
^{(Merged by Junio C Hamano -- gitster -- in commit f00ddc9, 25 Aug 2022)}

builtin/diagnose.c: create 'git diagnose' builtin

^{Helped-by: Ævar Arnfjörð Bjarmason}
^{Helped-by: Derrick Stolee}
^{Signed-off-by: Victoria Dye}

Create a 'git diagnose' builtin to generate a standalone zip archive of repository diagnostics.

The "diagnose" functionality was originally implemented for Scalar in aa5c79a ("scalar: implement scalar diagnose", 2022-05-28, Git v2.37.0-rc0 -- merge listed in batch #8).
However, the diagnostics gathered are not specific to Scalar-cloned repositories and can be useful when diagnosing issues in any Git repository.

git diagnose now includes in its man page:

git-diagnose(1)

NAME

git-diagnose - Generate a zip archive of diagnostic information

SYNOPSIS

git diagnose [(-o | --output-directory) <path>] [(-s | --suffix) <format>]

DESCRIPTION

Collects detailed information about the user's machine, Git client, and repository state and packages that information into a zip archive. The generated archive can then, for example, be shared with the Git mailing list to help debug an issue or serve as a reference for independent debugging.

The following information is captured in the archive:

'git version --build-options'

The path to the repository root

The available disk space on the filesystem

The name and size of each packfile, including those in alternate object stores

The total count of loose objects, as well as counts broken down by .git/objects subdirectory

This tool differs from git bugreport in that it collects much more detailed information with a greater focus on reporting the size and data shape of repository contents.

OPTIONS

-o <path>

--output-directory <path>

Place the resulting diagnostics archive in <path> instead of the current directory.

-s <format>

--suffix <format>

Specify an alternate suffix for the diagnostics archive name, to create a file named 'git-diagnostics-<formatted suffix>'. This should take the form of a strftime(3) format string; the current local time will be used.
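
A possible way to try these two options is sketched below. It assumes Git 2.38 or later; on older versions the command does not exist, so the script only reports that no archive was produced. The output directory and the literal suffix `sketch` are choices made for this example.

```shell
#!/bin/sh
# Sketch: write a diagnostics archive for a throwaway repository.
set -eu

work=$(mktemp -d)
git init -q "$work/repo"
mkdir "$work/out"

# 'git diagnose' needs Git 2.38 or later; older versions reject the command.
if git -C "$work/repo" diagnose --output-directory "$work/out" --suffix sketch >/dev/null 2>&1; then
    result=archive
else
    result=no-archive
fi

echo "result: $result"
# With a suffix that has no strftime specifiers, the name is fixed.
ls "$work/out"
```

Passing a real strftime format such as `%Y%m%d` instead of a literal suffix puts the current local date into the archive name.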

It is integrated with Git for Windows 2.38 (Oct. 2022)

With Git 2.39 (Q4 2022), 'scalar reconfigure -a' is taught to automatically remove scalar.repo entries which no longer exist.

See commit a90085b (10 Nov 2022), and commit c90db53 (07 Nov 2022) by Johannes Schindelin (dscho).
^{(Merged by Junio C Hamano -- gitster -- in commit 58d80df, 23 Nov 2022)}

scalar reconfigure -a: remove stale scalar.repo entries

^{Signed-off-by: Johannes Schindelin}
^{Signed-off-by: Taylor Blau}

Every once in a while, a Git for Windows installation fails because the attempt to reconfigure a Scalar enlistment failed because it was deleted manually without removing the corresponding entries in the global Git config.

In f5f0842 ("scalar: let 'unregister' handle a deleted enlistment directory gracefully", 2021-12-03, Git v2.35.0-rc0 -- merge listed in batch #4), we already taught scalar delete to handle the case of a manually deleted enlistment gracefully.
This patch adds the same graceful handling to scalar reconfigure --all.

With Git 2.40 (Q1 2023), "scalar clone" learned to show a progress bar.

See commit 4433bd2 (11 Jan 2023) by ZheNing Hu (adlternative).
^{(Merged by Junio C Hamano -- gitster -- in commit ebed06a, 23 Jan 2023)}

scalar: show progress if stderr refers to a terminal

^{Signed-off-by: ZheNing Hu}
^{Acked-by: Derrick Stolee}

Sometimes when users use scalar to download a monorepo with a long commit history, they want to check the progress bar to know how long they still need to wait during the fetch process, but scalar suppresses this output by default.

So let's check whether scalar's stderr refers to a terminal: if so, show progress, otherwise disable it.

With Git 2.43 (Q4 2023), scalar includes an option to not use the src/ folder.

See commit f9a547d, commit 26ae8da, commit 4527db8 (28 Aug 2023) by Derrick Stolee (derrickstolee).
^{(Merged by Junio C Hamano -- gitster -- in commit 19cb1fc, 29 Aug 2023)}

scalar: add --[no-]src option

^{Signed-off-by: Derrick Stolee}

Some users have strong aversions to Scalar's opinion that the repository should be in a 'src' directory, even though this creates a clean slate for placing build artifacts in adjacent directories.

The new --no-src option allows users to opt out of the default behavior.

While adding options, make sure the usage output by 'scalar clone -h' reports the same as the SYNOPSIS line in Documentation/scalar.txt.

scalar now includes in its man page:

--[no-]src

By default, scalar clone places the cloned repository within a <enlistment>/src directory. Use --no-src to place the cloned repository directly in the <enlistment> directory.
