How does repo init --reference actually help reduce the build space

Question

We are working on Android-S and saw that each CI user uses nearly 385 GB of space on their local FS when doing a "repo init" and "repo sync" operation. We wanted to optimize the storage and the network speed, and from the man pages we saw that this can be achieved using the --reference option.

As per the documentation:

The --reference option can be used to point to a directory that has the content of a --mirror sync. This will make the working directory use as much data as possible from the local reference directory when fetching from the server. This will make the sync go a lot faster by reducing data traffic on the network.

First Clone:

We followed the steps and created the first reference copy and saw that the initial space occupied was close to 385 GB.

User-1: The first user then ran repo init and repo sync. The total time for this activity was only 23 minutes, and the size also dropped considerably, to 63 GB.

User-2: The second user then ran repo init and repo sync. The total time was again only 23 minutes, and the size was again 63 GB.

I do see the network performance improvement, but I am wondering how the size went from 385 GB to 63 GB, what is actually in those 63 GB, and what the real concept behind the --reference option is with respect to the space reduction.

Command used:

export Mirror="/data/Android-s"

repo init -u ssh://$US...@android1.test.com:29418/android/manifest -b tmainline -m t-r-mainline.xml --repo-url=ssh://android1.test.com:29418/android1/repo --repo-branch=test-stable --no-repo-verify --reference=$Mirror

Any leads or documentation on how the space is being reduced would be really helpful, as would pointers on whether this can lead to issues during the build and what precautions to take when using this option.

Thank you, Anish

Related: [What are the differences between git clone --shared and --reference?](https://stackoverflow.com/q/23304374/295004) — Morrison Chang, Nov 24 '21 at 07:49
This is just the documentation but does not actually explain what is getting copied and how the space is getting reduced........What i would like to know is what is the criteria for it to get reduced and what exactly is getting copied which shows 63 GB of FS — anish anil, Nov 24 '21 at 10:16
You may want to read: [10.1 Git Internals - Plumbing and Porcelain](https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain) for lower level details of how git works. — Morrison Chang, Nov 25 '21 at 23:52

ElpieKay · Answer 1 · 2021-11-29T01:54:32.163

repo sync is roughly equivalent to git fetch && git checkout.

First, it fetches the revision (if it's a ref) or the upstream (if the revision is a sha1 value) specified in the manifest. The ref points to a commit, and the commit links to its parent commit(s), recursively back to the root commit(s). Each commit refers to a tree object. A tree object refers to other trees, blobs and commits. All of these git objects and refs are git metadata. They are packed and transferred from the remote repository. Among them, the blobs take the most space. The transfer takes time.

Then, after all the repositories finish fetching the necessary data, the specified revision (if it's a sha1 value) or the head of the revision (if it's a ref) is checked out. The checkout also takes time.

To reduce the time, we could 1) improve the network performance; 2) improve the I/O performance; 3) reduce the fetched data size. In most cases we can do little about the 1st and the 2nd. As for the 3rd, repo has provided some options for us.

repo init -g <groups> instructs repo to download only the repositories specified by -g. Sometimes we don't need all the repositories.

repo sync -c instructs repo to download only the current ref specified by revision or upstream. If a repository has a number of parallel branches or tags, -c greatly reduces the amount of data fetched. The same behaviour can be set with sync-c in the manifest.

repo init --depth=<n> instructs repo to do a shallow clone/fetch. It fetches history only to a depth of n commits and thus reduces the number of related objects. The depth can also be specified by clone-depth in the manifest. Note that when revision is a sha1 value and upstream is a branch, a shallow clone/fetch can fail if the revision is not within depth n of the branch head.
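As a rough sketch of how the three options above look on the command line: the manifest URL, branch, and group name below are placeholders I made up for illustration, and the script only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed. Set RUN=1 to execute.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

MANIFEST_URL="ssh://git.example.com/android/manifest"  # hypothetical URL
BRANCH="main"                                          # hypothetical branch

# -g: only projects in the listed manifest groups ("default" is just an example).
# --depth=1: shallow history, one commit deep.
run repo init -u "$MANIFEST_URL" -b "$BRANCH" -g default --depth=1

# -c: fetch only the ref named by revision/upstream in the manifest.
run repo sync -c
```

These can be combined with --reference; they cut down what is needed, while --reference cuts down what has to be downloaded and stored again.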

Compared to --reference, all of the above are minor tricks. To reduce the fetched data size, one idea is to exclude as much unnecessary data as possible, as the above options do; another is to reuse as much already-fetched data as possible. This is what --reference does. To exclude unnecessary data, we have to think carefully and decide which data are unnecessary, which is quite exhausting. It's much more comfortable to reuse the existing data.

The mirror is a group of existing repositories accessible from the local disk. They hold a lot of data, some of which is necessary for a future repo sync and some of which is not, but we don't care. When syncing with a reference to the mirror, git fetch reuses the existing data, and the server only packs and sends the objects and refs that don't exist in the mirror. By and large, the more data the mirror holds, the less data repo sync fetches.

In development, we may have multiple users who log in to the same machine with different usernames, and there could be multiple machines for more users. We can mount the mirror on all these machines, so that all users (including the CI/CD bot users) can use it.

We need to consider the I/O performance to decide how many mirrors should be created and how many repositories a mirror should have. A backup mirror is necessary: if the mirror gets corrupted, every repo workspace that references it stops working. We can regularly update the mirrors so that they always hold as much data, and as recent data, as possible. The mirrors can also be used as the data sources for some query services, like an API that gets the changed files of a commit.
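A minimal sketch of the mirror workflow described above, under some assumptions: the function names, manifest URL, and branch are hypothetical, and the functions are only defined here, not called.

```shell
#!/bin/sh
# Hypothetical placeholders, not the real server or branch.
MANIFEST_URL="ssh://git.example.com/android/manifest"
BRANCH="main"

# Create the mirror on first use, then bring it up to date.
# Meant to run periodically (for example from cron) before CI jobs sync.
update_mirror() {
  mirror_dir=$1
  mkdir -p "$mirror_dir"
  (
    cd "$mirror_dir" || exit 1
    [ -d .repo ] || repo init -u "$MANIFEST_URL" -b "$BRANCH" --mirror
    repo sync
  )
}

# Initialise and sync a workspace that borrows objects from the mirror.
new_workspace() {
  workspace_dir=$1
  mirror_dir=$2
  mkdir -p "$workspace_dir"
  (
    cd "$workspace_dir" || exit 1
    repo init -u "$MANIFEST_URL" -b "$BRANCH" --reference="$mirror_dir"
    repo sync -c
  )
}
```

With the path from the question, usage would be something like update_mirror /data/Android-s followed by new_workspace "$HOME/ws" /data/Android-s. Updating the mirror before the workspaces sync keeps the per-workspace download small.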

Git (and the repo tool too) is smart enough to fetch only the missing data that it needs. If it finds some of the data in the mirror, it won't ask the server to repack and resend that redundant data. Conversely, whatever it can't find in the mirror, it asks the server to pack and send.
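The underlying git mechanism can be tried out with plain git and throwaway local repositories. This is only a sketch: the "server", "mirror" and "workspace" names are made up, and it uses git clone --reference directly rather than repo.

```shell
#!/bin/sh
# Demonstrates git "alternates" with temporary local repositories.
set -eu

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# 1. A stand-in "server" repository with one commit (made-up content).
git init -q "$tmp/server"
echo "hello" > "$tmp/server/file.txt"
git -C "$tmp/server" add file.txt
git -C "$tmp/server" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial commit"

# 2. A local mirror of the server, playing the role of /data/Android-s.
git clone -q --mirror "$tmp/server" "$tmp/mirror.git"

# 3. A workspace clone that borrows objects from the mirror.
#    The file:// URL forces a real fetch instead of a local copy.
git clone -q --reference "$tmp/mirror.git" "file://$tmp/server" "$tmp/workspace"

# The workspace records the mirror's object directory in this file.
alternates=$(cat "$tmp/workspace/.git/objects/info/alternates")
echo "alternates: $alternates"

# Shows how many objects the workspace stores itself, plus the alternate path.
git -C "$tmp/workspace" count-objects -v
```

The workspace checks out file.txt even though its own object store holds little or nothing; the objects are read from the mirror through the alternates file. As far as I understand, repo sets up the same kind of link for each project when --reference is given, which is why a workspace can be much smaller than the mirror: it holds mostly the checked-out files plus whatever objects the mirror lacks. It is also why a workspace breaks if the mirror is deleted or corrupted.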

Say we are going to sync 4 identical repo workspaces, A, B, C and D. If we don't use a reference mirror, the occupied storage size is 480G.

A git metadata 100G + A checkout 20G
B git metadata 100G + B checkout 20G
C git metadata 100G + C checkout 20G
D git metadata 100G + D checkout 20G

The contents of the 4 workspaces are the same. If we keep just one copy of the metadata and share it with the other 3, we can save 300G. This is what --reference does. With the help of --reference=/path/to/mirror, the sizes shrink a lot. To demonstrate it, we assume the mirror is a bit smaller, with only 80G of the metadata. Each workspace needs to fetch the missing 20G of data and store it itself. The total is now 80G * 3 smaller, down from 480G to 240G.

Mirror 80G + A git metadata 20G + A checkout 20G
           + B git metadata 20G + B checkout 20G
           + C git metadata 20G + C checkout 20G
           + D git metadata 20G + D checkout 20G

As each workspace fetches a lot less data (80G less per workspace), the time cost is lowered too, and so is the storage cost of the metadata. As each checkout has its own purpose and they all need to exist at the same time, we can hardly reduce the checkout cost. But for some of the repositories, we may use LFS or sparse checkout to optimize the cost further.

If we update the mirror first, the mirror now has all the necessary data. We can save more. The total size shrinks further to only 180G.

Mirror 100G + A git metadata 0G + A checkout 20G
            + B git metadata 0G + B checkout 20G
            + C git metadata 0G + C checkout 20G
            + D git metadata 0G + D checkout 20G
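The three totals above can be checked with a little shell arithmetic. The variable names are mine; the sizes (in G) are the ones assumed in this example.

```shell
#!/bin/sh
# Sizes in G, taken from the example above.
workspaces=4
metadata=100        # full git metadata per workspace
checkout=20         # checked-out files per workspace
partial_mirror=80   # metadata held by the slightly outdated mirror

# No mirror: every workspace stores all metadata plus its checkout.
no_reference=$(( workspaces * (metadata + checkout) ))

# 80G mirror: each workspace stores only the missing metadata plus its checkout.
partial=$(( partial_mirror + workspaces * (metadata - partial_mirror + checkout) ))

# Up-to-date mirror: each workspace stores only its checkout.
full=$(( metadata + workspaces * checkout ))

echo "without --reference:       ${no_reference}G"
echo "with an 80G mirror:        ${partial}G"
echo "with an up-to-date mirror: ${full}G"
```

The checkout term is the part that never shrinks, which is why it dominates once the mirror is complete.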

The number of repositories in the mirror and their data size can vary with our needs. We can always find a balance. In AOSP development, we may have different workspaces which are composed of different repositories. Workspace A has repositories P1, P2, P3. B has P2, P3, P4. C has P1, P2, P3, P4. D has P3, P4, P5. It's okay to define the mirror as P1, P2, P3, P4, P5, or just P2, P3, P4, or other sets.

If we choose the set of P1, P2, P3, P4 and P5, the sample mirror size could be bigger than 100GB. But compared to the saved cost of storage and time, it's still cost-efficient. The worst case is that you only have one workspace, which costs almost the same with or without the reference mirror. Generally, with the help of the reference mirror, the more workspaces there are, the more cost is saved.

It is a Nice Read.............But still it fails to help me understand why the space of the user machine is 5x Times reduced when compared to the --reference clone disk space. :-( — anish anil, Nov 25 '21 at 12:26
Awesome ............You are too Good. I had to do couple of reads to get to the depth of it but Honestly the kind of information provided is so indepth and great. You rock :-) — anish anil, May 31 '22 at 05:46
