`repo sync` is roughly equivalent to `git fetch && git checkout`.
First, it fetches the `revision` (if it's a ref) or the `upstream` (if the `revision` is a sha1 value) specified in the manifest. The ref points to a commit; the commit links to its parent commit(s), and so on recursively back to the root commit(s). Each commit refers to a tree object, and a tree object refers to other trees, blobs and (for submodules) commits. All of these git objects and refs are git metadata. They are packed and transferred from the remote repository, and among them the blobs take the most space. The transfer takes time.
Then, after all the repositories finish fetching the necessary data, the specified `revision` (if it's a sha1 value) or the head of the `revision` (if it's a ref) is checked out. The checkout also takes time.
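Conceptually, the per-project work boils down to something like the following sketch (the remote name `aosp` and the branch `main` are assumptions for illustration):

```sh
# Rough per-project equivalent of `repo sync`:
git fetch aosp refs/heads/main       # transfer the missing commits, trees and blobs
git checkout --detach FETCH_HEAD     # check out the fetched head (repo uses detached HEADs)
```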
To reduce the total time, we could 1) improve the network performance; 2) improve the I/O performance; 3) reduce the amount of fetched data. In most cases we can do little about the first two. As for the third, `repo` provides some options.
`repo init -g <groups>` instructs `repo` to download only the repositories in the groups specified by `-g`. Sometimes we don't need all the repositories.
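For example (the group names here are illustrative; the actual groups are defined in the manifest):

```sh
# Initialize a workspace that syncs only the projects in the chosen groups.
repo init -u https://android.googlesource.com/platform/manifest -g default,tools
repo sync
```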
`repo sync -c` instructs `repo` to download only the current ref specified by `revision` or `upstream`. If a repository has many parallel branches or tags, `-c` saves a lot of data. The same behavior can be enabled with the `sync-c` attribute in the manifest.
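Both forms are sketched below; the manifest fragment is illustrative:

```sh
# Fetch only the ref each project's revision points at, not all heads and tags.
repo sync -c

# The equivalent manifest setting (an attribute on <default> or <project>):
#   <default remote="aosp" revision="main" sync-c="true" />
```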
`repo init --depth=<n>` instructs `repo` to do a shallow clone/fetch. It fetches only the last n commits of history and thus reduces the number of related objects. The depth can also be set per project with the `clone-depth` attribute in the manifest. Note that when `revision` is a sha1 value and `upstream` is a branch, a shallow clone/fetch can fail if the revision is not within n commits of the branch head.
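For instance, a depth-1 sync, with the per-project manifest equivalent shown as a comment (the project name is illustrative):

```sh
# Shallow-fetch only the latest commit of each project's history.
repo init -u https://android.googlesource.com/platform/manifest --depth=1
repo sync

# Per-project equivalent in the manifest:
#   <project name="platform/build" path="build" clone-depth="1" />
```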
Compared to `--reference`, all of the above are minor tricks. To reduce the fetched data size, one idea is to exclude as much unnecessary data as possible, which is what the options above do; another is to reuse already-fetched data as much as possible, which is what `--reference` does. Excluding unnecessary data means carefully deciding which data are unnecessary, which is quite exhausting. Reusing existing data is much more comfortable.
The mirror is a group of existing repositories accessible from the local disk. They hold a lot of data; some of it will be needed by a future `repo sync` and some of it won't, but we don't have to care. When syncing with a reference to the mirror, `git fetch` reuses the existing data, and the server only packs and sends the objects and refs missing from the mirror. By and large, the more data the mirror holds, the less data `repo sync` fetches.
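A typical setup might look like this (the paths are illustrative):

```sh
# One-time: build a local mirror of every project (bare repositories).
mkdir -p /mirrors/aosp && cd /mirrors/aosp
repo init -u https://android.googlesource.com/platform/manifest --mirror
repo sync

# Per workspace: borrow objects from the mirror; only missing data goes over the network.
mkdir -p ~/ws/A && cd ~/ws/A
repo init -u https://android.googlesource.com/platform/manifest --reference=/mirrors/aosp
repo sync
```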
In development, multiple users may log in to the same machine with different usernames, and there may be multiple machines for more users. We can mount the mirror on all of these machines so that every user (including the CI/CD bot users) can use it.
We need to consider I/O performance when deciding how many mirrors to create and how many repositories each mirror should hold. A backup mirror is necessary: if a mirror becomes corrupted, all the repo workspaces that reference it are disabled. We can update the mirrors regularly so that they always hold as much and as fresh data as possible. The mirrors can also serve as data sources for query services, such as an API that returns the files changed by a commit.
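Keeping the mirrors fresh can be as simple as a scheduled sync; a hypothetical cron entry (the path is illustrative):

```sh
# Refresh the mirror nightly at 02:00 so workspaces have as little as possible to fetch.
0 2 * * *  cd /mirrors/aosp && repo sync -q
```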
Git (and the repo tool) is smart enough to fetch only the missing data it needs. If it finds some of the data in the mirror, it won't ask the server to repack and resend it; whatever it can't find, it asks the server to pack and send.
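Under the hood, this reuse works through git's alternates mechanism: with `--reference`, each project's object database points at the mirror, so git treats the mirror's objects as already present. One way to see this (the paths below are illustrative):

```sh
# The alternates file lists extra object stores git may read from.
cat .repo/project-objects/platform/build.git/objects/info/alternates
# -> /mirrors/aosp/platform/build.git/objects
```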
Say we are going to sync four identical repo workspaces, A, B, C and D. If we don't use a reference mirror, the occupied storage is 480G:
A git metadata 100G + A checkout 20G
B git metadata 100G + B checkout 20G
C git metadata 100G + C checkout 20G
D git metadata 100G + D checkout 20G
The contents of the four copies are the same. If we keep just one of them and share it with the other three, we can save 300G. This is what `--reference` does. With the help of `--reference=/path/to/mirror`, the sizes shrink a lot. To demonstrate, assume the mirror is a bit smaller, holding only 80G of the metadata. Each workspace then fetches the missing 20G of data and stores it itself. Now the total is 80G * 3 smaller, down from 480G to 240G:
Mirror 80G + A git metadata 20G + A checkout 20G
+ B git metadata 20G + B checkout 20G
+ C git metadata 20G + C checkout 20G
+ D git metadata 20G + D checkout 20G
As the fetched data get a lot smaller (80G less per workspace), the time cost drops too, and so does the storage cost of metadata and checkouts. Since each checkout has its own purpose and all of them need to exist at the same time, we can hardly reduce the checkout cost. But for some of the repositories, we may use LFS or sparse checkout to optimize it further.
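For example, git's built-in sparse checkout (git 2.25+) can shrink a large project's working tree to just the directories we need (the directory names are illustrative):

```sh
cd build                               # one project inside the workspace
git sparse-checkout init --cone        # enable cone-mode sparse checkout
git sparse-checkout set core tools     # keep only these directories in the worktree
```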
If we update the mirror first, it holds all the necessary data and we can save even more. The total size shrinks further, to only 180G:
Mirror 100G + A git metadata 0G + A checkout 20G
+ B git metadata 0G + B checkout 20G
+ C git metadata 0G + C checkout 20G
+ D git metadata 0G + D checkout 20G
The number of repositories in the mirror and their data size can vary with the needs; we can always find a balanced point. In AOSP development we may have different workspaces composed of different repositories: workspace A has repositories P1, P2, P3; B has P2, P3, P4; C has P1, P2, P3, P4; D has P3, P4, P5. It's fine to define the mirror as P1, P2, P3, P4, P5, or just P2, P3, P4, or some other set.
If we choose the set of P1, P2, P3, P4 and P5, the mirror in this example could grow bigger than 100GB, but compared to the saved storage and time it's still cost-efficient. The worst case is having only one workspace, which costs almost the same with or without the reference mirror. In general, with the help of a reference mirror, the more workspaces there are, the more cost is saved.