unable to migrate objects to permanent storage

Question

I have read several explanations of this Git failure message but, honestly, I was not able to understand them. Though I solved the problem for my purposes, I'd like to understand why Git handles permissions on the repository server the way it does, as far as I have explored it:

For some reason git needs an objects folder in the repository on its ssh server where users push or pull. For some reason (?) git creates folders below objects with random names in the range of 01 to ff.

The problem is that the folders below objects are owned by whichever user pushes a new revision to that repository, with access permissions of 775; that is, only that user and users in their group can write to that folder.

As the repository is used for a while, the 255 possible names collide. A folder name below objects will have to be reused in the course of a push. If the pushing user is not the same as the one who created the folder (e.g. weeks ago), they will see an unable to migrate objects to permanent storage failure message because of the mentioned permission violation.

A solution is for the pushing user to become a member of the group of the user who last pushed using the same objects folder name.

Is there any mechanism for the git ssh server to avoid such a conflict in a more convenient way? How can I avoid having to add a new user to all the groups of all existing users in order to prepare avoiding permission conflicts for objects folders?

[`git config core.sharedRepository 0660`](https://git-scm.com/docs/git-config#Documentation/git-config.txt-coresharedRepository); https://stackoverflow.com/search?q=%5Bgit%5D+sharedrepository — phd, Nov 27 '20 at 10:03
@phd great! Comprehension! If the default is non-shared, why can many users push to the repo until this confusing error message happens? Did I do something wrong initially? What I'd like to do is just add the pub-key of a new user to authorized_keys of the git-user. — ngong, Nov 27 '20 at 16:28
I suspect you did `chmod -R g+w`. Actually you need this one more time. After setting `sharedRepository` Git will support group-writeability itself. — phd, Nov 27 '20 at 18:45

score 1 · Answer 1 · answered Nov 27 '20 at 09:51

I can easily think of 2 ways:

Use a single user for all developers and use keys to authenticate them. Then there is no need to share a password and they can all live happily ever after.

Another approach would be using setgid on the repo so that the group of any file created inside the repo remains consistent. Then it becomes a question of keeping developers in the correct group and making sure that their umask setting will allow rw at the group level when a user creates a new file/object in a repo. https://en.m.wikipedia.org/wiki/Setuid

score 1 · Accepted Answer · answered Nov 27 '20 at 21:00

Between phd's comments about setting modes, and eftshift0's answer, you have some practical approaches for dealing with this already. Here's the theory that backs up these practical answers.

For some reason git needs an objects folder in the repository on its ssh server where users push or pull. For some reason (?) git creates folders below objects with random names in the range of 01 to ff.

It's actually 00 through ff. There's something a bit odd going on in your case; we'll explore this in a bit.

The first thing to realize is that Git doesn't store files. What Git stores, in its main database—we'll come back to this main word later—is objects. These objects have hash IDs: names of the form faefdd61ec7c7f6f3c8c9907891465ac9a2a1475, for instance. The hash IDs you commonly see—though often abbreviated as, e.g., faefdd61e for instance—are those of commit objects, but there are in fact four object types. The first one is of course the commit; the remaining three are tree, blob, and annotated tag.

File contents go into the blob objects. File names get divided into name components, in the familiar directory-and-file-name style from Unix/Linux systems, by slashes; these name components, plus additional information as needed, go into tree objects; and a commit object then refers to a tree object to hold the data—the files—for the commit, in Git's compressed and de-duplicated object-store form. Annotated tag objects exist so that annotated tags can store data as well as a commit hash ID (or any other object hash ID, though it's unusual to have an annotated tag object that points to anything other than a commit object).

Hence, the main database of any Git repository is this object database. Objects themselves can be stored either as loose objects or their opposite: packed objects (not tight objects, although the packing does pack them pretty tightly). Packed objects are stored in a pack file, and the pack files live in the objects directory under a subdirectory named pack. Your .git/objects/pack should contain one or more *.pack files, each of which also has a corresponding *.idx file. We'll come back to pack files in a bit.

Loose objects are stored with each object in a stand-alone file-system-level file. The object's name might be dd1cf41e007a0036e18eef4b0acae505ec52f168. If this is to be stored as a loose object, rather than a packed one, its file system level name will be dd/1cf41e007a0036e18eef4b0acae505ec52f168. We simply take the first two characters of the hexadecimal expansion of the hash ID off the front and use them as a directory name, and use the remaining characters as the file name.
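As a minimal sketch of that naming rule (the function name `loose_object_path` is made up for illustration and is not part of Git), the split can be written as:

```python
def loose_object_path(hash_id: str) -> str:
    """Return the path of a loose object, relative to the objects directory."""
    if len(hash_id) < 3 or any(c not in "0123456789abcdef" for c in hash_id):
        raise ValueError(f"not a lowercase hexadecimal hash ID: {hash_id!r}")
    # The first two hex characters name the directory; the rest name the file.
    return f"{hash_id[:2]}/{hash_id[2:]}"


print(loose_object_path("dd1cf41e007a0036e18eef4b0acae505ec52f168"))
# prints dd/1cf41e007a0036e18eef4b0acae505ec52f168
```

Two hexadecimal characters give 16 × 16 = 256 possible directory names, 00 through ff, which is why the same directory is eventually reused by different pushers.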

The choice of two characters here has to do with the expected "fluffiness" and "fullness" of the loose-objects directories, and the performance (or lack thereof) of the original Linux file systems when using directories with a lot of files in them. If all loose objects were dropped into a single directory, that directory would accumulate about two to six thousand files before Git would "pack" the objects. The choice of how many loose files to leave is complicated and included at least a little guesswork, plus file activity patterns from the early 2000s, so these numbers don't necessarily all make sense today, but that's what Linus Torvalds did at the time, and it remains in place because it works well.¹

When users run git push (but not git pull), their Git calls up some other Git. Their Git reads their Git repository. The server Git reads, and writes, the server's repository. Their Git figures out which commit objects they have, that the server lacks, and sends over these commit objects. The two Gits coordinate and the sending Git can also figure out what other objects are required.
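The "which commits do they have that the server lacks" step can be pictured as a set difference over the commit graph. The sketch below is only an illustration under simplifying assumptions: the `parents` mapping and the name `commits_to_send` are hypothetical, and the real exchange is a protocol negotiation between the two Gits rather than one function call.

```python
from typing import Dict, Iterable, List, Set


def reachable(parents: Dict[str, List[str]], tips: Iterable[str]) -> Set[str]:
    """Collect every commit reachable from the given tips via parent links."""
    seen: Set[str] = set()
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        # Commits missing from the mapping are treated as having no parents.
        stack.extend(parents.get(commit, []))
    return seen


def commits_to_send(parents: Dict[str, List[str]],
                    local_tip: str,
                    remote_tips: Iterable[str]) -> Set[str]:
    """Commits reachable from the pushed tip that the receiver cannot reach."""
    return reachable(parents, [local_tip]) - reachable(parents, remote_tips)


# Hypothetical history: A <- B <- C, where the server already has A.
history = {"A": [], "B": ["A"], "C": ["B"]}
print(sorted(commits_to_send(history, "C", ["A"])))
# prints ['B', 'C']
```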

Once the sender has the list of all objects that are required, it will normally gather all of these objects and write out what Git calls a thin pack. A thin pack is a pack that violates one of the normal constraints of a pack file, so now it's time to describe what a pack file is for.

Pack files use delta compression to reduce the need for disk space, and the delta-compression works best when the packs are generated with a batch of files at a time. (This also feeds into the calculation of when to turn a collection of loose objects into a pack.) Note that loose objects are merely zlib-compressed, not delta-compressed, so at the object level Git does not use delta-compression. This also means a pack file is often considerably smaller than the set of loose objects that it contains.

For a simple example, suppose the very first commit in a repository has a fairly big file (several megabytes or whatever: for concreteness, say it's 10 MiB). Subsequent commits either add a little bit to the file, or take a little bit away from it. Git must initially store the new commits with a new loose object that is also about 10 MiB, to store the slightly-different content. So each commit that modifies this big file adds 10 MiB to the repository.

Once we can pack the objects, though, we can pick one of these objects—probably the most recent copy of the file, as it's the one we are most likely to check out—and store that one in full, and then store other versions of the file as instruction sequences: start with the big file, then remove 140 bytes at the end for instance.² The deltas can use multiple objects via multiple instruction sequences, and can refer to objects that are themselves stored using delta instructions, as long as the graph of objects used in these constructions is not circular. The end result, of course, is that if we have 50 copies of the 10 MiB file, each slightly different, the pack file holds just the 10 MiB file plus about 49 short modifiers.

The objects used to construct the final objects are called delta bases. As we just noted, a delta-compressed object can itself be a delta-base. A chain of deltas is called a delta chain and decompressing such an object involves a bit of recursion. As long as the pack file is well-formed, the recursion is never infinite, so that's fine; and we can use techniques like memoization to make this go reasonably fast, if needed.
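To make the delta-chain idea concrete, here is a small sketch. It assumes a toy in-memory store and two made-up instruction shapes, `("copy", offset, length)` and `("insert", literal)`; this is not Git's binary pack encoding, and the names `apply_delta` and `resolve` are hypothetical.

```python
from typing import Dict, Optional


def apply_delta(base: bytes, instructions) -> bytes:
    """Rebuild an object from its delta base and a list of instructions."""
    out = bytearray()
    for ins in instructions:
        if ins[0] == "copy":
            _, offset, length = ins
            out += base[offset:offset + length]
        elif ins[0] == "insert":
            out += ins[1]
        else:
            raise ValueError(f"unknown instruction: {ins!r}")
    return bytes(out)


def resolve(obj_id: str, store: Dict[str, object],
            cache: Optional[Dict[str, bytes]] = None) -> bytes:
    """Follow the delta chain recursively, memoizing each resolved object."""
    if cache is None:
        cache = {}
    if obj_id in cache:
        return cache[obj_id]
    entry = store[obj_id]
    if isinstance(entry, bytes):
        data = entry                      # stored in full
    else:
        base_id, instructions = entry     # stored as a delta against base_id
        data = apply_delta(resolve(base_id, store, cache), instructions)
    cache[obj_id] = data
    return data


# v3 is stored in full; v2 is a delta against v3; v1 is a delta against v2.
store = {
    "v3": b"hello world, this is the big file\n",
    "v2": ("v3", [("copy", 0, 11), ("insert", b"!\n")]),
    "v1": ("v2", [("copy", 0, 5), ("insert", b"\n")]),
}
print(resolve("v1", store))
# prints b'hello\n'
```

The recursion terminates because the chain v1 → v2 → v3 ends at an object stored in full; a circular chain would never bottom out, which is why a well-formed pack must not contain one.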

In any case, the normal constraint on a pack file is that it should contain every object that is needed to reconstruct the final object. A thin pack is one where we allow the sending Git to assume that the receiving Git already has some objects, and use those objects as delta bases without including them in the pack. So a thin pack can be very small indeed: it's ideal for transmission across a network connection.³

The result is that git push normally sends a thin pack. The receiving Git should take this thin pack and "fix it" to make a regular pack. No loose objects are created during this process. The fact that you're getting loose objects indicates that your pushes are not using thin packs. This isn't wrong, but you might investigate why this is the case.

¹These files are all written once, then never touched again, except to be removed after being packed into a pack file. (They don't have to be removed, but that's the normal action.) You can also explode a pack file into individual objects.

Note that all Git objects are completely read-only, because their hash ID names are constructed by hashing the contents of the object file. Each file begins with a header giving the object's type—one of the four object types—and size, and the type-and-size-bytes are included in the hash ID, which fortuitously protected Git from the original SHAttered attack (see How does the newly found SHA-1 collision affect Git?). Still, the hash algorithm will eventually be upgraded to a more resistant one. This transition will be an interesting time, in the same sense that 2020 has been an interesting year.

²The actual encoding is, I think, composed of just two instructions: "take n bytes from offset o of object obj" and "insert literal byte sequence S", but one can imagine any kind of instructions here. They're all more or less equivalent. One can add extra instructions, such as "copy n bytes from offset once" vs "copy n bytes from offset, repeating r times", or require the copy operation to specify the number of copies to make, or whatever, but these are all just small tweaks. A richer instruction set generally offers more compression opportunities, at the cost of more-complex code to find a minimal compression, and a larger encoded-instruction format.

³The operating assumption here is that CPU is cheap and network bandwidth is expensive.
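Footnote 1 mentions that the type-and-size header is part of what gets hashed. As a rough sketch for SHA-1 repositories (the helper name `git_object_id` is made up here), the hash ID of an object can be reproduced like this:

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Hash the header "<type> <size>" plus a NUL byte, followed by the content."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_object_id("blob", b""))
# prints e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

That value is the well-known ID of the empty blob, which `git hash-object` reports for an empty file in a SHA-1 repository.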

Finishing up

We begin with a git push. This sends objects, usually as a thin pack. The receiving Git should store these objects, or this thin pack, somewhere: modern Gits use a quarantine area, and old Gits just dump them right into the object database.

Having sent the objects, the sending Git now sends a sequence of name updates. These affect the name database, which is the other primary database in a Git repository. The names stored in this database are branch names, tag names, remote-tracking names, and any other names that Git finds useful. A push normally sends one or more branch and/or tag name update requests.

The receiving Git is allowed to inspect and verify these requests, using the objects that were received (and maybe quarantined) to vet everything. If the vetting passes—if there is no vetting, it just automatically passes—the receiving Git then inspects the name updates. Branch name updates must either be forced, with the --force or + flag in the git push command, or else be fast-forward operations or new names.

A fast-forward operation leaves the name in a position such that, by following the commit graph backwards, the commit identified by the previous position is reachable from the new position. In other words, the receiving Git might get a request to update branch name br1. The new commit identified by the updated name must be a descendant of the commit currently found via the name br1.
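The reachability test described here can be sketched as a walk over parent links. The `parents` mapping and the name `is_fast_forward` are assumptions made for illustration; Git performs this check internally on its own commit graph.

```python
from collections import deque
from typing import Dict, List


def is_fast_forward(parents: Dict[str, List[str]],
                    old_tip: str, new_tip: str) -> bool:
    """True if old_tip is reachable from new_tip by following parent links."""
    seen = set()
    queue = deque([new_tip])
    while queue:
        commit = queue.popleft()
        if commit == old_tip:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        queue.extend(parents.get(commit, []))
    return False


# Hypothetical history: A <- B <- C on one line, and A <- D on another.
graph = {"A": [], "B": ["A"], "C": ["B"], "D": ["A"]}
print(is_fast_forward(graph, "B", "C"))  # prints True: C descends from B
print(is_fast_forward(graph, "C", "D"))  # prints False: D does not contain C
```

In the second case, moving br1 from C to D would drop commits B and C from the branch, which is why such an update needs --force.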

If all is OK with the name update, and all is permitted via the pre-receive and update hooks (if any) that did the vetting (if any), the receiving Git accepts the update and fixes the thin pack, or otherwise moves the objects out of quarantine. This is when you'd get new .git/objects/ directories created, if necessary.

The Git that's doing the receiving creates these directories with mkdir system calls. These use both the umask of the Git process doing the mkdir, and the permissions supplied to the mkdir call. The ownership of the new directories is set by the OS's rules: the group owner might be the group ID of the process, or it might be the group ID of the parent directory. Using the set-group-ID trick is a fairly standard way on Unix and Linux systems to tell the OS to set the group-ID of the new directory based on the group-ID of the containing directory.
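The interaction between the requested mode and the umask can be sketched as follows. This assumes a POSIX system and assumes the caller asks for mode 0777; the helper names are hypothetical, and file systems with default ACLs may behave differently.

```python
import os
import stat
import tempfile


def effective_mode(requested: int, umask: int) -> int:
    """Permission bits that survive after the umask clears its bits."""
    return requested & ~umask & 0o7777


def mkdir_with_umask(parent: str, name: str, requested: int, umask: int) -> int:
    """Create a directory under a temporary umask; return its permission bits."""
    old = os.umask(umask)  # the umask is process-wide, so restore it afterwards
    try:
        path = os.path.join(parent, name)
        os.mkdir(path, requested)
    finally:
        os.umask(old)
    # Report only the rwx bits; a set-group-ID bit inherited from the parent
    # directory is deliberately ignored here.
    return stat.S_IMODE(os.stat(path).st_mode) & 0o777


with tempfile.TemporaryDirectory() as tmp:
    print(oct(mkdir_with_umask(tmp, "ab", 0o777, 0o002)))
    print(oct(effective_mode(0o777, 0o022)))
```

With a umask of 002 the directory comes out group-writable (775, as seen in the question); with the common default umask of 022 it would be 755, and other members of the group could not add objects to it.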

If your Git is using pack files—as it generally should be—the main issue would be making sure that the .git/objects/pack directory and its contents have the right ownership and permissions. If your Git is using loose objects, figure out why, as well as looking into making sure that new directories here have the right ownership and permissions. These are all controlled by your OS; Git's role here is merely to set its umask and pass the right arguments to open and mkdir system calls.

Although lengthy, it was worth reading every word. Now I know what I am doing when playing around with access rights in remote repositories. Super! However, the last question remains: why does the default automatic git gc not work in my case? Doing it manually worked fine. — ngong, Nov 28 '20 at 09:08
Hm. Given your error message, I think you have a Git that is new enough to have the quarantine area; given the failure of `git gc --auto` to do anything, I wonder if you have a Git from the era of the bug where a `git gc --auto` fails (for any reason, which can include permissions issues) and that leaves behind a trace file that makes future `git gc --auto` operations fail. I *think* a manual gc in that case prints the error message, though. I'm not sure why you have seen this behavior. — torek, Nov 28 '20 at 14:37
I guess I was wrong: git gc will not run automatically by default on the remote ssh server, where only --bare repositories are hosted. And, as far as I found out, git gc is not as important for --bare repositories as it is for the client. Hope I got it now. — ngong, Nov 28 '20 at 17:35
Well, `git gc` *should* run automatically, but yes, as long as you're getting pack files rather than individual objects, it's not as needed. At some point though you build up a lot of pack files and a repack that compresses them down to a single pack file is a good idea. (Git is growing a new set of facilities to make this all work better right now, but none of it is in general use as far as I know.) — torek, Nov 28 '20 at 18:24
hmm - I am running git 2.17.1 on Ubuntu. My oldest repositories on that remote server are from April '19. They are all --bare. The only actions are push and pull from clients. Folders below objects are never deleted, unless I run git gc manually (on the server). Hence I conclude git gc is not running by default. How do I tell .gitconfig to run git gc --auto on any push? — ngong, Nov 29 '20 at 16:21
You mention that *folders below `objects` are never deleted* (by an implied auto gc): does the user running ssh have permission to *delete* from the `.git` directory? If not, whoever kicks off a push is also the one kicking off the auto-gc, and won't have permission to delete an empty subdirectory within the `.git/objects` directory. This isn't really harmful, it just means that the top level sub-directories accumulate even when they are empty. — torek, Nov 29 '20 at 17:07
