Warning: File is larger than recommended maximum filesize

Question

Is it possible to find the name of the file? The error is

Warning: File 9c9e8c2357f961122596db1ae70d19e1b168e7a7 is larger than recommended maximum filesize on the server

, while trying to push a Git repo to another server.

See https://stackoverflow.com/q/223678/1256452 (turn blob hash into commit and path) — torek, Jul 27 '19 at 17:43

score 1 · Answer 1 · answered Jul 27 '19 at 09:37

This object is likely located in the object store, i.e. .git/objects. If it is stored as a loose (unpacked) object, the first two hex digits, 9c, name the directory and the remaining 38 digits name the file: .git/objects/9c/9e8c2357f961122596db1ae70d19e1b168e7a7.
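As a rough illustration of that layout, here is a small Python sketch. The helper name loose_object_path is made up for this example, and it assumes the object is stored loose; Git also packs objects into files under .git/objects/pack, in which case no such path exists.

```python
import os

def loose_object_path(object_id, git_dir=".git"):
    """Return where a loose object with this hash ID would live (hypothetical helper)."""
    # First two hex digits are the directory, the remaining digits are the file name.
    return "/".join([git_dir, "objects", object_id[:2], object_id[2:]])

path = loose_object_path("9c9e8c2357f961122596db1ae70d19e1b168e7a7")
print(path)
print(os.path.exists(path))  # False when the object is packed or not in this repository
```

Note that the file at that path is Git's compressed copy of the contents; it does not record the name the file had in your work-tree.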

See also: https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

If you've got some time, it's well worth reading this to help you understand git: https://jwiegley.github.io/git-from-the-bottom-up/

score 0 · Answer 2 · edited Jun 20 '20 at 09:12

Your question may look like a duplicate of Which commit has this blob? but that particular question-and-answer assumes some understanding of Git's internal storage techniques. It might not make any sense without some background. Moreover, just finding the file might not do much good. If you already know what commits, trees, and blobs are, go straight to the accepted answer; otherwise, read on.

Git stores every version of every file, forever(ish)

Each commit contains—or more precisely, holds a reference to—a complete snapshot of every file you have told Git to keep. That is, you often look at a commit as a change, e.g., add a line to file a/b/c.py. If you run git show hash, that's how Git shows the commit, and that's what git log -p shows as well. But in fact, if you have files Makefile, README.md, a/main.py, a/b/lib.py, and a/b/c.py, each commit has a full and complete copy of each of these files.

If Git did this the obvious way—by actually making a new copy of each file every time—your Git repository would rapidly become ridiculously huge. So Git doesn't do it that way. Instead, Git makes a compressed, frozen, read-only, Git-only copy of each file. I like to call this format freeze-dried, because it stores well and lasts forever, and you (or Git) will later turn it back into a useful file by "rehydrating" it: decompressing and de-Git-izing the file.

The compression saves some space, but even more important is this read-only frozen-ness. The README.md file is frozen. It cannot be changed at all, not even by Git itself. That means that when Git makes the next commit, if you haven't actually changed README.md, Git can just re-use the existing frozen copy.

The freeze-dried file—which Git calls a blob—doesn't have an ordinary name. It has this weird hash-ID name. In your case, the blob hash name is 9c9e8c2357f961122596db1ae70d19e1b168e7a7. Another, separate part of the commit—something Git calls a tree object—contains a mapping from the name you will use, such as a/b/c.py, to the hash name that Git uses.¹ We'll see a bit more about this in a moment, though the real focus is not on trees, but on commits and their blobs.

Different commits that use the same contents—the same stuff to go into a/b/c.py—will share the underlying blob. There will be just the one 9c9e8c2357f961122596db1ae70d19e1b168e7a7 object, but now two commits use it. Make another thousand commits, or even a million new commits, and you still just have the one underlying blob. It's shared across every commit that uses that particular version of that particular file. You only get a new blob, with a new and different random-looking hash ID, if you change the contents of the file.
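To make the sharing concrete, here is a minimal sketch of how a blob's hash ID is derived, assuming a repository that uses SHA-1 (the default). The function name blob_id is just for illustration; the point is that the ID depends only on the contents, so identical contents always land on the same blob.

```python
import hashlib

def blob_id(content: bytes) -> str:
    """Hash ID of a blob: SHA-1 over a short 'blob <size>' header plus the content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Two commits that store identical contents refer to the same blob ID.
in_commit_1 = blob_id(b"hello world\n")
in_commit_2 = blob_id(b"hello world\n")
changed = blob_id(b"hello world!\n")

print(in_commit_1)                 # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(in_commit_1 == in_commit_2)  # True: one shared blob
print(in_commit_1 == changed)      # False: new contents, new blob
```

Nothing about the file's name or the commit goes into the hash, which is why the same blob can serve any number of commits.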

¹This is part of the reason Git gets really hard to use if you're on a case-insensitive file system, as Windows and macOS use by default, and you're working with a Linux user (who uses a case-sensitive file system, as Linux tends to have) who makes one file named README and another, different one named ReadMe. Your computer can't store both files using these two names. But Git is using weird blob hash names and tree-object data files instead of ordinary directories / folders. Git doesn't store folders at all, just files with tree-objects holding their names. So Git can and does store both files' contents in its internal database. But Git can't let you work on both files, because your case-insensitive file system won't let you have two files whose names differ only in case. The Linux guy has no problem, because his file system does let his Git do that.

Tree objects can be shared too. All object-sharing happens automatically in Git, via Git's clever hash name mechanism—but these tree objects are typically quite small, so that whether these are shared or not, doesn't matter very much. The important part is the blob-object sharing.
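Here is a simplified sketch of the same idea for trees. It only handles a single directory level of regular files (mode 100644) and assumes SHA-1; real trees also nest sub-trees and use other modes. The names object_id and flat_tree_id are invented for this example.

```python
import hashlib

def object_id(kind: bytes, body: bytes) -> str:
    """SHA-1 over '<kind> <size>' plus a NUL byte plus the body."""
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def flat_tree_id(entries: dict) -> str:
    """entries maps a file name to a blob hash ID (regular files, one level only)."""
    body = b""
    for name in sorted(entries, key=str.encode):
        # Each entry: mode, space, name, NUL, then the blob hash as 20 raw bytes.
        body += b"100644 " + name.encode() + b"\0" + bytes.fromhex(entries[name])
    return object_id(b"tree", body)

readme = object_id(b"blob", b"hello world\n")
tree_v1 = flat_tree_id({"README.md": readme})
tree_v2 = flat_tree_id({"README.md": readme})  # same names, same blobs
tree_v3 = flat_tree_id({"README.md": readme, "Makefile": object_id(b"blob", b"")})

print(tree_v1 == tree_v2)  # True: both commits can share this tree
print(tree_v1 == tree_v3)  # False: adding a file needs a new tree
print(flat_tree_id({}))    # 4b825dc642cb6eb9a060e54bf8d69288fbee4904, Git's empty tree
```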

Commits also get hash IDs, and are linked into chains

Besides these blob hash IDs, every commit you make gets a unique hash ID.² Here's an example of an actual commit in the Git repository for Git itself:

$ git cat-file -p HEAD | sed 's/@/ /'
tree 33bba5e893986797fd68c4515bfafd709c6f69e5
parent 8619522ad1670ea82c0895f2bfe6c75e06df32e7
author Junio C Hamano <gitster pobox.com> 1563561263 -0700
committer Junio C Hamano <gitster pobox.com> 1563561263 -0700

The sixth batch

Signed-off-by: Junio C Hamano <gitster pobox.com>

The tree line provides the hash ID for the tree object that holds the name-to-hash-ID mappings for the blob objects (i.e., frozen files) that go with this commit. The parent line provides the hash ID of the previous commit.
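If you want to pull those fields out programmatically, a small parser is enough for simple commits. This is only a sketch: parse_commit is a made-up name, and it ignores multi-line headers (such as signatures) that some commits carry.

```python
def parse_commit(text: str) -> dict:
    """Split `git cat-file -p <commit>` output into header fields and the message."""
    header, _, message = text.partition("\n\n")
    fields = {"parents": [], "message": message}
    for line in header.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parents"].append(value)  # a merge commit has several parent lines
        else:
            fields[key] = value
    return fields

sample = """tree 33bba5e893986797fd68c4515bfafd709c6f69e5
parent 8619522ad1670ea82c0895f2bfe6c75e06df32e7
author Junio C Hamano <gitster pobox.com> 1563561263 -0700
committer Junio C Hamano <gitster pobox.com> 1563561263 -0700

The sixth batch

Signed-off-by: Junio C Hamano <gitster pobox.com>
"""

commit = parse_commit(sample)
print(commit["tree"])     # hash ID of the snapshot's top-level tree
print(commit["parents"])  # list holding the previous commit's hash ID
```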

If we use single uppercase letters to stand in for the big ugly commit hash IDs, we can draw a picture of this. Suppose we have a tiny little repository with just three commits in it, which we'll call A, B, and C, in the order we made them. Our last commit, C, will remember the actual hash ID of our second commit B. Our second commit will remember the actual hash ID of our first commit. We will say that C points to B, and B points to A, and draw them like this:

A <-B <-C

Each commit has a snapshot of all the files. If B mostly has the same files as A, it just re-uses A's files directly, via their blob hash IDs. If C mostly has the same files as B, it just re-uses B's re-used files. Only new-contents or totally-new files require any new blobs.

Our branch name master will then hold commit C's actual hash ID. That lets us, or our Git at least, find the last commit. Commit C holds B's ID, so from C our Git can find B, and then B points to A so our Git can find A too. A is the very first commit, so it doesn't point anywhere, and git log can stop after showing us all three commits:

A <-B <-C   <--master

Git starts by reading master to find C and shows us C. Then Git moves back one step, using the parent stored in C to find B, and so on.
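That backwards walk is easy to model. The sketch below is a toy, not Git's implementation: it uses the single-letter names from the drawing in place of hash IDs and handles only linear history.

```python
# Toy model: each commit maps to its parent (None for the very first commit).
parents = {"A": None, "B": "A", "C": "B"}
branches = {"master": "C"}

def log(branch: str) -> list:
    """Start at the commit the branch name holds and follow parents backwards."""
    history = []
    commit = branches[branch]
    while commit is not None:
        history.append(commit)
        commit = parents[commit]
    return history

print(log("master"))  # ['C', 'B', 'A']
```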

To add a new commit D, we have Git:

1. Save any new blobs needed; we'll re-use the old ones as appropriate.
2. Write out any tree objects needed to hold the file name(s) for all the files (name-and-blob-ID pairings) that will be in our new commit D.
3. Write out a new commit object. The tree will be the tree we made in step 2; the parent will be commit C; the author and committer will be us (name and email); the time stamps will be "now", encoded the way Git encodes date-and-time; and the log message will be whatever log message we tell Git to use.

   (Writing out the commit object assigns the commit its unique hash ID. This hash ID appears random, but is actually totally determined by the commit's contents. It's actually a checksum (currently SHA-1) of the header for the commit object plus the contents, so that every Git in the universe will compute the same hash ID for this commit. This is what allows Git to have clone repositories.)
4. Now that commit D exists and points back to C, overwrite the name master with D's commit hash ID.

Now we have:

A--B--C--D   <-- master

(It's easier in ASCII text not to draw as many arrows, especially when we get a little more complicated, so just remember that all the internal arrows, from commits to their parents, point backwards.)

To make a branch, we just make a new branch name, such as feature, that points to any existing commit, such as D:

A--B--C--D   <-- master, feature

All four commits are now on both branches. Then when we make another new commit E, Git updates one of the two branch names, but not the other:

A--B--C--D   <-- master
          \
           E   <-- feature

and now commits A through D are on master, and A through E are on feature. If we go back to master and add a commit F, we get:

           F   <-- master
          /
A--B--C--D
          \
           E   <-- feature
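The same bookkeeping can be sketched in a few lines. Again this is a toy model with letters standing in for hash IDs; create_branch, commit, and reachable are invented helper names, not Git commands.

```python
parents = {"A": None, "B": "A", "C": "B", "D": "C"}
branches = {"master": "D"}

def create_branch(name, start):
    branches[name] = branches[start]        # just a new name for an existing commit

def commit(branch, new_commit):
    parents[new_commit] = branches[branch]  # the new commit points back to the old tip
    branches[branch] = new_commit           # only this one branch name moves

def reachable(branch):
    found, c = [], branches[branch]
    while c is not None:
        found.append(c)
        c = parents[c]
    return found[::-1]

create_branch("feature", "master")
commit("feature", "E")
commit("master", "F")
print(reachable("master"))   # ['A', 'B', 'C', 'D', 'F']
print(reachable("feature"))  # ['A', 'B', 'C', 'D', 'E']
```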

²Commits never get shared because Git records not only who made the commit—your name and email address and so on—and other useful information like that, but also includes a date-and-time stamp of when you made it, and the hash ID of the previous commit. In fact, Git stores two time stamps here—see the author and committer lines—but the key is that something in a new commit is always at least slightly different from everything in every previous commit, so that the new commit gets a new, unique-to-it, hash ID. The only way to get the exact same hash ID is to record the exact same data: that you made this commit, with this saved source, at this exact time and date as the last time you made this same commit with the same saved source at the exact time and date ... in which case, well, that's just deja vu. :-)
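Here is a sketch of why that holds, assuming SHA-1 and a simplified commit layout (a fixed +0000 time zone and the same person as author and committer). The name commit_id and the author string are placeholders for this example.

```python
import hashlib

def commit_id(tree, parent, author, timestamp, message):
    """SHA-1 over a 'commit <size>' header, a NUL byte, and the commit text."""
    lines = ["tree " + tree]
    if parent is not None:
        lines.append("parent " + parent)
    lines.append(f"author {author} {timestamp} +0000")
    lines.append(f"committer {author} {timestamp} +0000")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    header = b"commit " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

empty_tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
who = "A U Thor <author@example.com>"  # placeholder identity

first = commit_id(empty_tree, None, who, 1563561263, "initial")
again = commit_id(empty_tree, None, who, 1563561263, "initial")
later = commit_id(empty_tree, None, who, 1563561264, "initial")
child = commit_id(empty_tree, first, who, 1563561263, "initial")

print(first == again)  # True: identical data gives the identical hash ID
print(first == later)  # False: one second later is a different commit
print(first == child)  # False: a different parent is a different commit
```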

Getting rid of commits is hard, but not impossible

Commits are totally frozen for all time. They're findable by starting at a branch name and working backwards. (For much more about all of this, see Think Like (a) Git.) Finding a commit gives you access to all of its files.

To delete a file—really, a blob object—from Git, you must find all the commits that refer to that blob. Suppose, for instance, we have:

           F--G   <-- master
          /
A--B--C--D
          \
           E--H   <-- feature

and we accidentally committed a really big file at D. That same file is also linked to commits F and G, since they also use the same blob. We cleverly removed the file from E, so it's not in E and H.

If we want to remove the big blob, we have to come up with a new replacement commit that's like D, only different: it no longer has the big file. That new commit will get a new and different hash ID, even if we re-use all the date-and-time-stamps and log messages and so on, because it will have a different tree object in it. We'll call this new commit D':

           D'
          /
         /
        /  F--G   <-- master
       /  /
A--B--C--D
          \
           E--H   <-- feature

Like D, D' has parent C. In fact, it has everything the same as D except the tree. But since it has a different tree (one that omits the big file), it has a different hash ID.

We cleverly made E and H without the big file, so we're OK here ... except that E's parent is D. So now we need to copy E to a new commit E' that's like E, except that its parent is D'. Having copied E, we now have to copy H to a new commit H' that's just like H except that its parent is E'.

Meanwhile, we also have to copy F and G to new commits F' and G'. F' will be like F except for two changes: it won't have the big file, and it will have D' as its parent. G' will be like G but won't have the big file and will have F' as its new parent:

               F'-G'
              /
             D'
            / \
           /   E'-H'
          /
         /
        /  F--G   <-- master
       /  /
A--B--C--D
          \
           E--H   <-- feature

What we have done here, in other words, is to re-copy everything "downstream" of the bad commit D so that they're all fixed-up "better" commits.

Now that we have done this, we can take the two branch names, master and feature, and yank them off their existing commits and make them point to the copied commits:

               F'-G'  <-- master
              /
             D'
            / \
           /   E'-H'  <-- feature
          /
         /
        /  F--G
       /  /
A--B--C--D
          \
           E--H

Now we can't find commits G and H: git log, even with --all, won't see them because there is no branch or tag name by which to find them, and with them gone there is no way to reach D, E, or F either. So now we can forget they ever existed, and stop drawing them in. If we don't remember the actual hash IDs of D, E, F, G, and H, we won't even notice that D', E', F', G', and H' have different hash IDs, and we can pretend that these were the right commits all along.
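The ripple effect can be demonstrated with a toy model of the history in the drawings. This is only a sketch: the stand-in hash below is not Git's real commit format, and the file names (including big.bin) are invented. What it shows is that dropping one file from D forces new IDs on every commit downstream of D, while A, B, and C keep theirs.

```python
import hashlib

# Toy history from the drawings: name -> (parent name, files in the snapshot).
history = {
    "A": (None, {"README.md"}),
    "B": ("A", {"README.md", "Makefile"}),
    "C": ("B", {"README.md", "Makefile"}),
    "D": ("C", {"README.md", "Makefile", "big.bin"}),
    "E": ("D", {"README.md", "Makefile"}),
    "H": ("E", {"README.md", "Makefile"}),
    "F": ("D", {"README.md", "Makefile", "big.bin"}),
    "G": ("F", {"README.md", "Makefile", "big.bin"}),
}

def commit_ids(history, drop=None):
    """Compute a stand-in hash ID per commit, optionally omitting one file everywhere."""
    ids = {}
    for name, (parent, files) in history.items():  # parents are listed before children
        kept = sorted(f for f in files if f != drop)
        parent_id = ids[parent] if parent else ""
        text = f"parent {parent_id}\nfiles {kept}\nmessage {name}\n"
        ids[name] = hashlib.sha1(text.encode()).hexdigest()
    return ids

before = commit_ids(history)
after = commit_ids(history, drop="big.bin")
rewritten = [name for name in history if before[name] != after[name]]
print(rewritten)  # ['D', 'E', 'H', 'F', 'G']
```

E and H never contained the big file, yet they are rewritten anyway, because each one's hash covers its parent's hash.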

Once the commits are unfindable, they will eventually go away. They're still frozen, but eventually they just fall out of the back of the freezer, as it were. :-) This doesn't happen right away: Git tries hard not to lose information. If the big files aren't so big as to create huge problems with day to day work, just let them go away on their own (typically this takes 2 weeks to a month or more, depending on a lot of things). Otherwise, search StackOverflow, or look closely at the notes near the end of the git filter-branch documentation.

But there is a very big ugly wrinkle here. If we ever sent these commits—D, E, etc—to some other Git, that other Git has these commits, complete with the big file, stored by these original commit hash IDs. That Git will, every time our Git connects to it, be able to offer the commits back to us. If we take them back, we're right back to the situation where the big file is in our repository! Note that having commit E, even though it does not have the big file, implies that you should also have commit D, which does have the big file. Commits come with their history, and their history is simply every commit you can find by working backwards.

This kind of almost-viral "re-infection" with bad commits means that once you have made some big-file commits, they can be very hard to eradicate. You have to get rid of them from your repository, then make sure—maybe with git push --force—that you get rid of them from every other repository that your Git has had Git-sex with.

There are tools for getting rid of big files

I've already directed you to a question about finding the commits that contain big files, but usually it's more interesting to get rid of the big files. We also saw an outline of what must be done, but actually doing that is quite messy. Fortunately, there are several tools for this. See How to remove/delete a large file from commit history in Git repository?
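For the narrower question of which file name goes with a blob hash, one common approach is to scan the output of git rev-list --objects --all, which prints each reachable object's hash ID followed by a path for trees and blobs. The sketch below separates the parsing from the Git call; both function names are made up, the sample output is invented apart from the hash from the question, and Git lists each object only once, so a blob stored under several names shows up under just one of them.

```python
import subprocess

def paths_for_object(rev_list_output: str, object_id: str) -> list:
    """Pick the path(s) printed next to object_id in `git rev-list --objects` output."""
    paths = []
    for line in rev_list_output.splitlines():
        found_id, _, path = line.partition(" ")
        if found_id == object_id and path and path not in paths:
            paths.append(path)
    return paths

def find_paths_in_repo(object_id: str, repo: str = ".") -> list:
    """Run git in `repo` (needs git on PATH) and look the object up."""
    out = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    return paths_for_object(out, object_id)

# Made-up sample output; only the blob hash comes from the question.
sample = (
    "1111111111111111111111111111111111111111\n"
    "2222222222222222222222222222222222222222 assets\n"
    "9c9e8c2357f961122596db1ae70d19e1b168e7a7 assets/video.mp4\n"
)
print(paths_for_object(sample, "9c9e8c2357f961122596db1ae70d19e1b168e7a7"))
```

In a real repository you would call find_paths_in_repo with the hash from the warning; finding which commits use the blob is what the linked question covers.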

Read through the sections above to understand what these tools will do, and the various caveats. If you plan to use The BFG, read through its documentation carefully: it tries not to touch the last commits on each branch (the ones that the branch names identify in the drawings above), on the assumption that you have already fixed those up by removing any big files. If you plan to use git filter-branch instead, there's no special caveat here, except to note that git filter-branch is difficult to use properly (it's slow, and it's easy to forget to include a --tag-name-filter if you have tags that might preserve unwanted commits).
