
I have a folder, project, with a large number of files and subfolders. I created a repository in this folder with git init, which gives the folder structure below.

project
    --- .git/
    --- large number of files and folders .gitignore'd
    --- very few text and related files not under .gitignore
    --- .gitignore

The very first (and thus far only) commit in the repository only contained a few text and related files not .gitignore'd.

The raw size of the committed files on my disk (working tree) is just a few kilobytes.

More specifically, the committed files are:

3 .tex files of total size 9 KB
4 .lyx files of total size 32 KB
1 .gitignore file of size 1 KB
3 other .txt files of total size 4 KB

Yet, at this stage, the raw size of the .git folder is 84 MB. The size of the project folder itself is around 5 GB, most of which is .gitignore'd.

Is there a way I can try to figure out what is causing this large gap between the actual committed files and the size of the .git folder?

Tryer
  • So what is the size of actual committed files? – user7860670 Apr 02 '22 at 14:48
  • @user7860670 updated the OP with details – Tryer Apr 02 '22 at 14:54
  • Can you identify which subfolder/files in the .git directory are large? (Likely the objects folder.) – TTT Apr 02 '22 at 14:56
  • "the actual committed files" How do you know what files are in this commit? Did you look, or are you guessing? – matt Apr 02 '22 at 14:58
  • Seeing that you only have a few files which you want tracked, did you try running a `git ls-files` command to make sure that there is nothing included that you don't want. – sgmoore Apr 02 '22 at 15:01
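The checks suggested in these comments can be sketched as a small shell helper. This is only an illustration: `inspect_repo` is a hypothetical function name, not a Git command, and it assumes a POSIX shell with `git`, `du`, and `sort` available.

```shell
#!/bin/sh
# inspect_repo is a hypothetical helper name used for illustration only.
# Argument: path to the top of the working tree (defaults to the current folder).
inspect_repo() {
    repo="${1:-.}"

    echo "== tracked files =="
    # Lists every path in the index; anything unexpected here is being tracked.
    git -C "$repo" ls-files

    echo "== size of each entry under .git (KB, largest last) =="
    du -sk "$repo"/.git/* | sort -n

    echo "== object counts and sizes =="
    # "count"/"size" describe loose objects; "in-pack"/"size-pack" describe packed ones.
    git -C "$repo" count-objects -vH
}
```

Calling `inspect_repo .` from the project folder prints the tracked paths, shows which part of `.git` takes the space, and reports how many loose objects exist. A large loose-object count alongside a short `ls-files` list suggests objects that no commit refers to.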

1 Answer


If you made a prior commit with many more files, and then re-wrote it, the commit is still in the repo until it is garbage collected. But I'll take your word for it that you didn't do that:

The very first (and thus far only) commit in the repository only contained a few text and related files not .gitignore'd.

Therefore, the simplest explanation is that you staged a large number of files before getting your .gitignore file set up properly. Even staging files without committing them takes up space in the repository, at least temporarily. You can easily check whether this is the cause with the prune command:

git prune -n # dry run, show what would be removed
# and to actually do it
git prune

Then check your repo size again.
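To see the effect in isolation, here is a minimal sketch that reproduces the situation in a throwaway repository. The file names, the temporary directory, and the demo identity passed to `git commit` are all assumptions made for the example; nothing here touches your real project.

```shell
#!/bin/sh
set -e
# Work in a scratch directory so the real project is untouched.
demo=$(mktemp -d)
cd "$demo"
git init -q

# Stage a file "by accident", then unstage it and ignore it.
echo "pretend this is a large binary" > big.bin
git add big.bin                  # writes a blob into .git/objects
git rm -q --cached big.bin       # removes it from the index only
echo "big.bin" > .gitignore

# Commit only the .gitignore file.
git add .gitignore
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

# The blob for big.bin is still stored as a loose object.
before=$(git count-objects | cut -d' ' -f1)
git prune -n                     # dry run: lists the unreachable blob
git prune                        # actually removes it
after=$(git count-objects | cut -d' ' -f1)
echo "loose objects before: $before, after: $after"
```

The loose-object count should drop after `git prune`, because the blob written by `git add big.bin` is no longer referenced by the index or by any commit.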

Side Note: under normal circumstances you don't need to run the prune command, because pruning happens automatically during garbage collection. However, gc only prunes unreachable objects older than a grace period, which defaults to 2 weeks. So if you wish to use gc to force a full pruning, you could use:

git gc --prune=now

Side Side Note: I always advise people to commit early and often, because if they ever really mess something up, they can traverse their reflog to find old unreachable commits and recover lost (but previously committed) work. Since by default even unreachable loose objects sit around for 2 weeks before gc prunes them, you could potentially recover files that were only staged in the last 2 weeks but never committed.
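As a rough sketch of that kind of recovery, the script below stages a file, discards it without committing, and then finds its content again through `git fsck`. The file names and the scratch repository are assumptions for the demo. Blobs carry no file names, so the content has to be recognised by reading it.

```shell
#!/bin/sh
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q

# Give the repository one ordinary commit first.
echo "readme" > README.txt
git add README.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

# Stage a draft, then lose it without ever committing.
echo "draft that was staged but never committed" > notes.txt
git add notes.txt
git rm -q --cached notes.txt
rm notes.txt

# List blobs that nothing refers to any more.
lost=$(git fsck --unreachable --no-reflogs 2>/dev/null |
       awk '$1 == "unreachable" && $2 == "blob" { print $3 }')

# Print each blob's content so it can be identified by eye.
for id in $lost; do
    echo "== $id"
    git cat-file -p "$id"
done
```

This only works while the object still exists, so it has to be done before `git prune` or `git gc --prune=now` is run.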

TTT
  • This worked exactly! Before this, the largest folder was the objects folder within git. After the above, the size is down to about 45 KB! – Tryer Apr 02 '22 at 15:32
  • when I ran `git prune -n`, it gave me a long list of `blob`s with longish SHAIDs (?). Is it possible to somehow cast these ids into human readable filenames on my disk ? – Tryer Apr 02 '22 at 16:24
  • @Tryer Unfortunately, probably not. blobs are file contents without the filenames. Maybe you can deduce what the filenames were by viewing their contents. See the link in the "Side Side Note" for details. – TTT Apr 02 '22 at 16:34
  • Thanks. I will go to sleep well tonight approximately knowing what precisely is going on in my repo. Otherwise, it is this nagging troublesome feeling where you end up using some technology without knowing what exactly it is doing. Git is somewhat like that for a newbie -- very few files committed, but the repo size being huge, etc. – Tryer Apr 02 '22 at 16:42
  • @Tryer: if you know about Unix file systems (directories and files and inodes), Git's storage model will make a bit more sense. Inside the repository database there are four object types: tag (annotated tag), commit, tree, and blob. Ignoring tags as uninteresting, take a look at your current commit with `git cat-file -p HEAD`. Note that what you see is only the metadata but there is exactly one `tree` line. Copy-paste the hash ID shown and pass it to `git cat-file -p`. (The tree object is binary but `cat-file -p` makes it readable.) – torek Apr 02 '22 at 23:18
  • The tree object corresponds pretty well to a Linux directory: it has "tree entries" giving a file name, file type (mode), and blob hash ID. Instead of storing the type/mode in an inode, Git stores them directly in the directory entry. The blob hash ID is then roughly equivalent to an inode number: `git cat-file -p` of a blob hash ID spills out the content. A `tree` entry can also be another `tree`, which provides the mappings for file names: `path/to/file` is a tree entry `path` with mode and hash leading to a tree that holds `to` that leads to a tree that holds `file` (and blob hash ID). – torek Apr 02 '22 at 23:20
  • The curious thing here is that Git first reads the tree into Git's *index*, which is essentially a flattened set of trees minus the directories, with the path names strung together, so that we have only `path/to/file` and the mode and other cache data. It's the index's lack of support for "directories" that makes Git unable to store an empty directory: Git builds the *next* commit from the index, and the index never has a directory entry; Git synthesizes a new set of trees based on the stored path names in the index. – torek Apr 02 '22 at 23:22
  • Anyway, since the index does store paths-and-modes-and-blob-hash-IDs, `git add newfile` first compresses and hashes the contents, checks for duplicates, and if needed, writes a new blob object. Then it shoves the path-mode-hash triple into the index. When `git gc` or `git prune` goes to remove unused objects, Git has to traverse *all* possible references to Git objects, which involves reading every ref, every reflog, and the index, and using that to construct a reachability bitmap (or equivalent). This is pretty expensive and is why `git gc` is rare-ish. – torek Apr 02 '22 at 23:25
  • Note that `git worktree add` messes with this a bit: now we have to read not just all the refs and *the* index, but rather all the refs and *every* index (and every added worktree's `HEAD`). This was the bug between Git 2.5 and 2.15: `git gc` / `git prune` failed to read those, and would prune objects used only by an added working tree. – torek Apr 02 '22 at 23:26
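
The commit, tree, blob, and index relationships described in these comments can be explored with a short sketch like the following. It builds a throwaway repository containing the single path `path/to/file` from the example above. The demo identity and the file content are assumptions made for illustration.

```shell
#!/bin/sh
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q

mkdir -p path/to
echo "hello" > path/to/file
git add path/to/file
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

# Commit object: metadata plus exactly one "tree" line.
git cat-file -p HEAD

# Top-level tree: a single entry named "path", which is itself a tree.
tree=$(git rev-parse 'HEAD^{tree}')
git cat-file -p "$tree"

# The index is flat: it stores the full path with mode and blob hash ID.
git ls-files --stage

# The blob holds content only, with no file name attached.
blob=$(git rev-parse HEAD:path/to/file)
git cat-file -p "$blob"
```

Comparing the output of `git cat-file -p "$tree"` with `git ls-files --stage` shows the difference between the nested tree objects and the flattened index.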