How git knows about changing file name from a single blob?

Question

I created an empty directory, initialized a Git repository in it, and I watch the .git directory with this command:

tree -C -I 'info|pack' .git/objects

I created a file named readme.md with the content hello world and staged it with `git add .`; Git created a blob:

.git/objects
└── 3b
    └── 18e512dba79e4c8300dd08aeb37f8e728b8dad

As far as I know, a blob knows nothing about the file name, and there is no other Git object in the .git directory:

├── objects
│   ├── 3b
│   │   └── 18e512dba79e4c8300dd08aeb37f8e728b8dad
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

But when I rename the file to readyou.md and run git status, I get:


On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file:   readme.md

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    deleted:    readme.md

Untracked files:
  (use "git add <file>..." to include in what will be committed)
    readyou.md

Git is fully aware of the change, so it must track the file name somewhere. If it's not in the blob, where is it?

torek · Accepted Answer · 2020-04-09T04:43:36.793

This has nothing to do with the internal blob storage format, and everything to do with the actual commands you ran.

In particular, you did:

git init
echo hello world > readme.md
git add readme.md

So far so good: you've had Git copy the readme.md file into Git's index. Since the index doesn't actually hold copies of file data, Git had to make the internal blob object you mention; what's in the index now, if you dump it out with git ls-files --stage or git ls-files --debug, is one entry with mode 100644 (plain file), name readme.md, and the blob hash.
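A minimal sketch of that index entry, assuming a scratch repository created with mktemp and Git's default SHA-1 object format (the variable names are only illustrative):

```shell
# Scratch repository; the temporary path is an assumption for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q
echo hello world > readme.md
git add readme.md

# The blob is named only by the hash of its content; no file name is stored in it.
blob=$(git hash-object readme.md)
echo "$blob"

# The index entry is what ties mode, blob hash, and name together.
entry=$(git ls-files --stage)
echo "$entry"
```

With SHA-1, both commands report the blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad seen in the question; the index line pairs it with mode 100644 and the name readme.md.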

But now instead of git mv readme.md readyou.md or some other equivalent series of commands, you ran:

mv readme.md readyou.md

This renamed your work-tree copy of the file. The file in your work-tree is managed by your operating system, not by Git. The copy in your work-tree is not what will go into the next commit: what will go into the next commit is whatever is in Git's index. The index is entirely unchanged: it still lists readme.md.

Your git status command works by doing the following:

List information about the current branch and that sort of thing. You're currently on an unborn branch, or orphan branch (Git uses both terms for this): the name master is stored in .git/HEAD but no branch named master actually exists, because no commits exist yet. So the first line of git status says this.
Compare the current commit to the index. There is no current commit yet, so Git uses instead Git's empty tree as the left-side of this comparison. The result of the comparison is to announce that there is a new file named readme.md that will go into the next commit. These are your changes to be committed.
Compare the index to the work-tree. The index contains a file named readme.md, and the work-tree doesn't. So readme.md is deleted. The work-tree contains a file named readyou.md and the index doesn't; Git holds on to this for a moment. The result of this comparison is to announce that the file readme.md is deleted, but these are your changes not staged for commit.
Finally, since there are work-tree files that are not in the index, Git filters this list through the .gitignore entries. (Technically this actually happens while scanning the work-tree: there's no point putting files into a list, then knocking them out, when Git can just avoid putting them in at all.) Since there are no .gitignore entries, Git doesn't shut up about this: it prints these files' names, as untracked files.
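The comparisons in these steps can be reproduced one at a time with plumbing-style commands. This is only a sketch in a scratch repository (the temporary path and variable names are assumptions), not a description of how git status is implemented internally:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
echo hello world > readme.md
git add readme.md
mv readme.md readyou.md   # plain OS rename; the index is not touched

# Current commit (here the empty tree, since no commits exist) vs. index
staged=$(git diff --cached --name-status)
# Index vs. work-tree
unstaged=$(git diff --name-status)
# Work-tree files with no index entry, after ignore rules are applied
untracked=$(git ls-files --others --exclude-standard)

printf 'staged:    %s\nunstaged:  %s\nuntracked: %s\n' "$staged" "$unstaged" "$untracked"
```

The three results line up with the three sections of the git status output in the question: readme.md added, readme.md deleted, and readyou.md untracked.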

The key to all of this is to realize that Git has this extra copy¹ of each file in Git's index, and that git commit will build the new commit from the index. Git will not even look at your work-tree here. It's what is in the index that matters. The git status command does not show you what's in the index; instead, it shows you what's different in the index, twice: once as compared to the current commit (or the empty tree in this special state), and once as compared to your work-tree, on the assumption that you can actually see what's in your work-tree.

If the index matches the work-tree, there's nothing you can copy from the work-tree, to the index, to change it. So that's not interesting, and the last two git status output sections—changes not staged, and untracked files—are empty. If the index matches the current commit, there's no reason to make a new commit, so that's not interesting, and the changes-staged-for-commit section of output is empty.

¹Again, the index really holds (mode, name, blob-hash) tuples. It also has a bunch of cache data, which is why it's sometimes called the cache, though these days, cache is supposed to be reserved for the in-memory data structure that Git builds by reading the .git/index file. Because you use the index to "stage" updated files for committing, the index is also called the staging area.

Note that `git mv` is not special in any way

If you use git mv readme.md readyou.md, Git:

renames the work-tree file, and
rips out the old readme.md index entry and puts in a readyou.md entry instead.

git status will now just say that there is a new readyou.md file, ready to be committed. But suppose you actually commit the readme.md file first. Now git status will say that there is a rename ready to be committed.

Technically, all that will happen if you make a second commit is that the second commit will have a file named readyou.md and not have a file named readme.md. It's Git's difference engine—the program you can invoke with git diff, that git status also invokes, and other Git commands use—that decides that this file was "renamed".
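To see this, here is a hedged sketch in a scratch repository. The user.name and user.email values are placeholders so that the commits work without any global configuration:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
echo hello world > readme.md
git add readme.md
# Placeholder identity, passed per command; not taken from the original post.
git -c user.name=Example -c user.email=example@example.com commit -q -m "add readme"

git mv readme.md readyou.md
status=$(git status --porcelain)
echo "$status"    # the comparison reports a rename

git -c user.name=Example -c user.email=example@example.com commit -q -m "rename"
tree=$(git ls-tree HEAD)
echo "$tree"      # the commit itself just lists readyou.md
```

The second commit's tree contains only readyou.md, pointing at the same blob as before. Nothing in the commit records a rename; that word comes from the comparison.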

If, instead of git mv, you run:

mv readme.md readyou.md
git add readme.md readyou.md

the separate git add step here will remove readme.md from the index and copy readyou.md into the index (again see footnote 1—it will really just keep the existing blob object). Oddly, git add path/to/file removes path/to/file from the index if it's there in the index and not there in the work-tree. The way to make sense of this is to realize that git add doesn't just mean copy file from work-tree to index, but instead means something simpler: make index match work-tree. The "match" in the case of a renamed or removed file is to remove the old name.
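A small sketch of that sequence, again in a scratch repository (temporary path assumed, default SHA-1 object format assumed):

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
echo hello world > readme.md
git add readme.md

mv readme.md readyou.md
# readme.md is gone from the work-tree, so "adding" it removes its index entry;
# readyou.md is new, so it gets an entry that reuses the existing blob.
git add readme.md readyou.md

index=$(git ls-files --stage)
echo "$index"
```

The index ends up with a single entry for readyou.md carrying the same blob hash, which is the same state git mv would have produced.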

When Git is comparing two commits, or one commit and the index, or the index and your work-tree, or whatever, Git will detect a rename only under certain conditions. These include:

The rename detector must be enabled. When you run git diff you can turn it on manually with -M or --find-renames. It defaulted to off in old versions of Git, and switched to on by default in Git 2.9. You can configure your own private default using the diff.renames configuration knob.
The file's content must either match exactly—an exact match is fast for Git—or be "similar enough". This similarity is expressed as a percentage: an exact match is 100% similar, and a file with no bytes in common is 0% similar. Files that are not identical but do have some bytes in common, as determined by a quick and dirty file scanner, are some percentage similar: more than zero, less than 100. When using --find-renames or -M you can set the similarity threshold. The default is 50%. Files that aren't at least 50, or whatever other number, percent similar are considered "not renamed".
For Git to detect a rename, there must be a file on the "left side" that is just not there on the "right side", and vice versa. For instance, in this case, the left side—the first commit—would have a file named readme.md, but the right side—the current index content—would not. The right side has readyou.md but the left side does not. Git pairs up left and right side files that have the same name. Any files left over, i.e., not paired at this point, are candidates for rename detecting.
When using git diff, there's something else you can do with the -B (break-pairings) option. You can't do this with git status so we won't go into any further detail.
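As an illustration of the first condition, the same staged change can be shown with the rename detector forced off and forced on. This is a sketch in a scratch repository; the identity values are placeholders:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
echo hello world > readme.md
git add readme.md
# Placeholder identity so the commit works without global configuration.
git -c user.name=Example -c user.email=example@example.com commit -q -m "add readme"
git mv readme.md readyou.md

# Detector off: the unpaired files are reported as a delete plus an add.
detect_off=$(git diff --cached --no-renames --name-status)
# Detector on: identical content pairs the two names as a 100% rename.
detect_on=$(git diff --cached --find-renames --name-status)

printf 'off:\n%s\non:\n%s\n' "$detect_off" "$detect_on"
```

A threshold other than the default can be passed as, for example, --find-renames=90% or -M90%.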

The git status command has the rename detector turned on and set to 50% by default. In older versions of Git there is no way to change this, but Git 2.18 added the status.renames setting, which defaults to the value of diff.renames when unset. So diff.renames now also controls git status, unless you set status.renames, which takes precedence for git status.
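A brief sketch of the status.renames setting, assuming Git 2.18 or later and a scratch repository. The setting is passed with -c for a single command here instead of being written to a config file, and the identity values are placeholders:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
echo hello world > readme.md
git add readme.md
# Placeholder identity so the commit works without global configuration.
git -c user.name=Example -c user.email=example@example.com commit -q -m "add readme"
git mv readme.md readyou.md

# Rename detection enabled for status: one "R" entry.
renames_on=$(git -c status.renames=true status --porcelain)
# Rename detection disabled for status: a staged delete plus a staged add.
renames_off=$(git -c status.renames=false status --porcelain)

printf 'on:\n%s\noff:\n%s\n' "$renames_on" "$renames_off"
```

The index is identical in both runs; only the way git status summarizes it changes.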

The point of all this verbiage is that git mv isn't really special. It just takes care of two rename operations at once: one in the index, and one in your work-tree. If you forget, just use git add to remove the old name and add the new one. The effect is exactly the same: the next commit you will make, as stored right now in Git's index, is now updated correctly.

thanks a lot, you explained way more than I expected. – farhad, Apr 09 '20 at 01:36
