Let's take some definite items / hard facts into account here first:
Git isn't about files, it's about commits.
Commits are numbered, e.g., dcaa9f9 (seen in the git branch -vv output) or ac7cXXXX (seen in your git log output). These numbers, written in hexadecimal, are hash IDs, so they aren't in any sensible order and aren't very useful for humans, but they are how Git really accesses each commit.
The hash IDs are actually cryptographic checksums of the contents of the commit, which makes all parts of every commit completely read-only. Nothing can change in the commit once it's made. So in general we just add new commits to the repository, which is how Git stores history. The commits are the history.
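To make the checksum claim concrete, here is a minimal sketch that builds a throwaway repository and re-hashes the raw text of a commit. The identity, file name, and variable names are made up for illustration; the two Git commands doing the real work are git cat-file and git hash-object.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo "hello" > file.txt
git add file.txt
git commit -q -m "first commit"

# The ID Git stored for the commit we just made.
stored_id=$(git rev-parse HEAD)
# Feed the raw commit text (tree, author, committer, message) back
# through Git's hashing function.
recomputed_id=$(git cat-file commit HEAD | git hash-object -t commit --stdin)

echo "stored:     $stored_id"
echo "recomputed: $recomputed_id"
```

The two printed IDs should be identical, which is why changing even one byte of a commit would give it a different hash ID, i.e., make it a different commit.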
Commits store files, but not as changes. Each commit stores a full snapshot of every file, or more precisely, every file that Git knew about at the time someone ran git commit to make that commit. (These are the tracked files: the untracked files are the ones that aren't in the next commit you'll make.)
Commits also store metadata. This includes information about who made the commit, when, and why (the log message). In this metadata, Git stores, in each commit, some hash IDs. These are IDs of commits that existed at the time that you (or whoever) made the commit, so they're necessarily hash IDs of earlier commits. In general, most commits store exactly one hash ID: the very previous commit, from which this commit was made. Most of the remaining commits are merge commits, which store two hash IDs: the previous commit, and the commit that was merged.
The hash IDs in the metadata, which Git calls the parent commit(s) of the commit in question, form the commits themselves into a DAG. In the case of a simple chain of commits—the most common thing—we'll draw this DAG-fragment ("DAGlet") like this:
... <-F <-G <-H
where H is the hash ID of the last commit in the chain. Then, being lazy, we'll get sloppy about our arrows, which lets us draw multiple DAGlets that branch and merge:
          I--J
         /    \
...--G--H      M--N   <-- main
         \    /
          K--L   <-- feature2
for instance. The names at the right, which automatically and always point to the last commit in the chain, are our branch names. The lettered nodes in the graph above are our commits, which store files permanently.
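As a rough illustration of parent links, the sketch below builds a small graph with one merge in a scratch repository and counts the parent lines stored in two commits. The commit letters, the commit helper function, and the branch names are assumptions chosen to echo the drawing above, not anything from a real repository.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch "main"
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

# Hypothetical helper: make one commit whose subject is the given letter.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit G; commit H
git checkout -q -b feature2
commit K
git checkout -q main
commit I
git merge -q --no-ff -m M feature2         # M is a merge commit

git log --graph --format=%s                # draw the graph as Git sees it

# Count the "parent" header lines stored in the commit objects.
merge_parents=$(git cat-file -p HEAD | grep -c '^parent ')
plain_parents=$(git cat-file -p HEAD~1 | grep -c '^parent ')
echo "M has $merge_parents parents; I has $plain_parents parent"
```

Here HEAD~1 is the first parent of the merge (commit I), an ordinary commit, and main^2 names the second parent, which is the tip of feature2.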
Git shows you changes by comparing the stored files. Pick any two commits. For instance, pick a parent/child pair, like G-H or H-I or M-N or whatever. Each of those commits has a full snapshot of every file. Perhaps the snapshot in H has one file that's different from that in G, and one file that isn't in G at all. Then the comparison of G vs H will show one changed file and one added file.
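The G-vs-H comparison just described can be reproduced in a scratch repository. This is only a sketch: the file names are invented, and the two commits merely play the roles of G and H.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

# Commit "G": two files.
echo one > kept.txt
echo two > edited.txt
git add kept.txt edited.txt
git commit -q -m G

# Commit "H": one file changed, one file added, one left alone.
echo changed > edited.txt
echo new > added.txt
git add edited.txt added.txt
git commit -q -m H

# Compare the two snapshots; Git works out the differences on demand.
changes=$(git diff --name-status HEAD~1 HEAD)
echo "$changes"
```

The output should list added.txt with status A and edited.txt with status M, and say nothing about kept.txt, even though all three files are stored in H's snapshot.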
Note that to compare a commit against its parent (singular), we have to have just one parent. That's great for all the commits above, except for merge commit M. It has two parents. If you ask Git to show you what changed in M, should it compare J-vs-M, or L-vs-M?
It might be nice if it would do both. In fact, some Git commands do both, but then they get a little squirrelly about that. The git log command, however, by default just doesn't bother to compare against either one. This is going to be a problem in a moment.
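A small experiment shows this default. The sketch below assumes a scratch repository with made-up file and branch names, and it assumes no log.diffMerges setting is in effect. It runs git log -p on a merge commit with and without -m, the option that asks for a separate diff against each parent.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo base > base.txt; git add base.txt; git commit -q -m H
git checkout -q -b side
echo side > side.txt; git add side.txt; git commit -q -m L
git checkout -q main
echo main > main.txt; git add main.txt; git commit -q -m J
git merge -q --no-ff -m M side             # HEAD is now merge commit M

# Count diff headers printed for the merge commit alone.
default_diffs=$(git log -1 -p HEAD | grep -c '^diff --git' || true)
per_parent_diffs=$(git log -1 -p -m HEAD | grep -c '^diff --git' || true)

echo "git log -p    shows $default_diffs diffs for M"
echo "git log -p -m shows $per_parent_diffs diffs for M"
```

Without -m the merge should produce no diff at all. With -m, git log compares M against J and against L in turn.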
Meanwhile, there's one more thing to note about the files stored inside commits. They're stored not as files, but rather as special, read-only, Git-only, compressed and de-duplicated entities (Git calls these blob objects internally though you don't normally need to care about the details). Your own programs can't actually use these, so in order to make a commit useful, Git has to extract that commit, into a working area.
Hence, all the files that you see and work with when you work with a Git repository are not in the repository after all. They are in your working tree or work-tree. These are not in Git. They were at most extracted from Git. A future git commit won't use these files either: Git builds new commits from what Git calls, variously, the index, or the staging area, or (rarely these days) the cache.
When you pick some particular commit by checking out a branch, using git checkout master for instance, Git works by extracting that commit's files. Git uses the branch name, which holds the commit's hash ID, to find the commit. The original copies of the files, as seen in the commit, go into Git's index (where they're still de-duplicated so that they take virtually no space in the index) and into your working tree (where they're expanded back into usable files, which do take space).
We then work on / with our files (the ones that aren't in Git) because these are the useful files. When we're done working on / with them, we must run git add on at least some of them. We can run git add on all of them, en masse, to be lazy and let the computer do the work, as long as we're careful to make sure that Git won't also add untracked files that we don't want to have in the next commit. Or, we can run git add only on the ones we've changed. What this does is to tell Git: make the index / staging-area copy match my working tree copy, for each file we actually add. Git will now compress them down, de-duplicate them by checking against every existing file stored anywhere in the repository, and update the index / staging-area to refer to the correct file contents, ready to go into the next commit.
This means that the index / staging-area acts as a storage space for your proposed next commit. It always has all the files in it, it's just that most of the time, most of those files—or even all of them—match the files in the current commit.
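One way to watch the index act as the proposed next commit is to look at the blob hash it records for a file before and after git add. The following is a sketch in a throwaway repository; the variable names are invented for illustration.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo v1 > README.md
git add README.md
git commit -q -m "first"

# git ls-files --stage prints: mode, blob hash, stage number, path.
index_before=$(git ls-files --stage README.md | cut -d' ' -f2)

echo v2 > README.md                        # edit the working tree copy only
index_after_edit=$(git ls-files --stage README.md | cut -d' ' -f2)

git add README.md                          # copy the new contents into the index
index_after_add=$(git ls-files --stage README.md | cut -d' ' -f2)
worktree_hash=$(git hash-object README.md)

echo "before edit: $index_before"
echo "after edit:  $index_after_edit"
echo "after add:   $index_after_add"
```

Editing the file leaves the index entry untouched. Only git add replaces it, after which the index entry matches the hash of the working tree contents.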
When we make a new commit, Git simply packages up all the files that are in its index at that time, adds the appropriate metadata (including the hash ID of the current commit, as found through the branch name we picked earlier when we ran git checkout) and writes all of this stuff out to make a new commit. The new commit gets a new, random-looking hash ID that is guaranteed[1] to be different from all existing hash IDs. The new commit object goes into the database of all objects, indexed by hash IDs. And then Git stores the new hash ID into the branch name, so that the name picks out the latest commit.
With the invariant restored—that the current branch name holds the current hash ID and that we can find all earlier commits, one at a time, by following the parent links—Git is ready for more work. Note that the commit is made from whatever is in Git's index. The files in your working tree are irrelevant.
[1] What pigeonhole principle? Collisions never happen!
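The claim that a commit is made from the index, and that the working tree is irrelevant, can be checked with a short experiment. This sketch uses a scratch repository and invented file contents.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo "v1" > README.md
git add README.md
git commit -q -m "add README"

echo "v2 staged" > README.md
git add README.md                          # index now holds "v2 staged"
echo "v3 only in working tree" > README.md # never added
git commit -q -m "update README"

committed=$(git show HEAD:README.md)       # what the new commit stores
working=$(cat README.md)                   # what is on disk

echo "committed: $committed"
echo "working:   $working"
```

The new commit should contain the staged text, while the later edit stays in the working tree as an uncommitted modification.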
What you're seeing
Let's start with the git branch -vv output:
$ git branch -vv
dataprocessing dcaa9f9 Merge pull request #122 from XYZaiXYZ/toyota
master dcaa9f9 [origin/master] Merge pull request #122 from XYZaiXYZ/toyota
* toyota dcaa9f9 [origin/toyota: ahead 1] Merge pull request #122 from XYZaiXYZ/toyota
There's a fair amount of information here. We have three branch names. All three names identify the same commit, whose hash ID starts with dcaa9f9 (actual hash IDs are longer but any unique initial abbreviation of at least 4 characters suffices, so dcaa9f9 is fine here, and we can probably get away with just dcaa).
We have two remote-tracking names: these are our Git repository's memory of some other Git repository's branch names. These are set as the upstream of the corresponding (local) branch name: master links to origin/master as master's upstream, and toyota links to origin/toyota as its upstream.
We can't see the hash IDs that are stored in the remote-tracking names here, but git branch -vv does do something special, which we see in the third line: ahead 1. This means we have one commit on our (local) branch, toyota, that's not on their toyota branch. The origin Git repository has a toyota branch too, but their toyota stores a hash ID that isn't dcaa9f9. I don't know what it is, but I do know, from the ahead 1 text, that dcaa9f9 has this commit as its parent, or perhaps as one of its parents, plural, if dcaa9f9 is a merge commit.
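The ahead count can be reproduced with git rev-list. The sketch below does not use your repository: it creates a stand-in "upstream" repository and a clone of it, with made-up directory, file, and branch names (main rather than toyota), then makes one commit that exists only in the clone.

```shell
set -e
cd "$(mktemp -d)"

# "upstream" stands in for the repository at origin (hypothetical name).
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/main
git -C upstream config user.name "Example User"
git -C upstream config user.email "user@example.com"
echo a > upstream/a.txt
git -C upstream add a.txt
git -C upstream commit -q -m "shared commit"

git clone -q upstream clone
cd clone
git config user.name "Example User"
git config user.email "user@example.com"
echo b > b.txt
git add b.txt
git commit -q -m "local-only commit"

# Commits reachable from main but not from origin/main, and vice versa.
ahead=$(git rev-list --count origin/main..main)
behind=$(git rev-list --count main..origin/main)
echo "ahead $ahead, behind $behind"
git branch -vv
```

The final git branch -vv line should include "[origin/main: ahead 1]", the same kind of annotation as in the output above.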
Last, we also get the subject line of each commit message, for each commit. Since we get the same commit three times, we get the same subject line each time. The subject line we get is "Merge pull request #122 from ...". This is the kind of (terrible, but at least standardized) message that GitHub will generate, for instance, when you use their web interface to perform a merge. So dcaa9f9 is almost certainly a merge commit, with two parent commits. Our origin/toyota, which represents our Git's memory of origin's toyota, points to one of the parents of this merge commit.
Hence, if we were to draw this, we might draw it as:
...--I--J   <-- origin/toyota
         \
          M   <-- dataprocessing, master, toyota (HEAD), origin/master
         /
...--K--L
with the letter M standing in for commit dcaa9f9. I don't know the hash IDs of any of the other commits (except that J's starts with ac7c), but we won't really need them here.
You also mention:
When in the branch:
$ git merge master
Already up to date.
This is, now, no surprise. The git merge command:
- uses your current commit (M or dcaa9f9) as found through your current branch name (found via the special name HEAD, which is what it's doing in the drawing above);
- takes, as an argument, something that locates another commit: here, master. It then finds the commit; and
- then uses the commit graph we've drawn to find a merge base, i.e., a best shared common ancestor commit.
The commit you ask to merge is dcaa9f9. That is the current commit. The best shared commit is therefore dcaa9f9 itself. That commit is the current commit, so no merge is necessary or even possible. The merge command says Already up to date. and quits.
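The same situation is easy to recreate. In this sketch, which uses a scratch repository with a single made-up commit, two branch names point at the same commit, git merge-base reports that commit as the merge base, and git merge has nothing to do.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo a > a.txt
git add a.txt
git commit -q -m "only commit"

git branch toyota                          # second name for the same commit
git checkout -q toyota

before=$(git rev-parse HEAD)
base=$(git merge-base HEAD master)         # best shared ancestor
merge_output=$(git merge master)           # nothing to merge
after=$(git rev-parse HEAD)

echo "merge base: $base"
echo "git merge:  $merge_output"
```

The merge base equals the current commit, and HEAD is the same before and after the merge command.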
$ git diff origin master
[prints nothing]: this too is unsurprising, though we need to learn one new Git trick. The git diff command takes two commit specifiers.[2] The two you gave are origin and master.
Now, origin is actually a remote, not a remote-tracking name. A remote, in Git, is a short name that stores a few things for easy access, and enables some other stuff. The main thing it stores, of interest to most people, is a URL. This is the URL your Git will use when it runs git fetch (or git pull, which runs git fetch). The "other stuff" it enables is the remote-tracking names, such as origin/master and origin/toyota.
The gitrevisions documentation describes a six-step process for turning a name like master or origin/master into a hash ID. Follow the documentation link, scroll down a bit if needed, and read through the six numbered steps. I won't quote them all here, but have a particular look at the last one: step six of six. It talks about looking for refs/remotes/name/HEAD. This will exist in your repository, and it will almost certainly be what Git calls a symbolic ref to origin/master.[3]
What all this adds up to, in the end, is that you're asking git diff to resolve origin, which by way of origin/HEAD means origin/master, to a hash ID (which it does, and gets dcaa9f9) and then to resolve master to a hash ID: dcaa9f9 again. Git then dutifully compares the snapshot in dcaa9f9 to the snapshot in dcaa9f9. Naturally, every file matches.
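To see the name resolution for yourself, the sketch below clones a stand-in repository (the directory and file names are hypothetical) and asks Git what origin, origin/HEAD, and origin/master resolve to.

```shell
set -e
cd "$(mktemp -d)"

# "upstream" stands in for the repository at origin (hypothetical name).
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/master
git -C upstream config user.name "Example User"
git -C upstream config user.email "user@example.com"
echo a > upstream/a.txt
git -C upstream add a.txt
git -C upstream commit -q -m "shared commit"

git clone -q upstream clone
cd clone

# origin/HEAD is a symbolic ref that git clone set up.
head_target=$(git symbolic-ref refs/remotes/origin/HEAD)
origin_id=$(git rev-parse origin)          # bare remote name, resolved via origin/HEAD
tracking_id=$(git rev-parse origin/master)
diff_output=$(git diff origin master)      # same commit on both sides

echo "origin/HEAD -> $head_target"
echo "origin        = $origin_id"
echo "origin/master = $tracking_id"
```

In a fresh clone like this one, all three names lead to the same commit, so the diff output is empty.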
Last, in this section anyway:
$ git log README.md
commit ac7cXXXX (origin/toyota)
Author: Mona Jalal <mona@XYZ>
Date: Fri Feb 5 22:40:32 2021 +0000
fixed two typos in the README.md
Here, you may be running into a "feature" (often a mis-feature) of git log.
When you run git log, it works by:
Starting from some commit or commits that you pick: if you don't pick one or more starting commits, it starts from the current commit (via HEAD as usual).
The git log code places these commit hash IDs into a priority queue. This is because it can only handle one commit at a time. However, when using HEAD, which only selects one commit, there's just one entry in the queue in the first place.
Walking the commit graph, one step at a time. This part can get quite tricky.
The commit graph walk makes use of the priority queue as follows:
1. Take the front entry off the queue. (If the queue is empty we're done: quit.)
2. Decide whether to print anything about this commit. If so, print stuff about it.
3. Decide whether to visit this commit's parent or parents. If this is an ordinary single-parent commit, we'll visit the (single) parent (except under --no-walk of course). If this is a merge commit, though, choose which parent(s) to visit based on any history simplification that is in effect.
4. Push any to-be-visited parent commits onto the priority queue, in priority order. (Omit any already-visited parent.)
The tricky part here is in step 3: deciding which parent(s) of a merge commit are to be visited. The tricky part here is also in step 2: deciding whether to print anything about this commit.
We first visit commit M, because that's the one commit in the queue:
Since M is a merge commit, git log is lazy and doesn't, initially at least, try to compare it to any of its parents. It just decides not to print commit M: having skipped the comparison, it treats README.md as unchanged. So even if M does have a change to README.md when compared to J or L, it's not printed here.
Since M is a merge, we check for history simplification. This is turned on! It's turned on because we have a pathspec: README.md. So now we check whether M is what git log calls "TREESAME" to any parent, after stripping the trees down based on the supplied pathspec(s). So now we actually do check whether M's parents, J and L, have the same README.md as M.
If one of these two parents does have the same README.md, that's the one that this particular git log will follow. Apparently commit J (ac7c...) has the same README.md file as commit M. Commit J is the one that origin/toyota identifies, as we see that right after the commit's hash ID, in parentheses. (This is from the --decorate option, which defaults to "on" in modern Git.)
So, since commit J has the same README.md, git log visits M, doesn't print it, and puts commit J in the queue to walk next, but doesn't put commit L into the queue at all. This is what Git calls History Simplification in action.
Git now visits commit J, as it's the only commit in the queue. Commit J has, as its single parent, commit I, so git log does bother to compare I vs J, specifically to see if README.md changed between this pair of commits. It did, so git log does print commit J. That's how we know (a) that the merge chose J in its history simplification process, and (b) that commit J's hash ID starts with ac7c, which you left in your quote.
Since J has I as its parent, that's the commit that goes into the queue. As the queue was empty, it now has just the one commit in it, and git log goes on to look at commit I. This will repeat until git log runs out of commits, or you get tired of reading its output.
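The walk described above can be observed in a small scratch repository. This sketch is an assumption-laden miniature of your situation: the commit letters, branch names, and file contents are invented so that merge M ends up with the same README.md as its parent J.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo "readme v1" > README.md
echo x > other.txt
git add README.md other.txt
git commit -q -m "I: initial"

git checkout -q -b side
echo y > other.txt
git commit -q -am "L: change other.txt"

git checkout -q main
echo "readme v2" > README.md
git commit -q -am "J: fix README"

git merge -q --no-ff -m "M: merge side" side

all_commits=$(git log --format=%s)               # every commit
readme_only=$(git log --format=%s -- README.md)  # simplified history

echo "--- git log"
echo "$all_commits"
echo "--- git log -- README.md"
echo "$readme_only"
```

The plain git log lists all four commits. With the pathspec, the output should contain only J and I: merge M matches J for README.md, so M is not printed and L is never visited.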
[2] The git diff command is kind of fancy, so it can take none, one, two, or in some cases even more commit specifiers. It can also take pathnames and other arguments. This particular form of git diff takes two commit specifiers, though.
[3] The value stored in origin/HEAD is normally set up by git clone when you do a clone. You can change it using git remote, with its set-head sub-command. The initial setting made by git clone depends on what the Git repository you're cloning has set up as its HEAD. With GitHub, that's usually either master or, since the recent switchover, main, though anyone who's the administrator of some GitHub repository can set whatever they like.
Summary so far
- Git is about commits. Always look for the commit hash IDs, as they're what Git really cares about. If two hash IDs match, that's the same commit.
- Commits store snapshots. What you see and work with are files from the snapshots, at best.
- Git works with the commit graph. Use git log --graph to see it with Git (it's often good to use --oneline and --decorate: remember "DOG", Decorate Oneline Graph, here; modern Git has decorate on by default). Consider using a graphical viewer, if you find those helpful. Be aware that some graphical viewers are better than others. See also Pretty git branch graphs.
- The git log command lies. This is deliberate, and is mostly, usually, a good thing. The only history in a Git repository is the commits in the repository. We often would like to see a "file history". This doesn't exist, but git log can fake one, by selectively lying to us. But if we're trying to figure out why some change got lost, this selective lying gets in the way. (This isn't your actual problem, but it's worth remembering.)
Your actual problem
I also did git clone the repo in a test dir and I can see [the correct README.md in the new clone]
This means that the commit you've checked out, in that new clone, has the correct contents in the file. Git copied the committed file to Git's index, and then on to your working tree. The working tree copy in the new clone shows you what's in the index copy, which is from the committed copy.
If your existing working tree copy doesn't match, that just means that ... your working tree copy doesn't match. That's all. Your working tree copy is yours. You can do whatever you like with it. You can print it out, crumple the printout into a ball, set fire to it, etc. You can remove the file, or encrypt it. Nothing you do to the working tree copy will affect Git's copies: those are safely stored inside commits, read-only, forever unchanging.
You can make new commits that have whatever you like in their README.md files, or that even don't have a README.md file, by changing your working tree copy and running git add README.md. This makes Git make its index copy match your working tree copy, and now a future git commit will save this version of the file.
Or, if you just want your working tree copy wiped out and replaced with a copy extracted from either an existing Git commit, or from Git's index as it appears right now, you can do that too. There is more than one way to do this. The best way in the most modern versions of Git (2.23 or later) is to use the new git restore command.
The git restore command is one of two commands that the Git folks used to break up the git checkout command. The problem is that git checkout is too powerful. It does too many different things. So they split it into git switch, which does about half of the things, and git restore, which does the other half.
To restore a working tree file from the HEAD-commit copy, you would use:
git restore --source HEAD --staged --worktree -- README.md
for instance. (This is the fully spelled out version; shorthand is allowed, but I'll skip it here as this answer is already quite long).
If you don't have this version of Git (2.23 or later), you can achieve the above with:
git checkout HEAD -- README.md
This does in fact still work in Git 2.23 or later, so you can use this form (which is already shorthand) in the most modern Git versions, too.
Note that these wipe out the version of README.md you have in your working tree. Git will not be able to get back any version that was not already committed. To get back the version from some historical commit, rather than from the current or HEAD commit, just replace the source part, HEAD, with the raw hash ID of that commit, or with any of the spellings that will let Git find that hash ID: see the gitrevisions documentation again.
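Before running either command on a file you care about, it may help to try it in a scratch repository first. The sketch below uses invented file contents, scribbles over README.md, and then restores it from HEAD, falling back to the git checkout spelling if git restore is unavailable.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo "committed text" > README.md
git add README.md
git commit -q -m "add README"

echo "scribbles" > README.md               # uncommitted change, about to be lost

# git restore needs Git 2.23 or later; otherwise use the older spelling.
git restore --source HEAD --staged --worktree -- README.md 2>/dev/null ||
    git checkout HEAD -- README.md

restored=$(cat README.md)
status=$(git status --porcelain)           # empty when nothing differs from HEAD

echo "README.md now contains: $restored"
```

After the restore, the working tree and index copies both match the committed one, and the scribbled version is gone for good.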
(The reason git checkout got split up is that the git switch set of operations are the ones that are "safe": Git will check whether you're destroying unsaved work, and tell you so, unless you force the operation with --force. The git restore set are "unsafe": they assume you know that you're telling Git "wipe out my work", and just do it. Putting both under one front end, git checkout, is a recipe for disaster: people learn that git checkout is safe... and it is, until it isn't.)