The linked question (How to remove/delete a large file from commit history in the Git repository?) is appropriate after you fix your index situation. First, though, you need to fix your index situation.
You mention:
At some point as I was doing some git soft reset ...
git reset --soft
does not touch the index (nor your working tree), but can be used to change the commit hash ID stored in HEAD
. If you've done that, you may need to put the correct commit hash ID back into HEAD
, with git reset --soft
again and the correct commit hash ID.
That may suffice to fix everything, since git status
compares HEAD
(which is moveable) against the current index content, and then compares the current index content (which is changeable) against the working tree content (which is also changeable).
What you need to know about HEAD
, Git's index (or "staging area"), and your working tree
Git is really all about commits. It's not about files, though commits hold files. It's not about branches, though branches help you (and Git) find the commits. In the end, Git is all about the commits. So it's the commits that matter. But that should leave you with several questions, including:
- What exactly is a commit anyway?
- How do we find commits?
- How do we make new commits?
- Can we get rid of old commits?
- What is this index thing?
I'm not going to cover some of these here properly, to keep this answer shorter (or shorter for me anyway). But let's start with this, about commits: Commits are numbered. No commit, once made, can ever be changed at all. They are mostly-permanent (but see linked question), and totally read-only.
We (mostly) make new commits by manipulating existing commits. You can make a new commit totally from scratch, but that's usually way too painful for anything except the very first commit ever. So, to make a new commit, we have to take an existing commit, and change something in it. That's a contradiction, by definition: a commit can't be changed, but we need to change something to make a new commit. How do we solve this conundrum?
The answer is simple enough. We don't change the commit. We copy the commit out to something we can change, change that, and use that to make the new commit. So we don't work on commits: we work on stuff copied out of a commit.
Virtually all version control systems do this sort of thing; Git is not really different from SVN or Mercurial or whatever here, in that we first extract some commit, then work on it, then use that to make a new commit.
But Git is different here, for no obvious reason at first. With other version control systems, you extract the commit to a working area, where you work on it, and that's all there is. In Git, you extract the commit to a working area—your working tree or work-tree—but also to a proposed next commit. For historical reasons, Git has three names for this proposed next commit, calling it the "index", or the "staging area", or—a term mostly found in flags like git rm --cached
these days—the "cache".
You then work on the files in your working tree, like you would in any version control system. But when you're satisfied with a working-tree file, you must run git add
on it. You don't have to do this in Mercurial or SVN,¹ because in those systems, the working tree file is the proposed-next-commit version of the file. In Git, you have to do this: the git add
command copies the file back into Git's index, making it ready for the next commit.
¹ Except, that is, for all-new files. That's because, e.g., Mercurial has things called the "dirstate" and "manifest", which play a similar role to Git's index, but Mercurial keeps these hidden so that you don't have to learn about them. Git, by contrast, whips out its index now and then and slaps you in the face with it (Monty Python fish-slapping dance). You aren't allowed to ignore it. The git commit -a
shortcut sometimes almost gets you there, but it's not sufficient: you must learn about Git's index.
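To make the "git add copies the file into the index" step concrete, here is a small sketch in a throwaway repository. The file name file.txt, the temporary directory, and the user.name / user.email values are placeholders I made up for illustration. It relies on git show :path printing the index copy of a file and git show HEAD:path printing the committed copy.

```shell
set -e
repo=$(mktemp -d)                      # scratch repository, safe to delete afterwards
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main  # name the initial branch "main" on any Git version
git config user.name "Example"         # placeholder identity for this sketch
git config user.email "example@example.com"

echo "version 1" > file.txt
git add file.txt
git commit -q -m "first commit"

echo "version 2" > file.txt            # change only the working-tree copy
index_before=$(git show :file.txt)     # ":path" reads the copy held in the index
git add file.txt                       # copy the working-tree file into the index
index_after=$(git show :file.txt)
head_copy=$(git show HEAD:file.txt)    # the committed copy never changes

echo "index before add: $index_before"
echo "index after add:  $index_after"
echo "HEAD commit:      $head_copy"
```

The index copy only changes at the git add step; the commit's copy stays as it was.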
Branch names find commits, and commits find commits
Commits are, as I said, numbered. These numbers look random (though they aren't actually random) and are huge and ugly hexadecimal strings. These are generally unusable by humans, so we don't (use them, that is). These are hash IDs or object IDs (OIDs); Git uses OIDs everywhere, including internally.
Commits are also two-part units. One part holds a snapshot of every file, stored in a special, read-only, Git-only, compressed and de-duplicated fashion. The de-duplication takes care of the fact that most commits mostly re-use the files from earlier commits: this keeps the commits from taking huge amounts of space. (In fact, if you make a new commit that undoes what some previous commit did, the stored files for the new commit may take no space at all, since they're now all duplicates.) You don't have to worry about how Git does this: this part works great and doesn't whack you over the head the way the index does.
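If you'd like to see the de-duplication for yourself, the following sketch (file name f and the identity settings are made up) makes a commit, changes the file, then makes a third commit that undoes the change. The snapshot of the third commit should be the very same tree object as that of the first, which is why it needs no new file storage.

```shell
set -e
repo=$(mktemp -d)                      # scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"         # placeholder identity
git config user.email "example@example.com"

echo one > f
git add f
git commit -q -m "A"
echo two > f
git add f
git commit -q -m "B"
echo one > f                           # undo what commit B did
git add f
git commit -q -m "undo B"

tree_a=$(git rev-parse 'HEAD~2^{tree}')   # snapshot object of commit A
tree_b=$(git rev-parse 'HEAD~1^{tree}')   # snapshot object of commit B
tree_c=$(git rev-parse 'HEAD^{tree}')     # snapshot object of "undo B"

echo "A:      $tree_a"
echo "B:      $tree_b"
echo "undo B: $tree_c"
```

The first and third lines of output show the same object ID: the two commits differ, but they share one stored snapshot.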
The other part of each commit is its metadata, or information about the commit itself. This contains stuff like the name and email address of the person who made the commit, some date-and-time stamps, and a log message. When you make a new commit, you supply the log message, and your user.name
and user.email
settings supply the name and email address. That's all pretty straightforward, but there's one part here that isn't: Git adds, to this metadata, a list of parent commit hash IDs. For most commits, there's exactly one parent.
When you make a new commit, you're doing so by working on some existing commit. Git stores, in your new commit, the hash ID of the commit you chose earlier to work on. So your new commit has that commit's hash ID as its parent. Then Git writes the new commit's hash ID into the current branch name.
This deserves a bit of illustration. Suppose we have the following chain of commits:
... <-F <-G <-H <--main (HEAD)
where H
stands in for the most recent commit's hash ID, and H
is the commit we've checked out. main
is our branch name, and the name main
holds H
's hash ID, which is how Git found H
, when we said git checkout main
or git switch main
.
Commit H
stores, in H
's metadata, earlier G
's hash ID. We say that H
points to G
, hence the arrow in the drawing from H
, pointing to G
. Commit G
is thus the parent of commit H
. Both G
and H
have full snapshots of every file (with de-duplication), so Git can compare the two snapshots to see what changed between G
and H
. And, G
being a commit, G
has in its metadata the hash ID of its parent commit F
. F
points back to yet another earlier commit, and so on.
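You can look at a commit's metadata directly with git cat-file -p. A minimal sketch, using a scratch repository with a made-up file f and two commits whose messages are just G and H to mirror the drawing:

```shell
set -e
repo=$(mktemp -d)                      # scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"         # placeholder identity
git config user.email "example@example.com"

echo one > f
git add f
git commit -q -m "G"
g=$(git rev-parse HEAD)                # remember G's hash ID

echo two > f
git add f
git commit -q -m "H"

# Print H's raw contents: a tree line (the snapshot), a parent line,
# author and committer lines, then the log message.
git cat-file -p HEAD

# Pull out just the parent line; it should hold G's hash ID.
parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')
echo "parent of H: $parent"
echo "G:           $g"
```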
Anyway, we now manipulate files in our working tree and in Git's index, and make a new commit, which gets a new, unique, random-looking hash ID we'll just call I
. New commit I
points back to existing commit H
:
... <-F <-G <-H <--main (HEAD)
               \
                I
and the very last step of git commit
is that Git writes I
's hash ID, whatever it is, into the name main
:
... <-F <-G <-H
               \
                I <--main (HEAD)
and so now main
points to commit I
instead of commit H
.
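Here is a sketch of that last step of git commit, again in a scratch repository with invented file contents. It records where main points before and after a new commit, and checks the new commit's parent.

```shell
set -e
repo=$(mktemp -d)                      # scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"         # placeholder identity
git config user.email "example@example.com"

echo one > f
git add f
git commit -q -m "H"
h=$(git rev-parse main)                # main holds H's hash ID

echo two > f
git add f
git commit -q -m "I"
i=$(git rev-parse main)                # main now holds I's hash ID
parent_of_i=$(git rev-parse 'main^')   # I's (first) parent
attached_to=$(git symbolic-ref --short HEAD)

echo "main before: $h"
echo "main after:  $i"
echo "parent of I: $parent_of_i"
echo "HEAD is attached to: $attached_to"
```

HEAD stays attached to main the whole time; it is the hash ID stored in main that changes.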
git reset
, with --hard
, --mixed
, and --soft
What git reset --soft
does is allow you to move the branch name. What git reset
does in general is ... absurdly complicated.
Let's draw a more complicated and useful Git graph:
          I--J <-- br1
         /
...--G--H <-- main (HEAD)
         \
          K--L <-- br2
Here, we have a repository with three branch names, main
, br1
, and br2
. The name HEAD
is currently attached to the name main
, which selects commit H
. The names br1
and br2
select commits J
and L
respectively.
If we run git merge --ff-only br1
, we end up with:
          I--J <-- br1, main (HEAD)
         /
...--G--H
         \
          K--L <-- br2
If that was a mistake, we can run:
git reset --hard HEAD~2
(the ~2
means count back two first-parent links; I won't go into a lot of detail here, and won't cover what the --ff-only
meant either) and we'll be back to this:
          I--J <-- br1
         /
...--G--H <-- main (HEAD)
         \
          K--L <-- br2
It's as if nothing happened. The --hard
here affected both Git's index and our working tree.
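The fast-forward and its undo can be replayed in a scratch repository. This is only a sketch: the files h.txt, i.txt and j.txt and the identity settings are invented, and br2 is left out to keep it short.

```shell
set -e
repo=$(mktemp -d)                      # scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"         # placeholder identity
git config user.email "example@example.com"

echo H > h.txt
git add h.txt
git commit -q -m "H"
h=$(git rev-parse HEAD)

git checkout -q -b br1                 # build I and J on br1
echo I > i.txt
git add i.txt
git commit -q -m "I"
echo J > j.txt
git add j.txt
git commit -q -m "J"
j=$(git rev-parse HEAD)

git checkout -q main
git merge -q --ff-only br1             # main slides forward to J
after_merge=$(git rev-parse main)

git reset -q --hard HEAD~2             # main goes back two first-parent links, to H
after_reset=$(git rev-parse main)

echo "main after merge: $after_merge (J is $j)"
echo "main after reset: $after_reset (H is $h)"
ls                                     # the files that came from I and J are gone again
```

Commits I and J are untouched by all this; br1 still finds them.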
Here's what actually happened:
First, git reset
does the --soft
step. We give it a commit hash ID, such as the raw hash ID of commit H
, or a relative commit instruction like HEAD~2
. Anything that the git rev-parse
command will take is usable here. Git finds that commit, such as commit H
. It then makes the branch name to which HEAD
is attached point to that commit. So now main
points to H
.
Then, if we let it—if we use --mixed
or --hard
—git reset
resets Git's index. It does this by removing all the files that came from the commit we were on (J
) and installing instead all the files that came from the commit we moved to (H
).
Then, if we tell it to—if we use --hard
—git reset
resets our working tree. For all the files it ripped out of Git's index and replaced with files from H
, it rips those files out of our working tree and replaces them with files extracted from commit H
.
So that's how git reset --hard
puts us back to before the git merge --ff-only
. It:
- moves the branch name (
--soft
); then
- updates Git's hidden index / proposed-next-commit (
--mixed
); then
- updates our working tree (
--hard
).
Using the --mixed
or --soft
flags just makes git reset
stop earlier: after the second step or after the first step, respectively.
(Note that git reset
has other modes of operation. If this were all it did, it wouldn't be so absurdly complicated.)
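The three stopping points can be compared side by side. In this sketch (one invented file f whose content is the letter of the commit that stored it), a hypothetical helper function state prints the content of f as seen in the HEAD commit, in the index, and in the working tree after each kind of reset from J back to H.

```shell
set -e
repo=$(mktemp -d)                      # scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"         # placeholder identity
git config user.email "example@example.com"

echo H > f
git add f
git commit -q -m "H"
echo J > f
git add f
git commit -q -m "J"
j=$(git rev-parse HEAD)

# Helper for this sketch: show f in the HEAD commit, the index, and the working tree.
state() {
    echo "HEAD=$(git show HEAD:f) index=$(git show :f) worktree=$(cat f)"
}

git reset -q --soft HEAD~1             # step 1 only: move the branch name
soft=$(state)
git reset -q --hard "$j"               # put everything back at J

git reset -q --mixed HEAD~1            # steps 1 and 2: also reset the index
mixed=$(state)
git reset -q --hard "$j"

git reset -q --hard HEAD~1             # steps 1, 2 and 3: also reset the working tree
hard=$(state)

echo "--soft:  $soft"
echo "--mixed: $mixed"
echo "--hard:  $hard"
```

Reading the output top to bottom, one more column switches from J to H with each stronger flag.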
Note that if you were to now use git reset
to point to commit L
, you would have:
          I--J <-- br1
         /
...--G--H
         \
          K--L <-- br2, main (HEAD)
What, if anything, happens to Git's index and your working tree depends on the flags you give to git reset
.
(The hash IDs of the various commits you've reset to get stored in the HEAD
reflog, so git reflog
will show them. This is a way to find which commit you want to go back to, if you accidentally reset away the hash ID you can't now find. Use the reflogs to find hash IDs that you have lost. Note that the hash IDs are really difficult to remember: you might want to run git show hash
or git log -1 hash
or similar, using cut-and-paste for the hash IDs, before using git reset --soft
, to find out which hash ID holds which commit of interest.)
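As a sketch of that recovery, the following scratch-repository example (file name and identity invented) moves main back by one commit with git reset --soft, then uses the HEAD reflog to find and restore the hash ID that was "lost". In a real repository you would inspect the git reflog output and pick the entry you want rather than assuming it is HEAD@{1}.

```shell
set -e
repo=$(mktemp -d)                      # scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"         # placeholder identity
git config user.email "example@example.com"

echo H > f
git add f
git commit -q -m "H"
echo J > f
git add f
git commit -q -m "J"
j=$(git rev-parse HEAD)

git reset -q --soft HEAD~1             # the "accident": main now points to H

git reflog -n 3                        # recent values of HEAD, newest first
previous=$(git rev-parse 'HEAD@{1}')   # where HEAD was just before the reset
git log -1 --format='%h %s' "$previous"   # check which commit that hash ID names

git reset -q --soft "$previous"        # put the hash ID back into the branch name
restored=$(git rev-parse HEAD)
echo "restored HEAD: $restored"
```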
git status
and other similar comparators
The git status
command works in part by running two git diff
s.
The first of these two diffs is:
git diff --staged --name-status
which compares whatever commit HEAD
names—all the files stored in that commit, that is—to the files in Git's index. Since these files are normally copied out of that commit, any file we didn't update since then will match. Git won't say anything at all about the matching files.
If we did update some file (e.g., with git add
, which I haven't covered here), the file might not match. Then git status
will say that the index copy of the file is a change to be committed.
If we move HEAD
(and the current branch name) around without changing the index content, we'll have the two out of sync, and many files might be changed, or even deleted. For instance, if we move main
backwards from J
to H
, but leave the index alone, all the files that are different between H
and J
will show up.
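This out-of-sync state is easy to reproduce. In the sketch below (files a.txt and b.txt are invented), commit J modifies one file and adds another relative to H. Moving main back to H with --soft leaves the index holding J's files, so the first diff reports both as changes to be committed, while the index-versus-working-tree diff reports nothing.

```shell
set -e
repo=$(mktemp -d)                      # scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"         # placeholder identity
git config user.email "example@example.com"

echo "from H" > a.txt
git add a.txt
git commit -q -m "H"

echo "changed in J" > a.txt            # J modifies a.txt ...
echo "new in J" > b.txt                # ... and adds b.txt
git add a.txt b.txt
git commit -q -m "J"

git reset -q --soft HEAD~1             # move main from J back to H, index untouched

staged=$(git diff --staged --name-status)   # HEAD commit vs. index
unstaged=$(git diff --name-status)          # index vs. working tree

echo "changes to be committed:"
echo "$staged"
echo "changes not staged for commit: ${unstaged:-none}"
```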
The second comparison git status
does compares the files in Git's index to those in your working tree. This is a lot like running git diff --name-status
with no options. For each file that matches, Git will say nothing at all. Where files are different—where you've modified a working tree file, but not yet run git add
on it—Git will list the file as a change not staged for commit.
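The two comparisons can be run by hand next to git status. A minimal sketch, with invented files a.txt (modified and added) and b.txt (modified but not added):

```shell
set -e
repo=$(mktemp -d)                      # scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"         # placeholder identity
git config user.email "example@example.com"

echo a > a.txt
echo b > b.txt
git add a.txt b.txt
git commit -q -m "initial"

echo "a, changed and added" > a.txt
git add a.txt                          # index copy of a.txt now differs from HEAD
echo "b, changed but not added" > b.txt   # only the working-tree copy of b.txt differs

staged=$(git diff --staged --name-status)   # first diff: HEAD commit vs. index
unstaged=$(git diff --name-status)          # second diff: index vs. working tree

echo "staged:   $staged"
echo "unstaged: $unstaged"
git status --short                     # compact view of the same two comparisons
```

In the short status output, the first column reflects the HEAD-versus-index comparison and the second column the index-versus-working-tree comparison.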
(There's a big complicated section here that I will omit for space reasons, talking about how files that are in your working tree, but aren't in Git's index, are untracked files. Git would complain about these unless they're listed in .gitignore
. The .gitignore
entries don't actually make Git ignore the files, so .gitignore
is a misnomer. But for space reasons I am omitting all of this here.)