I've got several files in my local master branch that were added by mistake and now I want to delete them to keep the branch 'clean'. What I don't want is to stage and commit the deletions, as then my local master will be out of sync with the remote master. Is there any way of deleting the files without Git knowing about them?
-
If they were added by mistake, why wouldn't you want to update the repository? – Jan Wilamowski Sep 28 '21 at 09:37
-
I just want the files gone, with no need to create a commit that I'd then have to push to the remote master to keep everything in sync (and I know I can't push to the remote master). @JanWilamowski, I don't see why I would update the repository, but I'm relatively new to Git, so I might not be seeing the obvious reason. – alfavictor Sep 28 '21 at 15:56
-
Git helps you to share code. It's unusual (although possible) to side-step this by customizing "your" version of the repository. However, it would be much more natural to fix the issue in the shared version, for everyone. If that is not possible, please provide more information. – Jan Wilamowski Sep 29 '21 at 03:34
-
@JanWilamowski The files I wanted to remove weren't just in my work-tree, they were in Git, as they'd been (wrongly) added and committed earlier. It was too late to simply remove them from the index with 'git rm files'. But I've sorted it out now (see below) – alfavictor Oct 02 '21 at 13:01
1 Answer
I suspect you have a fundamental misunderstanding of what Git is, and what Git does for (and to) you. This is leading you down the garden path. The fact is, nothing you do is in a Git repository unless and until you commit it.
From a high-level point of view, Git is really a database, or more precisely, two databases. One—usually the bigger one by far—is made up of commits and other supporting objects. This database is, or at least can be, distributed as well: there are multiple copies, such as one over on GitHub, one on your laptop, one on Fred's laptop, a fourth one over there on the left, and fifty more scattered about. You can, if you like, pick one and claim that that's the real one, but all of them are equal as far as Git itself is concerned. Or, more precisely, whichever one you're using is the real one, and all the others are the inferior copies.
Hence, the key to working with a Git repository is the commits. The commits are the reason this Git repository exists. Git is all about the commits. But what exactly is a commit? To answer that, we have to break it down a little:
- Each commit is numbered. The numbers are weird, though. They don't just count: we don't have commit #1, followed by commit #2, then #3, and so on. That would work if there were only one repository, or if one of the repositories was "the real one", because then that repository would be the one handing out the numbers. If you wanted a new commit you'd have to go to the (real) repository and get the next number. Git doesn't do that, so instead, Git assigns each new commit a new and unique but random-looking and huge and ugly number, usually expressed in hexadecimal.
- Each commit stores every file, like an archive. So if you have the commit's big ugly number, you have your Git reach into its database of all-commits-we-have. You have Git fish out the commit by that number and, poof, there are all the files you need.
- And, each commit stores some metadata: information about the commit itself, such as who made it and when. There's some crucial-to-Git-itself information in this metadata that lets Git chain commits together, so that given the latest commit number, Git can find all the earlier commits.
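If you'd like to see these three parts for yourself, here is a minimal sketch. It assumes a throwaway repository in a temporary directory; the file name, identity, and commit messages are made up for illustration.

```shell
set -e   # stop at the first failing command (run this as a script)

# Throwaway repository; nothing here touches your real work.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # made-up identity
git config user.email "user@example.com"

echo "hello" > readme.txt
git add readme.txt
git commit -q -m "first commit"
echo "more" >> readme.txt
git commit -q -am "second commit"

# The commit's "number": its hash ID (40 hex digits in a SHA-1 repository).
oid=$(git rev-parse HEAD)
echo "commit: $oid"

# The raw commit object: a tree line (the snapshot), a parent line
# (the previous commit), author and committer lines, then the message.
git cat-file -p "$oid"

# The parent line is what lets Git chain commits together.
parent=$(git cat-file -p "$oid" | sed -n 's/^parent //p')
echo "parent: $parent"
```

The value printed as the parent is the hash ID of the first commit, which is how Git gets from the latest commit back to the earlier ones.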
It's possible that, given some commit number—Git calls this a hash ID or object ID or OID—your Git doesn't have that commit. In this case, there's nothing under that number: then you have to connect your Git to some other Git that does have that commit, give them that number, and have them deliver to you the commit and all its files. And, because each commit stores, in its metadata, the raw commit number of its predecessor commit, by default, you just have to give them the latest numbers and you'll get every commit. So usually, each repository has all the commits—and if you're missing some, you just connect your Git to theirs and get their latest, and now you have all the commits.
This means it's really easy to always have every commit. You just run git fetch to get any new commits they've added since the last time you got the commits. So that's the first thing Git does for you: after git clone, which gets you all their commits as of that moment, a later git fetch gets you all their new commits. This usually goes really fast compared with the first clone, which may take many seconds or even a few minutes.[1]
Of course, since Git is all about the commits, you might want to make commits of your own. Once you do, you'll generally want to send these commits to other clones of this Git repository. That's the second thing Git does for you: it allows you to send your commits to some other Git repository, using git push. This is sometimes as easy as git fetch, and sometimes a little harder, for various reasons.
What if you don't want every commit ever? Git's first answer to that is: tough _____ (fill in the blank with your favorite expletive, or whatever). But there are some ways to limit how much you get, if you really need to. This mainly isn't how Git is meant to be used and it comes with certain limitations. The thing to realize is that Git isn't a software release tool. It's a distributed version control system. These have fundamentally different goals, though there's some overlap.
Now, I said there are two databases, and I've only described one: Git's object database, which stores the commit objects and other supporting objects. The other database that Git provides exists because the names of commits are those big ugly hash IDs, or OIDs, like 9cfc797623711f4279e0cb86360236a1b8b7b16e for instance. These things are useful to computers, but they're quite harmful to humans. Try eating these things all day, and you'll barf your guts out. We'd rather have names, like main or master, or version numbers like v2.4.0. So the second database that a Git repository provides is a database of names. Each name stores one of those big ugly OIDs. We can then refer to a commit by name: git switch develop gets us the latest commit from branch develop, for instance. This invites the obvious question, What exactly do we mean by "branch"? Unfortunately Git has multiple conflicting answers.
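To make the "database of names" a bit more concrete, here is a small sketch, again in a throwaway repository. The branch name main and the tag v2.4.0 are just the examples from the text, and the file name is made up.

```shell
set -e   # stop at the first failing command (run this as a script)

repo=$(mktemp -d)
cd "$repo"
git init -q
# Name the first branch "main" regardless of the local default.
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # made-up identity
git config user.email "user@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "initial"
git tag v2.4.0

# Each name resolves to exactly one hash ID.
git rev-parse main
git rev-parse v2.4.0

# The whole names database: one hash ID per name.
git for-each-ref --format='%(objectname) %(refname)'
```

Both names print the same hash ID here, because the branch and the tag were created while the same commit was current.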
The funny thing about this last database is that it's not consistent between different clones of a Git repository. In particular, your branch names are yours, different from the branch names in any other repository. This winds up being one of the things Git does both for and to us, which creates a lot of confusion.
So, at this point in the answer, here's what you need to remember:
- A Git repository is two databases: objects (commits and others), and names such as branch and tag names.
- You typically get a repository by cloning some existing one. This copies all of their objects, but doesn't quite copy their names.
- Specifically, your branch names aren't anyone else's branch names. When you clone a repository, their branch names become different names in your clone.
- But to some extent that's not important. The important part of a repository is the commits. The commits are found by hash IDs AKA OIDs. The names, if any, are just there to let humans remain sane.
- So, to a first approximation anyway, what's in the repository is the commits. The commits are the be-all and end-all; the other parts are just there to support the commits.
- Commits are entirely read-only. This is required to support the numbering scheme for the hash IDs / OIDs. This means that everything in a commit is literally frozen for all time. That has one really big consequence.
[1] When I first started using Git, we were lucky if we had network connection bit rates that got us a repository in less than an hour, sometimes. (I had DSL running at about 300 kbit/s effective. See also this Quartz article.) Big repositories can still take a long time to clone, like downloading DVD images, and if your network is really slow you might feel it a lot. Fortunately git fetch is still normally pretty fast.
Git's index and your working tree
Given that commits are frozen for all time, how can we get any work done? We literally can't change anything about a commit. So, while a commit holds a full snapshot of every file, it's not useful, yet. We might also note that the snapshots are in a special Git-only compressed form, with files de-duplicated, which takes care of the obvious objection to storing every file over and over again. But this just makes things worse: not only are the files not writable, they're not even readable except by Git itself.
So, to make things usable, what Git does is extract a commit. When you run:
git checkout main
or:
git switch develop
you're picking one commit that Git should copy out of the repository. This copy goes into your working tree. Another "copy" of sorts goes into Git's index, which is a central and crucial part of making new commits, but we'll only touch lightly on the index here.
While your working tree, or work-tree, is where you see and work with your files, it is not in Git. This is the crux of your question: if you've created files in your work-tree, they're just files in your work-tree. They are not in Git at all. If you've never put this file into Git, it is just sitting there in your working tree. Remove it, and it's gone. It was never in Git at all.
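As a quick check of that claim, here is a sketch in a throwaway repository (the file names are hypothetical). A file that was never added shows up only as untracked, and deleting it with an ordinary rm leaves nothing for Git to record.

```shell
set -e   # stop at the first failing command (run this as a script)

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # made-up identity
git config user.email "user@example.com"

echo tracked > kept.txt
git add kept.txt
git commit -q -m "initial"

# A file created in the work-tree and never given to Git.
echo scratch > mistake.txt
git status --porcelain                  # prints: ?? mistake.txt

# Ordinary file removal; no git command involved.
rm mistake.txt
status_after=$(git status --porcelain)  # empty: nothing to stage or commit
echo "status after rm: '$status_after'"
```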
That may be the only part of the answer you need, but read on.
The index I described above is heavily involved in making new commits. Git also uses it to manage parts of your working tree. This thing, this index, is so important—and/or so poorly named—that it actually has three names in Git: the index (the name we've seen several times now), the staging area, and the cache. The name staging area refers to how you use it: you "stage files" for commit by having Git copy them into the index.
The index starts out, at git checkout or git switch time, with a "copy" of every file from the commit. I put "copy" in quotes here, and earlier, because the files that are in Git's index are in Git's compressed-and-de-duplicated form, like the files in a commit. What's different about the ones in a commit vs the ones in the index is that the ones in the index can be replaced, removed, or added-to. That is, these ones aren't read-only. (The underlying data bytes are read-only, when what's in the index is already a duplicate of what's already in some existing commit, but Git can take that file out of the index and put in a different one with the same name.)
When you run git add on a file, you are telling Git: read the work-tree version of this file; compress and Git-ify it, and see if it's a duplicate. If it is a duplicate, Git will link the right duplicate into the index, ready to be put into the next commit. Otherwise, Git will save these data bytes temporarily and link them into the index, ready to be put into the next commit.
You can also run git rm file: this removes the named file from both Git's index and your working tree. If you just want to remove it from Git's index, leaving the working tree alone, you need to git rm --cached file (this is where the old name "cache" shows up). That removes the copy (or "copy") from the index, but leaves the working tree file alone.
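The difference between the two forms is easy to see side by side. This is a sketch in a throwaway repository with two made-up files, a.txt and b.txt.

```shell
set -e   # stop at the first failing command (run this as a script)

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # made-up identity
git config user.email "user@example.com"

echo a > a.txt
echo b > b.txt
git add a.txt b.txt
git commit -q -m "add a and b"

git rm -q a.txt            # gone from the index AND from the working tree
git rm -q --cached b.txt   # gone from the index only

git ls-files               # prints nothing: the index no longer lists either file
ls                         # prints: b.txt (still in the working tree)
```

Neither removal is in a commit yet; both are only changes to the proposed next commit.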
Note that since the index started out with the files from the current commit, any update to the index just updates the proposed next commit. Initially, the proposed next commit is a copy of the current commit. Adding new files, or adding updated files, or removing files, updates the proposed next commit. Hence Git's index can be viewed as the proposed next commit. You manipulate this index, using your working tree to make files that you can actually see and edit. Git builds the next commit from whatever is in Git's index, not from what is in your working tree. So the files in your working tree are literally not in Git. The index "copies", once you git add a file, are (at least temporarily) in Git: that temporary existence becomes permanent once you run git commit.[2]
This also lets us define what an untracked file is, in Git. The definition is very simple; all the complications come from the fact that the index itself is complicated. An untracked file, in Git, is one that is in your working tree right now but not in Git's index right now. Since you can change what is in your working tree any time, using ordinary computer file operations, and you can change what is in Git's index any time using git add and git rm, the set of tracked and untracked files can change, even if you never make any commits or change from one commit to another. But the definition of untracked file itself is simple.
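Here is a sketch of that definition in action, using a throwaway repository and a made-up file notes.txt. The same file flips between untracked and tracked purely because of what is in the index, with no new commits made along the way.

```shell
set -e   # stop at the first failing command (run this as a script)

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # made-up identity
git config user.email "user@example.com"

echo base > base.txt
git add base.txt
git commit -q -m "initial"

echo draft > notes.txt

# In the working tree, not in the index: untracked.
untracked_before=$(git ls-files --others --exclude-standard)
echo "before add: $untracked_before"       # notes.txt

# Now in the index: tracked.
git add notes.txt
untracked_mid=$(git ls-files --others --exclude-standard)
echo "after add: $untracked_mid"           # (empty)

# Out of the index again, still in the working tree: untracked again.
git rm -q --cached notes.txt
untracked_after=$(git ls-files --others --exclude-standard)
echo "after rm --cached: $untracked_after" # notes.txt
```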
[2] It is possible, though generally quite difficult, to get rid of commits. So the files are only as permanent as the commits themselves. They are fully read-only at all times, though.
Untracked files, git status, and ignored files
If you've created new files in your working tree and never added them to Git, they are necessarily untracked files. They never went into any commit, and you have not added them to Git's index. The git status command would normally report these as untracked. But, to make git status useful, we can tell Git to shut up about certain untracked files.
Listing a file's name, or a glob pattern like *.pyc or *.o, in a .gitignore file tells Git: when this file is untracked, don't bother me about it. It does not cause the file to be untracked, it just makes Git shut up when it is.
The git add command allows you to add "every file" or "every file matching some pattern", e.g., git add . or git add *. When you use these kinds of en-masse "add a bunch of files" operations, it's helpful if git add doesn't add files that are currently untracked and are ignored. So .gitignore also means: if this file is currently untracked and an en-masse add would pick it up, don't actually add it.
This means that .gitignore is the wrong name for the file. It should be called .git-do-not-complain-about-these-files-if-they-are-untracked-and-if-they-are-untracked-and-I-use-an-en-masse-add-to-try-to-add-everything-do-not-add-these-files-either, or something like that. But that name is just ridiculous, so .gitignore it is.
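Both effects of .gitignore can be seen in a short sketch. It assumes a throwaway repository, the *.o pattern from the text, and two made-up files, main.o and notes.txt.

```shell
set -e   # stop at the first failing command (run this as a script)

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # made-up identity
git config user.email "user@example.com"

printf '*.o\n' > .gitignore
git add .gitignore
git commit -q -m "add ignore rules"

# Two new untracked files; only one matches the ignore pattern.
touch main.o notes.txt

# Effect 1: git status stays quiet about the ignored one.
before=$(git status --porcelain)
echo "$before"                  # prints: ?? notes.txt

# Effect 2: an en-masse add skips the ignored one.
git add .
git ls-files                    # prints: .gitignore and notes.txt
```

Note that main.o is still sitting in the working tree afterwards. Ignoring a file changes what Git says and what an en-masse add picks up, not what is on disk.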
Your branch names aren't their branch names
Suppose you and a colleague (let's call him Bob) are doing work in some repository. You both clone some GitHub repository, so that both of you have all the commits in your laptops. You both git checkout main or git checkout master and start working.
You make a new commit, where you've modified one file and added some other new file. Meanwhile, Bob also makes a new commit, where he's modified one file and removed a file. These two new commits have two new and unique and random-looking hash IDs. So maybe the commit that you both started with was a123456 (for short), and your new commit is feedcab and Bob's is bedbead.
It's time to talk now about how branch names help you, and Git, find commits. I mentioned earlier that the metadata inside each commit includes information that lets Git chain commits together. So commit a123456 holds inside it the raw hash ID of some earlier commit, such as badf00d. The earlier commit badf00d has metadata that, inside itself, holds the raw hash ID of an even-earlier commit, deadace, and so on.
These linkages form a backwards-looking chain. If the hash ID of some commit is H, and H contains in its metadata the hash ID of earlier commit G, which contains in its metadata the hash ID of still-earlier commit F, and so on, we can draw that chain, like this:
... <-F <-G <-H
Each commit stores both a full snapshot of all files, and the hash ID of some earlier commit, so by extracting both commits H and G, Git can compare G's snapshot to H's. This tells you what changed between commits G and H, which is often more interesting to a human than the exact set of files in G and H.
Similarly, by moving back one step to G, Git can now extract both F and G and see what changed. So this allows a command like git log -p to show you each commit along with what changed in that commit. To figure out what changed, Git looks backwards one hop. Then, having shown you what changed, git log moves backwards one hop, and repeats the process. This goes on and on until either you get tired of it and quit out of git log, or Git has moved backwards through all of history, to get to the first commit ever. That commit (call it commit A, somewhere off the left edge of our little diagram) has no earlier (parent) commit, so that lets git log stop going backwards.
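A small sketch of this backwards walk, in a throwaway repository with three made-up commits. The format placeholders %h (short hash), %p (short parent hashes), and %s (subject) are standard git log ones.

```shell
set -e   # stop at the first failing command (run this as a script)

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # made-up identity
git config user.email "user@example.com"

for n in 1 2 3; do
  echo "line $n" >> story.txt
  git add story.txt
  git commit -q -m "commit $n"
done

# One line per commit, newest first: "<commit> <parent>".
# The last line has no parent: that is the first commit ever.
git --no-pager log --format='%h %p'

# The same walk, showing what changed in each commit relative to its parent.
git --no-pager log -p --format='--- %s'

count=$(git rev-list --count HEAD)
echo "commits reachable from HEAD: $count"   # 3
```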
To kick this whole thing off, though, Git needs to know the hash ID of the last commit in the chain, commit H. To get that hash ID, Git will generally use a branch name. Let's say the branch name is main:
...--F--G--H <-- main
The name main contains only the hash ID H. Everything else flows from there: commit H itself contains hash ID G, which contains hash ID F, and so on, backwards, down the line.
When you, using your laptop, make a new commit (let's call it I, of course) in your repository, that commit gets a seemingly-random hash ID (feedcab, we said, but now we're just calling it I for even-shorter). Your commit I stores, in its metadata, the hash ID of earlier commit H:
...--F--G--H--I
In your repository, Git now writes I's hash ID into the name main:
...--F--G--H--I <-- main
So your main points not to H any more, but to I.
Meanwhile, Bob made his new commit: bedbead, but let's just call it J for even-shorter. So in Bob's repository, Bob has:
...--F--G--H--J <-- main
Bob's main points to J.
In order to combine your work and Bob's work, we'll need to gather, in some repository somewhere (this could be yours, or Bob's, or yet another clone), all of the commits. When we do that, we can't use the name main alone any more, because each name is only allowed to select one commit. So I might do this:
             I   <-- alfavictor
            /
...--F--G--H   <-- main
            \
             J   <-- bob
in my repository, if I were doing this. Or I could name my branches with slashes, to help remind me whose main each one is:
             I   <-- alfavictor/main
            /
...--F--G--H   <-- main
            \
             J   <-- bob/main
or something similar.
Since my branches are mine to do with as I please, I get to make up these names. My Git will, however, create remote-tracking names. When I get commits from a shared GitHub repository, I will probably have my Git take their branch names and turn them into remote-tracking names of the form origin/*, where the * matches their branch names. So I'll have:
...--F--G--H <-- main, origin/main
where my main and my origin/main both select commit H.
If one of you (let's say Bob) runs git push to send his new commit to GitHub first, and GitHub accepts it and puts it into the GitHub clone, and I then pick it up, I'll get this in my repository:
...--F--G--H   <-- main
            \
             J   <-- origin/main
If you, alfavictor, now run git fetch origin to pick up new commits from the origin repository over on GitHub, you will wind up with this in your repository:
             I   <-- main
            /
...--F--G--H
            \
             J   <-- origin/main
These are branches, of a sort. You can now use Git's facilities to combine your work with Bob's work. (This part gets complicated, and we won't cover it here.)
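The whole scenario can be replayed locally as a sketch. Everything here is an assumption made for illustration: a bare repository on disk stands in for GitHub, the directories you and bob stand in for the two laptops, and the commit messages H, I, and J are just the letters from the drawings.

```shell
set -e   # stop at the first failing command (run this as a script)

# Made-up identity, used for every commit below.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="ex@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="ex@example.com"

work=$(mktemp -d)
cd "$work"

# Build commit H, then make a bare copy that plays the role of GitHub.
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/main
echo base > seed/shared.txt
git -C seed add shared.txt
git -C seed commit -q -m "H"
git clone -q --bare seed "$work/origin.git"

# You and Bob each clone it: both start with main and origin/main at H.
git clone -q "$work/origin.git" you
git clone -q "$work/origin.git" bob

# Each of you makes one new commit on your own main.
echo yours > you/yours.txt
git -C you add yours.txt
git -C you commit -q -m "I"

echo bobs > bob/bobs.txt
git -C bob add bobs.txt
git -C bob commit -q -m "J"

# Bob pushes first, so the shared repository's main now names J.
git -C bob push -q origin main

# You fetch: your origin/main moves to J, your own main stays at I.
git -C you fetch -q origin
git -C you --no-pager log --oneline --graph main origin/main

# The commit where the two lines of work split apart.
base=$(git -C you merge-base main origin/main)
git -C you --no-pager log -1 --format='split point: %s' "$base"
```

Nothing has been combined yet at this point. Your main and your origin/main simply name two different commits that share H as a common ancestor.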
Note that if you never intend to send commits back to GitHub or wherever, you don't have to. But if you want Git to keep track of your files, you will need to git add and git commit them. This makes a read-only snapshot that will live forever, or at least as long as your commits exist. If you git push these new commits somewhere (such as to a "fork" on GitHub), then even if disaster strikes and your laptop explodes into a ball of flame, you can get your files back from GitHub, by cloning your own fork.
If you've added and committed a file by mistake, your best bet is usually to remove and commit the removal. You'll have one commit whose snapshot has the file, followed by a later commit whose snapshot omits the file. These commits are all read-only, and as permanent as any commit—each exists as long as its hash ID lets Git retrieve it from somewhere—but checking out the latest one gets you a commit without that file.
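As a sketch of "remove and commit the removal", in a throwaway repository with made-up file names:

```shell
set -e   # stop at the first failing command (run this as a script)

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # made-up identity
git config user.email "user@example.com"

echo code > app.txt
echo junk > added-by-mistake.txt
git add app.txt added-by-mistake.txt
git commit -q -m "add app (and, by mistake, a junk file)"

# Remove the file from the index and working tree, then commit the removal.
git rm -q added-by-mistake.txt
git commit -q -m "remove file added by mistake"

git ls-tree --name-only HEAD      # latest snapshot: app.txt only
git ls-tree --name-only HEAD~1    # earlier snapshot still has both files
```

The earlier commit is unchanged, so the file can still be recovered from it, but anyone checking out the latest commit no longer sees the file.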
If you don't want the file to show up anywhere (e.g., if it has sensitive data), you can combine these two commits together, into a new third commit:
             I--J   <-- oops
            /
...--F--G--H   <-- main
            \
             IJ   <-- fixed
where branch oops records commit J, which fixes the mistakenly added file in commit I, and then branch fixed (which you grow from main) combines I and J into a single commit IJ that never adds the file. You then delete branch name oops:
             I--J   ???
            /
...--F--G--H   <-- main
            \
             IJ   <-- fixed
Without the name, nobody knows the hash ID of commit J, so they can't find it. Without that, they can't find commit I either. They can find H (main makes that easy) and they can find IJ through the name fixed, but they can't find J or I. And, if you never git push commits I and J to some other Git, those two commits will remain only in your repository. Eventually (after a minimum of 30 days by default) your own Git will decide that you aren't ever coming back for those abandoned commits, and will sweep them away.[3]
(There are nicer ways to do all of this than the way I drew, but they all end up working the same way. I think it is a good idea to understand how and why this works first, before you get to the nicer-ways-to-combine-commits and clean up history. Note that history, in any Git repository, is the set of commits in that repository, as found by the names that let you find the commits, and then working backwards from commit to earlier commit.)
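One way to reproduce the drawing, as a sketch: it uses git merge --squash to build the combined commit, which is my choice for illustration rather than the only way to do it. The repository is a throwaway one, and secret.txt and app.txt are made-up file names.

```shell
set -e   # stop at the first failing command (run this as a script)

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch "main"
git config user.name "Example User"        # made-up identity
git config user.email "user@example.com"

# Commit H on main.
echo base > app.txt
git add app.txt
git commit -q -m "H"

# Branch oops: I adds real work plus a mistaken file, J removes the file.
git checkout -q -b oops
echo feature >> app.txt
echo password > secret.txt
git add app.txt secret.txt
git commit -q -m "I: feature work, plus a file added by mistake"
git rm -q secret.txt
git commit -q -m "J: remove the mistaken file"

# Branch fixed grows from main and gets the net effect of I+J as one commit.
git checkout -q -b fixed main
git merge -q --squash oops
git commit -q -m "IJ: feature work"

# Drop the name; I and J are now findable only through the reflog,
# until Git eventually expires them.
git branch -q -D oops

# No commit findable by any name touches secret.txt.
git --no-pager log --oneline --all -- secret.txt    # prints nothing
```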
[3] Exactly when this happens is very hard to predict. There are some maintenance / janitorial commands that you can use to force it to happen faster, on the rare occasion that this is important.

-
Thank you for taking the time to write such a comprehensive, clearly-worded, and logical explanation of what Git is and how it works. It has helped with a few bits of my understanding that were a bit hazy. – alfavictor Oct 02 '21 at 12:48
-
The files I wanted to remove weren't just in my work-tree, they were in Git, as they'd been (wrongly) added and committed earlier. It was too late to simply remove them from the index with git rm files. What I've done is similar to what @torek suggested: created a new branch in origin, checked it out in my local repo, deleted the files, added the changes to the index, committed, pushed to the remote. From there, pulled the commit and merged to master. Then checked out the local master and pulled from the remote master. The files are gone, working tree is clean, and everything is sync'd. – alfavictor Oct 02 '21 at 12:56