why is github showing all my previous commits in every new repository i push to?

Question

I'm not a GitHub pro, which is why I need some clarity. Each time I create a new repo on GitHub and push new content to it, I discover that all the previous commits from my other repositories are also added to the new repo. This makes all my repos look alike except for one or two folders, which doesn't look good. When I add a new folder inside my repo folder on my local machine (my laptop), I navigate into the repo folder, then into the new folder, and run git add followed by git status, and I see that nothing has been staged. However, when I go back up to the repo folder and run git add . it adds all the content of the new folder, which sounds good, but when I commit and push, all the previous commits end up in the new repo as well.

No, I didn't. So when I add a new folder to my main folder (repo) on my local machine, I should run git init, right? And after I run that command, what next? — urchin55, Sep 05 '22 at 07:47
It looks like you ran `git init` in a higher directory (your home directory, perhaps?) and you are versioning all your projects together in the same, single Git repository. — LeGEC, Sep 05 '22 at 08:08
Creating a fresh new repo is as easy as running `git init` in the correct directory. This will, however, not keep the history of that project (if there is any). Do you need to separate the history of several of your existing projects? — LeGEC, Sep 05 '22 at 08:10

score 0 · Answer 1 · answered Sep 05 '22 at 09:06

This is a Git issue rather than a GitHub one, and it stems from not being correctly taught how to use Git in the first place. (Unless you're entirely self-taught, this is not your fault, or not entirely your fault. If you are learning on your own, be aware that a whole lot of Git tutorials are anywhere from not-very-good to downright bad.)

Git is mainly about commits. A Git repository is, first and foremost, a collection of commits. As such, you need to know what a commit is and does for you, how you make a new repository, how Git finds a repository, and, well, a great deal more: Git can get pretty darn complicated. But let's start simple:

A Git repository consists primarily of a pair of databases. One database holds commits and other supporting objects, and the other database holds names such as branch and tag names. We'll get to the reason for the second database in a moment.
A useful Git repository—the kind you use on your laptop, for instance—also comes with a working tree, which we'll get to as well. (The kind of Git repository found on servers such as GitHub or Bitbucket is a "bare" repository, which simply omits the working tree so that nobody can work on it: this is useful because it allows git push to work to send to that repository. But that's not what you want on your laptop. You want a normal, non-bare, repository.)

Git puts these databases, plus a bunch of auxiliary files that Git needs to operate, into a hidden directory or folder (these terms are now interchangeable, so use whichever one you prefer). This hidden folder is named .git and, in most normal repositories, this hidden .git folder is found at the top level of the working tree. This is in fact how Git figures out what your working tree is.

Suppose, for instance, that your home directory (folder) is /home/bob (Linux) or /Users/bob (macOS) or whatever. Inside this folder, you probably have many sub-folders. I like to organize my sources within a src directory for instance, and some subsystems like Go require (or used to require) particular paths as well. So you might:

cd /home/bob/src/fun
mkdir experiment1
cd experiment1

to create a new for-fun experimental source project. Within this you'll have frontend/ and backend/ sub-folders, perhaps, for a web server toy of some sort, or whatever it is that you plan to experiment with. If you'd like to have one Git repository to contain all of these, you would now run:

git init

to create a new, totally-empty repository (two empty databases) where you are now. This will create /home/bob/src/fun/experiment1/.git/ and fill in a bunch of files to hold the various databases and other stuff.

You're now free to create sub-folders and cd into them. Let's say you make py as a Python-code sub-folder here, and cd into it. When you run Git commands, they'll look for the hidden .git here, and not find it. So they'll climb up one path element, from /home/bob/src/fun/experiment1/py/ to /home/bob/src/fun/experiment1/ and look for the hidden .git here. This time they find it so your working tree is rooted at /home/bob/src/fun/experiment1/. The repository proper is in /home/bob/src/fun/experiment1/.git.
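To see which repository Git finds from wherever you happen to be, you can ask it directly with git rev-parse --show-toplevel. The sketch below is only an illustration: it builds the experiment1/py layout inside a throwaway directory made by mktemp (that location is my assumption, not part of the example above). If this command ever prints a directory higher up than you expected, such as your home directory, there is probably a stray .git sitting there, which would match the symptoms in the question.

```shell
#!/bin/sh
set -e

# Throwaway parent directory so nothing real is touched (location is arbitrary).
base=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$base/experiment1/py"

cd "$base/experiment1"
git init -q                      # creates $base/experiment1/.git

cd py                            # no .git in here, so Git climbs up one level
top=$(git rev-parse --show-toplevel)   # root of the working tree Git found

echo "current directory: $(pwd -P)"
echo "working tree root: $top"
```

Running the same git rev-parse --show-toplevel command inside each of your real project folders is a quick way to check whether they all resolve to one shared repository.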

The first commit is weird and a bit special

At this point, you still have no commits at all. A repository with no commits cannot have any branches. Fortunately Git doesn't need a branch to operate, but you really want to have a branch name, so Git sets one up anyway, even though you don't have it. If you run git status at this point, Git will tell you that you are on branch master or on branch main or whatever, even though that branch doesn't exist!

You can change the name of the branch that doesn't exist, but that you're on, with git init --initial-branch=main for instance, but since it doesn't actually exist anyway, it's not really very important. Still, you'll want to make a first commit pretty quickly, so that the branch name starts existing. I like to create an initial README file or similar and commit that:

$ echo playing around with the foo language > README
$ git add README
$ git commit -m "initial commit"

(the initial commit message here is dull and template-y as the initial commit itself is dull and template-y).
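Here is a small sketch of the "branch that doesn't exist yet" oddity, run in a throwaway directory. The user name and email are placeholders, and I use git symbolic-ref to pick the name main because it also works on Git versions that lack --initial-branch; treat both as assumptions for the demo.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)                # throwaway repository location
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # pick the not-yet-existing branch name
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

on_branch=$(git symbolic-ref --short HEAD) # the name HEAD is attached to
before=$(git branch --list)                # empty: no branch exists yet
echo "on branch '$on_branch', existing branches: '$before'"

echo "playing around with the foo language" > README
git add README
git commit -q -m "initial commit"

after=$(git branch --list)                 # the branch now exists
echo "existing branches after first commit: '$after'"
```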

Having this initial commit, so that you have a branch whose name you can now change, is nice. But if you intend to use GitHub or some other web hosting site, you can choose to skip all of this work by using their web site to create a repository with one initial commit in it, using the clicky web buttons or whatever. If you do that you can then git clone the initial single-commit repository to your laptop, skipping the whole git init-and-create-one-commit steps on your side. The git clone operation will set up the remote named origin for you and copy the one initial commit, putting all of these into a new repository made by making an empty directory—that same mkdir experiment1 above—and then running git init in it for you.

It is your choice as to whether to make your own initial commit, or let GitHub or whatever make one for you, but pick one or the other: don't do both! If you do both, you'll get two initial, slightly-weird-and-special commits, and this makes trouble later. (It's not insurmountable but it's an annoyance that you can skip if you just pick one.)

If you do run git init yourself and make your own initial commit, remember to create your GitHub-or-whatever repository with no initial commit, and do the git remote add origin step yourself.
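The do-it-yourself route might look like the sketch below. To keep it runnable without a network, a local bare repository stands in for the empty GitHub repository; with a real hosting site you would pass its https or ssh URL to git remote add instead. The paths, user name, and email are all made up for the demo.

```shell
#!/bin/sh
set -e

base=$(mktemp -d)

# Stand-in for the empty repository created on the hosting site
# (assumption: a real one would be an https:// or ssh URL, not a local path).
git init -q --bare "$base/remote.git"

mkdir "$base/experiment1"
cd "$base/experiment1"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "playing around with the foo language" > README
git add README
git commit -q -m "initial commit"

git remote add origin "$base/remote.git"   # the step git clone would have done for you
git push -q -u origin main

local_tip=$(git rev-parse main)
remote_tip=$(git -C "$base/remote.git" rev-parse refs/heads/main)
commit_count=$(git rev-list --count main)
echo "local:  $local_tip"
echo "remote: $remote_tip"
echo "commits pushed: $commit_count"
```

Because the repository was created inside experiment1 itself, only that project's single commit gets pushed.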

Commits

This is what you need to know about commits:

Each commit is numbered. The numbers are huge, weird and random-looking as they're expressed in hexadecimal, and they are unique. Git generates them on its own (using a cryptographic checksum technique and the huge range of possible numbers to try to guarantee this uniqueness; Git doesn't work unless the numbers are unique).
Each commit stores two things: a full snapshot of every file and some metadata.
All this stuff is read-only. (That's required to make the numbering scheme work.)

There's more to learn, but we'll stop here for now.

The full snapshot is stored (indirectly) via other non-commit objects, in that big objects database, along with the commit. The files in this full snapshot have names such as py/main.py, complete with embedded (forward) slashes (even on Windows): there are no folders, just these file names. Git knows about folders and will make them as needed on your computer so as to be able to extract the committed files later, but Git doesn't store the folders (which is why you can't store an empty directory). The files are kept in a special, compressed and (importantly) de-duplicated format, so that the Git repository doesn't become grossly fat even though every commit stores every file.
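You can look at the file names stored in a snapshot with git ls-tree. In this sketch (throwaway directory, placeholder identity, and a made-up empty-dir folder) the snapshot lists py/main.py as one slash-containing name, and the empty folder does not appear at all.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

mkdir py empty-dir                         # empty-dir is hypothetical, for contrast
echo 'print("hello")' > py/main.py
git add .                                  # empty-dir holds no files, so nothing is staged for it
git commit -q -m "initial commit"

names=$(git ls-tree -r --name-only HEAD)   # file names stored in the snapshot
echo "$names"
```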

The metadata in a commit holds information about the commit itself. This includes your name and email address, for instance, and some date-and-time-stamps and so on. Crucially for Git's own operation, each commit's metadata holds the raw hash ID of some earlier commit(s). Most commits hold exactly one such hash ID: these are ordinary commits. Some commits are merge commits and hold two (or, technically, two-or-more) previous-commit hash IDs.

At least one commit in any non-empty repository is special: its list of previous commit hash IDs is empty. That's the weird, slightly-special one we made as a sort of template initial commit, just to get it out of the way. That commit will ultimately be on every branch, as we'll see. Git calls this commit a (or the) root commit.
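If you want to see this metadata for yourself, git cat-file -p prints a commit in raw form. The sketch below makes two commits in a throwaway repository (file name and identity are placeholders) and then checks that the second one names the first as its parent, while the first one has no parent line at all.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first commit"
echo two >> file.txt
git add file.txt
git commit -q -m "second commit"

git cat-file -p HEAD                       # raw commit: tree, parent, author, committer, message

root=$(git rev-list --max-parents=0 HEAD)  # commits whose parent list is empty
parent_of_head=$(git rev-parse HEAD^)      # the hash ID stored in HEAD's metadata
root_parents=$(git cat-file -p "$root" | grep -c '^parent ' || true)
echo "root commit:    $root"
echo "parent of HEAD: $parent_of_head"
echo "parent lines in root commit: $root_parents"
```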

Git stores all of this stuff—the commit object, and all of its supporting objects—in the big all-objects database that makes up the bulk of almost all repositories.¹ This all-objects database is a simple key-value store with the hash IDs as the keys, so Git needs the hash ID to find a commit quickly. Hence we're going to have to memorize some hash IDs, although very soon, we'll see how we can avoid that.

Because each commit (except a root commit) holds at least one previous commit hash ID—and most hold exactly one—we only need to memorize the latest commit hash ID. Suppose we have a repository that has about eight commits in it. We'll assign uppercase letters to stand in for hash IDs, such as H for Hash, and draw one such commit like this:

<-H

That little arrow sticking out of the commit here, in this drawing, represents the stored hash ID in the metadata for commit H. That is, given the hash ID for H, Git can fish commit H out of the objects database (and quickly) and find inside it the hash ID of the previous commit. Let's call that commit G and draw it in:

        <-G <-H

Commit G, like H, is an ordinary commit, so it stores the hash ID of a still-earlier commit. Git can fish out this hash ID and use it to find that still-earlier commit:

... <-F <-G <-H

This goes on forever, or rather, until Git hits the very first commit ever: commit A, presumably. That's our root commit and it allows Git to stop working backwards.

Note that Git works backwards, starting from the last commit. We need to somehow memorize the hash ID of the latest commit. Hash IDs are ugly and impossible for humans to memorize, so what can we do here?

¹If you make a repository and put just a few tiny commits in it, then create millions of branch and tag names, the bulk of the repository consists of the names, instead of the commits. That's not at all normal but is the reason to weasel-word the statement for which this is a footnote.

Branch names memorize hash IDs for us

Instead of us memorizing the latest commit hash ID, why don't we store it in a file, or maybe even a simple key-value database? That's a great idea: let's have a database of names, like branch and tag names, and have each name store a hash ID.

If we define a branch name as "this thing holds the hash ID of the latest commit", and draw that in, we get:

...--F--G--H   <-- main

(assuming we're using branch name main). The name main holds H's hash ID, so main points to H, the way H points back to G, and G points back to F and so on. We get a little lazy about drawing the arrows from commit-to-commit, because they literally can't change. They're stored in the commit (in the metadata) and the commit is entirely read only. So if H points to G, this means H points to G forever. The arrows don't point forwards because we didn't know what H's hash ID would be when we made G. (The hash ID depends on everything, including those exact time stamps, and we don't know, when we make G, the time at which we'll make H in the future. We also don't know what G's ID will be, and the hash of H includes hashing the hash ID of G once it goes into H!)

If we have more than one branch name, each such name points to some commit. So if we want to make a new name now, such as develop or feature or whatever, we pick some existing commit—probably H since it's the latest—and make the name point there:

...--G--H   <-- develop, main

Now we need a way to remember which name we're using. We'll have Git attach the special name HEAD to exactly one branch name, like this:

...--G--H   <-- develop, main (HEAD)

This says that we are using commit H and we are doing so through the name main.

If we now run git switch develop or git checkout develop, we get this:

...--G--H   <-- develop (HEAD), main

We are still using commit H but now we are doing so through the name develop.
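The names database and the HEAD attachment can both be inspected from the command line. This is a rough sketch in a throwaway repository with a placeholder identity: git for-each-ref lists each branch name with the hash ID it holds, and git symbolic-ref shows which name HEAD is attached to. I use git checkout here since it exists in every Git version; git switch develop should behave the same on Git 2.23 or later.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo one > file.txt
git add file.txt
git commit -q -m "commit H"

git branch develop                         # new name pointing at the current commit
git for-each-ref --format='%(refname:short) %(objectname:short)' refs/heads

before_name=$(git symbolic-ref --short HEAD)
before_commit=$(git rev-parse HEAD)
git checkout -q develop                    # or: git switch develop
after_name=$(git symbolic-ref --short HEAD)
after_commit=$(git rev-parse HEAD)

echo "HEAD was attached to $before_name at $before_commit"
echo "HEAD is  attached to $after_name at $after_commit"
```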

Why does this matter? The answer has to do with what happens when we make a new commit. For now, note that every commit is on both branches. As soon as we made a new branch name, commit H was on two branches. Before, commit H was on one branch. Nothing changed in commit H at all (because nothing can) but the set of branches that contain it changed anyway!

This is a clue: branch names aren't actually very important—except in one way: they let us, and Git, find some commit. Git needs the hash ID, and the branch name holds the hash ID. And that's pretty much it, and also explains the next bit.

Making a new commit updates the current branch name

Let's say we have this now:

...--G--H   <-- develop (HEAD), main

and we make a new commit in the usual way (without worrying about what that "way" is). The new commit will be an ordinary commit, so it will have a parent. The parent of the new commit will be whatever commit we're using when we make the new commit. We're using H, so the parent of the new commit will be H: our new commit will point backwards to H.

The new commit will get a new, unpredictable (random-looking but not actually random) hash ID that is unique—no other Git repository anywhere in the universe can be using that ID before we get it, and none can use it after we get it²—but we'll just call it I, and draw it in:

...--G--H   <-- main
          \
           I   <-- develop (HEAD)

Having obtained I's hash ID (by writing commit I into the database), Git stuffs that new hash ID into the current branch name, so now develop, to which HEAD is attached, points to I.

Commit H remains on develop and on main, but new commit I is—currently anyway—only on branch develop. Only the name develop finds I. There are two ways to find H: using main, or using develop and then working backwards one hop.
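The same picture can be reproduced in a throwaway repository. In this sketch (placeholder identity and file name) the variables h and i stand in for commits H and I from the drawing: after committing on develop, main still holds H, develop holds I, and I's parent is H.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo one > file.txt
git add file.txt
git commit -q -m "commit H"

git checkout -q -b develop                 # create develop at H and attach HEAD to it
h=$(git rev-parse HEAD)

echo two >> file.txt
git add file.txt
git commit -q -m "commit I"                # Git writes I, then updates the current branch name

i=$(git rev-parse develop)
main_tip=$(git rev-parse main)
parent_of_i=$(git rev-parse develop^)
contains_i=$(git branch --contains "$i")   # branch names from which I is reachable

echo "main    -> $main_tip"
echo "develop -> $i (parent $parent_of_i)"
echo "branches containing I: $contains_i"
```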

²This is mathematically impossible to keep up forever (see the pigeonhole principle) and Git is doomed to fail someday. The size of the hash space puts that day far enough into the future that we can hope that we're all dead, and perhaps the entire universe has ended in an entropic heat death, before it occurs.

Your working tree, and Git's index / staging-area

I'm only going to touch on this lightly to keep things short (well, shorter anyway), but:

Checking out a commit, with git switch or git checkout, tells Git: "remove from my working tree all the files from the previous commit I was using, and put in place the files from the commit I'm switching to."

We need this because the files stored in every commit are frozen for all time, as a sort of permanent archive. They're in a format that only Git can read, and nothing—not even Git itself—can overwrite. To get any actual work done, we need the files in a format that every program can read and write. These "working" files go into a "working area", which Git calls your working tree.
Weirdly—or at least, it's weird if you are used to other earlier version control systems—at this time Git fills in an extra copy of each file, into what Git calls its index or staging area. This thing is so important, and/or so badly named originally, that it actually has three names: the third one is the cache, but nowadays that name is mostly seen in flags like git rm --cached.

The files in this extra in-between area—it sits between the frozen-for-all-time current commit and the working tree—are in the frozen format, pre-de-duplicated. But unlike the files in a commit, they can be replaced wholesale. This is what git add does.

When you run git add on a working tree file, Git reads the working tree file, compresses it into the internal format, and checks for duplicate content. If the content is already stored anywhere in any commit in the repository, Git re-uses that file at this point. If not, Git now has a prepared copy, ready to go into a new commit.

Either way, before you ran git add, the index held a copy of the file, pre-de-duplicated (hence just a reference to the same file that's in the commit), ready to be committed, and after you run git add, the index holds a copy of the file, pre-de-duplicated and ready to be committed.

In other words, at all times, the index holds all the files, ready to be committed. This is your proposed next snapshot. It starts out matching the snapshot from the commit you've checked out. As you run git add, you update the proposed next snapshot.
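One way to watch the index change is git ls-files --stage, which prints the internal object ID the index holds for each file. In this sketch (throwaway repository, placeholder identity and file name) the index entry matches the commit at first, stays the same when the working tree file is edited, and only changes once git add runs.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo one > file.txt
git add file.txt
git commit -q -m "initial commit"

# Output format of ls-files --stage: <mode> <object-id> <stage><TAB><name>
committed=$(git rev-parse HEAD:file.txt)                       # object ID in the commit
staged_before=$(git ls-files --stage file.txt | cut -d' ' -f2) # object ID in the index

echo two >> file.txt                       # edit the working tree copy only
staged_mid=$(git ls-files --stage file.txt | cut -d' ' -f2)

git add file.txt                           # replace the index copy
staged_after=$(git ls-files --stage file.txt | cut -d' ' -f2)
worktree_id=$(git hash-object file.txt)    # ID the working tree content hashes to

echo "commit:            $committed"
echo "index, at start:   $staged_before"
echo "index, after edit: $staged_mid"
echo "index, after add:  $staged_after"
```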

If you add an all-new file, Git compresses and checks for duplicate content as usual, and then writes a new name—this is where the forward-slash names come from—into the index, in a new index slot, instead of booting out the old prepared file. So you can add all-new names. You can also remove names from Git's index, using git rm. In fact, since git add means make the index copy match the working tree copy, you can use git add to remove index copies!³

Ultimately, you run git commit, and at this time Git packages up whatever is in its index right then and that's the snapshot for the new commit. Everything you do with your working tree is irrelevant unless you invoke git add or git rm or some similar Git command to make Git update its index. At the time you do run git commit, Git takes a snapshot of the staged files, carefully arranged in its index / staging-area to make the prettiest picture you can construct. So that's why you can call it the staging area.

³That is, suppose you check out a commit and it has a file named path/to/file. You remove that file from your working tree, but not from Git's index. Then you run git add path/to/file. Git sees that there's a copy in its index, but that the working-tree copy is gone. The git add command takes that to mean remove the index copy to match. I find this weird, almost even spooky, myself and never use it on purpose; I prefer to run git rm instead.
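The footnote's scenario can be tried out in a throwaway repository. This sketch assumes Git 2.0 or later, where git add on a path also stages removals; the identity is a placeholder.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
mkdir -p path/to
echo data > path/to/file
git add path/to/file
git commit -q -m "initial commit"

rm path/to/file                            # gone from the working tree, still in the index
in_index_before=$(git ls-files path/to/file)

git add path/to/file                       # index is made to match: the entry is removed
in_index_after=$(git ls-files path/to/file)
short_status=$(git status --short)

echo "index before git add: '$in_index_before'"
echo "index after git add:  '$in_index_after'"
echo "status: $short_status"
```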

A brief note on `git status`: use it often

The git status command:

prints out the name of the current branch: on branch main or whatever;
checks to see if the current branch has an upstream, and if so, will tell you about being ahead and/or behind (we won't cover this at all here); and then
runs two git diff operations.

The git diff operation compares two snapshots. You can use it on any two commits, and it will tell you what's different in those two commits. But you can also use it on Git's index / staging-area, and on your working tree, and that's how git status tells you about changes staged for commit and changes not staged for commit.

When you compare two actual, existing commits—such as G and H in our diagrams above—Git extracts both snapshots and looks to see which files are exactly the same and hence de-duplicated, and which ones are different. The exactly-the-same, de-duplicated files are boring and it doesn't mention those at all! It only mentions the files that don't match.

When using a full git diff, Git plays a game of Spot the Difference for each doesn't-quite-match pair of files, and prints out a recipe for changing the old or left-hand-side version of the file into the new or right-hand-side version. But you can ask it to just say which files and whether they're modified, or newly added or deleted.

The git status command uses this faster mode—the "just tell me which files are changed" mode—to compare the current commit and the proposed new snapshot. For each file that's changed (or new or deleted), git status tells you that file's name and says that it is staged for commit. It says nothing at all for files that match: those are boring!

Then, having printed out this list, git status does a second fast-mode diff to compare each index file to each working tree file. Here, for each file that's different, git status lists them out again, this time putting them under the not staged for commit section.

Note that it's possible to "stage a change" and then change the same file again:

$ git switch main
$ echo more data >> existing-file
$ git add existing-file
$ echo another change >> existing-file

Now all three copies are different, so git status will list existing-file as both staged for commit and not staged for commit. That's not how most people do most of their work most of the time, but if you get to advanced Git-ing, you can use git add -p to set this kind of thing up on purpose.
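To see the two comparisons separately, you can run the fast-mode diffs yourself: git diff --cached compares the current commit to the index, and plain git diff compares the index to the working tree. The sketch below repeats the commands above in a throwaway repository (placeholder identity) and then prints the short-format status, whose two-letter code carries one letter for each comparison.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo initial data > existing-file
git add existing-file
git commit -q -m "initial commit"

echo more data >> existing-file
git add existing-file                      # index copy now differs from the commit
echo another change >> existing-file       # working tree copy now differs from the index

staged=$(git diff --cached --name-status)  # current commit vs. index
unstaged=$(git diff --name-status)         # index vs. working tree
short_status=$(git status --short)         # first letter: staged, second: not staged

echo "staged:   $staged"
echo "unstaged: $unstaged"
echo "status:   $short_status"
```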

There's one last thing to touch on here, and that's the untracked files. Again, I'm going to omit all of the important details, but there is a simple definition here for "untracked file": An untracked file is a file that exists in your working tree right now but does not exist in Git's index right now. That's all it is. Since you can change what's in Git's index, you can change whether some file is tracked or untracked—but remember, switching from one commit to another commit also changes what's in Git's index, by filling in Git's index from the commit to which you're switching.
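That definition is easy to test. In this sketch (throwaway repository, placeholder identity, made-up file names) notes.txt is untracked simply because it was never added, and tracked.txt becomes untracked once git rm --cached removes it from the index while leaving the working tree file alone.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo kept > tracked.txt
git add tracked.txt
git commit -q -m "initial commit"

echo scratch > notes.txt                   # in the working tree, never added to the index
untracked=$(git ls-files --others)         # working tree files with no index entry

git rm -q --cached tracked.txt             # remove the index entry only; the file stays on disk
untracked_after=$(git ls-files --others)

echo "untracked at first:"
echo "$untracked"
echo "untracked after git rm --cached:"
echo "$untracked_after"
```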

This omits all kinds of special cases (see, e.g., Checkout another branch when there are uncommitted changes on the current branch), but it's plenty to start with.