I'm working with a friend on a personal project but we are both kinda new to git.

We already had our project (a game server), and my friend installed GitHub Desktop on the server.

He added all the files from our game server, made a commit, and then made some modifications that he also committed.

The thing is that the whole folder is almost 3 GB, and most of it (~2 GB) is big modeling files that we didn't really want to commit.

Nothing has been pushed to GitHub yet because GitHub won't let us (from what I've seen, it's because of the size of our project), so all the commits are local.

How can we delete all the commits without losing our changes, so we can ignore the big files, add only those we want to track, and finally push? Or is there a better way to do so?

Thanks.

nebra
  • Does this answer your question? [How to remove/delete a large file from commit history in Git repository?](https://stackoverflow.com/questions/2100907/how-to-remove-delete-a-large-file-from-commit-history-in-git-repository) – Jonathon Reinhart Jun 28 '20 at 13:41
  • It could, but I'm afraid it would delete some files that I want to keep. Maybe the easiest way would be to make a backup of our server, delete everything (Git files, repo, etc.) and restart from the beginning? – nebra Jun 28 '20 at 13:56

2 Answers

TL;DR

You want to use git reset --mixed, which is the default kind of reset (so you can just use git reset origin/master, without the --mixed flag). This moves your branch back, dropping the unwanted commit(s) from it, while leaving your work-tree undisturbed. You can then add the desired files more selectively and make new commits that you can git push correctly.

Be careful here! git reset --hard also resets your work-tree and you do not want to do that.
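As a rough sketch of the TL;DR in a throwaway repository: the script below makes one good commit and one over-eager commit, runs the default (mixed) reset, and then re-commits selectively. The file names, the models/ directory, and the use of HEAD~1 in place of origin/master are assumptions made for illustration, since a scratch repository has no remote.

```shell
set -eu
# Throwaway repository in a temp directory (all names here are illustrative).
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # name the first branch "master"
git config user.name "Example User"
git config user.email "user@example.com"

# A small first commit that we want to keep.
echo "port=1234" > server.cfg
git add server.cfg
git commit -q -m "initial config"

# The commit we regret: it drags in a large model file.
mkdir models
echo "pretend this is 2 GB" > models/big.model
echo "port=4321" > server.cfg
git add .
git commit -q -m "add everything"

# Default (mixed) reset: move the branch back one commit and reset the index,
# but leave the work-tree files exactly as they are.
git reset -q HEAD~1

# Ignore the big files, then stage only what should be tracked.
echo "models/" > .gitignore
git add .gitignore server.cfg
git commit -q -m "config change, ignore models"

# List what the new commit tracks.
git ls-files
```

After the reset, models/big.model is still on disk with its contents intact, but it is no longer part of any commit on the branch.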

Long

Technically, you can't directly delete a commit in the first place: what you want to do is have Git "forget" or explicitly abandon the commit, so that it can't find it by normal channels. The commit will remain in your repository for a while—30 days at least, with normal settings—in case you want it back. Then, at some indeterminate future time (or as soon as you force it by running git gc), Git will garbage collect the unreachable commit and get you the disk space back.
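To see that a reset only makes Git "forget" a commit rather than delete it, here is a small sketch in a scratch repository. The file name and commit messages are made up; the point is only that the object is still present right after the reset.

```shell
set -eu
# Throwaway repository in a temp directory (all names here are illustrative).
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first"
echo two > file.txt
git add file.txt
git commit -q -m "second"

# Remember the hash ID of the commit we are about to abandon.
abandoned="$(git rev-parse HEAD)"
git reset -q --hard HEAD~1

# The branch no longer reaches the second commit...
visible="$(git rev-list --count HEAD)"
# ...but the object is still in the repository's database.
kind="$(git cat-file -t "$abandoned")"

echo "commits reachable from HEAD: $visible"
echo "abandoned object type: $kind"
```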

Here's what else you need to know.

Git is all about commits

Git is a distributed version control system (D-VCS). Being a VCS, Git's purpose is to save every version of every file ever, or at least every important version of every file you've marked for saving. To do this, Git stores, as a commit, a full copy of every file you tell it that it should save, each time you run git commit.

To keep this from making your repository grow enormously fat ridiculously quickly, the saved files stored inside Git's commits are kept in a special, compressed, read-only, Git-only format, which immediately de-duplicates the files (Git can compress them even further if and when that's appropriate, while still maintaining this read-only nature). So if commit #1 has a thousand big files, and commit #2 reuses 999 of those 1000 files, commit #2 really just has 999 connections to the already-existing big files and one new big file. Since all files are completely read-only, it's always safe to re-use some existing file.[1]
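The de-duplication can be observed directly. In this sketch (file names are invented), two commits contain the same big.model, and both commits refer to the very same internal object for it, while the file that changed gets a new one.

```shell
set -eu
# Throwaway repository in a temp directory (all names here are illustrative).
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo "pretend this is a big model" > big.model
echo "v1" > notes.txt
git add .
git commit -q -m "first"

# Only notes.txt changes in the second commit.
echo "v2" > notes.txt
git add notes.txt
git commit -q -m "second"

# <commit>:<path> names the stored file object inside that commit.
big_old="$(git rev-parse HEAD~1:big.model)"
big_new="$(git rev-parse HEAD:big.model)"
notes_old="$(git rev-parse HEAD~1:notes.txt)"
notes_new="$(git rev-parse HEAD:notes.txt)"

echo "big.model:  $big_old -> $big_new"
echo "notes.txt:  $notes_old -> $notes_new"
```

The two big.model IDs are identical, so the second commit re-uses the stored file instead of saving it again.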

A commit, then, acts as a full snapshot of every file or, more precisely, every file you've told Git to track. We'll see more about tracked vs untracked in a moment. A commit has a bit more than just this snapshot, though: each commit also records some metadata, or information about the commit itself, such as who made it (name and email address), when (date-and-time stamp), and so on.

Every commit has a unique "name" of sorts, which is the commit's hash ID. In fact every Git internal object has its own hash ID, but the ones you normally see are just the commit objects. This hash ID is actually a cryptographic checksum of the contents of the internal object. Git guarantees that each commit is unique (in part by sticking those date-and-time stamps on them, so that even if you re-commit the same files later, the new commit has a different time), so that each hash ID is also unique.[2]
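Here is a hedged sketch of how the time stamp feeds into the hash ID. It uses the low-level git commit-tree command to build commit objects from the same snapshot; the helper function make_commit and the fixed time stamps are inventions for this demonstration.

```shell
set -eu
# Throwaway repository in a temp directory (all names here are illustrative).
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo hello > file.txt
git add file.txt
git commit -q -m "first"

# The snapshot (tree) stored by the commit we just made.
tree="$(git rev-parse 'HEAD^{tree}')"

# Hypothetical helper: build a commit object for $tree with a fixed time stamp.
# The body runs in a subshell so the exported dates do not leak out.
make_commit() (
  export GIT_AUTHOR_DATE="$1" GIT_COMMITTER_DATE="$1"
  git commit-tree "$tree" -m "same snapshot"
)

a="$(make_commit "1593345600 +0000")"
b="$(make_commit "1593345600 +0000")"   # identical content and metadata
c="$(make_commit "1593345601 +0000")"   # one second later

echo "$a"
echo "$b"
echo "$c"
```

Identical content and metadata give the same hash ID (a and b), while a one-second difference in the time stamp gives a different one (c).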

Git hash IDs, such as f402ea68166bd77f09b176c96005ac7f8886e14b—a commit in the Git repository for Git—are big ugly strings of letters and digits. Given a hash ID, Git can immediately look up the object in its big database of all its objects.

So, given a commit's ID, Git can tell right away whether you have that commit, and if so, fish out all of its files. And, each commit, when made, stores in its metadata the raw hash ID of its immediate predecessor commit. So, given a commit, Git can find the hash ID of the previous commit, and find that commit. Then it can use that commit to find another previous commit, and so on. This action, of walking backwards from commit to commit using the hash IDs, is how Git stores the history. The commits, in other words, are the history in the repository. Each commit holds a snapshot of all of the files, and by moving from one snapshot back to a previous one, Git can find the history.
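The backwards walk can be reproduced by hand. This sketch (with invented file names and messages) makes three commits and then follows the stored parent hash IDs from the latest commit until it reaches a commit with no parent. It assumes a simple linear history with no merges.

```shell
set -eu
# Throwaway repository in a temp directory (all names here are illustrative).
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

for n in 1 2 3; do
  echo "$n" > file.txt
  git add file.txt
  git commit -q -m "commit $n"
done

# %P prints the parent hash ID(s) recorded in a commit's metadata.
parent_of_head="$(git log -1 --format=%P HEAD)"

# Walk from the latest commit back to the first one, one parent at a time.
cur="$(git rev-parse HEAD)"
count=0
while [ -n "$cur" ]; do
  count=$((count + 1))
  git log -1 --format='%h %s' "$cur"
  cur="$(git log -1 --format=%P "$cur")"   # empty for the very first commit
done
echo "visited $count commits"
```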

The bad thing about hash IDs, of course, is that they are big and ugly and impossible for humans to remember or work with. So, in general, we don't work with these hash IDs.


[1] There's a bit of trickiness to removing a commit, as git gc does, here, since it might be sharing files with other commits, but of course Git does it correctly.

[2] The Pigeonhole Principle tells us that there must be some hash collisions, but the chance of any particular pair of commits colliding is currently 1 in 2^160, and will be 1 in 2^256 once Git moves to SHA-256. This is small enough that it never actually happens in practice, despite the Birthday Paradox.


Branch names

This is where branch names enter the picture. While Git could, in theory, get everything done with just hash IDs, this would be a system that no human could use. We'd have to memorize any important commit hash IDs. But why should we bother? We have a computer: why not have it remember the important hash IDs?

This is what names—including branch names—do: they remember hash IDs for us. We pick a branch name, such as master or develop, and tell Git: remember the latest commit. How that works is really simple, but also very important to understand. Suppose we have a string of commits, with their big ugly hash IDs represented here by simple uppercase letters:

... <-F <-G <-H

Here H is the latest commit. As we noted earlier, each commit holds the hash ID of the previous commit, so H holds G's hash ID. We say that H points to G. G, of course, comes after F, so G holds F's hash ID and therefore points to F. F in turn points back to some earlier commit, and so on.

If H is the latest commit on master, the name master holds the hash ID of—or points to—commit H:

...--F--G--H   <-- master

When we make a new commit, Git will make the new commit point back to the current commit H, and update the name master:

...--F--G--H--I   <-- master

This way, master always points to the latest commit.
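A small sketch of the name moving along: git rev-parse master prints the hash ID the name currently holds, and that value changes when a new commit is made. The file name and messages are placeholders.

```shell
set -eu
# Throwaway repository in a temp directory (all names here are illustrative).
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo 1 > file.txt
git add file.txt
git commit -q -m "H"
before="$(git rev-parse master)"   # master points to commit H

echo 2 > file.txt
git add file.txt
git commit -q -m "I"
after="$(git rev-parse master)"    # master now points to the new commit I
parent="$(git rev-parse master~1)" # and I points back to H

echo "master before: $before"
echo "master after:  $after"
```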

If you have more than one branch name, each branch name points to the latest commit for that branch:

...--F--G--H   <-- master
            \
             I--J   <-- develop

Now it's hard to tell which branch name (and commit) we're using—so Git uses the special name HEAD, in all uppercase like this. We attach this name HEAD to one of the branch names:

...--F--G--H   <-- master (HEAD)
            \
             I--J   <-- develop

This tells us (and Git) that we're using the name master and hence commit H. If we run git checkout develop or git switch develop, this changes to:

...--F--G--H   <-- master
            \
             I--J   <-- develop (HEAD)

which tells us (and Git) that we're using the name develop and hence commit J.
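To check which branch name HEAD is attached to, git symbolic-ref can be used. In this sketch the branch names master and develop mirror the diagrams above, while everything else is invented.

```shell
set -eu
# Throwaway repository in a temp directory (all names here are illustrative).
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo 1 > file.txt
git add file.txt
git commit -q -m "H"
git branch develop                 # a second name for the same commit

head_before="$(git symbolic-ref --short HEAD)"
git checkout -q develop            # git switch -q develop in Git 2.23 or later
head_after="$(git symbolic-ref --short HEAD)"

echo "HEAD was attached to: $head_before"
echo "HEAD is attached to:  $head_after"
```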

Note that the commits and their files never change. They literally cannot change: the commits and their files are completely read-only. The files are in a special, frozen, Git-only, de-duplicated format, and the commits' names—their hash IDs—are constructed from the data in the commit, so if you were to somehow change anything inside a commit, you'd get a different commit, with a different ID.

We can add new commits, and we can move the branch names around—we'll see how this works in a moment. But we cannot change existing commits. We can, of course, extract any existing commit, in order to see it and to work on it. Importantly, both we and Git find existing commits using these names, then working backwards.

The index and your work-tree

To do new work in Git, we start by picking some existing commit to check out (git checkout) or, in Git 2.23 or later, switch to (git switch). For this purpose, checkout and switch do the same thing. It's just that git checkout does too much, so in Git 2.23, they split it into two new commands, git switch and git restore (while still keeping the old command). Some uses of checkout do the git switch operation and some do the git restore operation. We won't look at the git restore subset of git checkout here, so you can use git checkout and git switch interchangeably.

Generally, we do this by picking a branch name. (If we need to use a historical commit we can ask Git to go into detached HEAD mode, but we'll ignore this particular trick here.) We tell Git to switch to using that branch name, and its commit, and Git extracts the files from that commit. It de-compresses them and turns them from special Git frozen files into regular, everyday files that we, and the computer, can use in their regular, everyday fashion.

The branch we choose, and its commit, become the current branch and the current commit. The files from that commit are now here in our work-tree, as ordinary files. We can do whatever we want with them, using all the commands the computer has. We can change them, rename them, remove them, and add new files. They're all just ordinary files, after all. The perhaps surprising part here is that when we go to make a new commit, Git does not use these files at all!

When Git first extracts the files from the commit, to set up the work-tree, Git makes a copy[3] of each file in what Git calls, variously, the index, or the staging area, or (relatively rarely now) the cache. This copy is what Git will use for the next commit. It's already in the frozen and de-duplicated format, ready to go, so that makes git commit very fast: it does not have to waste time re-freezing each file. But this does mean that every time you change a file, you need to tell Git to copy it back into Git's index.

To copy a file into Git's index, you use git add. This reads the work-tree copy, and turns it into the frozen-format de-duplicated version that goes in the index. If the file was already in the index, that boots out the old one and puts the new one in. If the file wasn't in the index before, now it is. Either way, the file is now in the index, ready to go into the next commit.
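The effect of git add on the index can be watched with a short sketch. Here :file.txt is Git's notation for "the index copy of file.txt"; the file name and contents are made up.

```shell
set -eu
# Throwaway repository in a temp directory (all names here are illustrative).
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo v1 > file.txt
git add file.txt
git commit -q -m "first"

# Change the work-tree copy only.
echo v2 > file.txt
staged_before="$(git diff --cached --name-only)"   # index still matches the commit
blob_before="$(git rev-parse :file.txt)"

# Copy the new content into the index.
git add file.txt
staged_after="$(git diff --cached --name-only)"    # index now differs from the commit
blob_after="$(git rev-parse :file.txt)"

echo "staged before add: '$staged_before'"
echo "staged after add:  '$staged_after'"
```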

What the index or staging area represents, then, is the next commit you are planning to make. You update it by copying files into it, or—with git rm or git rm --cached—removing files from it. The git rm command removes the index and work-tree copies of some file. With --cached, git rm removes the index copy, but leaves the work-tree copy alone.
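The difference between the two forms of git rm, sketched with two made-up files: one is removed from the index only, the other from both the index and the work-tree.

```shell
set -eu
# Throwaway repository in a temp directory (all names here are illustrative).
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo a > on-disk-only.txt
echo b > gone.txt
git add .
git commit -q -m "first"

git rm -q --cached on-disk-only.txt   # leaves the work-tree copy alone
git rm -q gone.txt                    # removes the work-tree copy too

echo "still in the index:"
git ls-files
echo "still on disk:"
ls
```

Both files are now out of the index (and so out of the next commit), but only gone.txt has disappeared from disk.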

Note that both the index copy of each file, and the work-tree copy, are temporary, as far as Git is concerned. Git stores the committed copies forever. Your work-tree is just yours, though (files here only last as long as you keep them around), and the index copies are there to go into the next commit but aren't saved-for-all-time[4] like commits. And of course, when you use git checkout to extract some commit, you'll have Git overwrite both Git's index (to contain frozen files from the checked-out commit) and your work-tree (to contain the ordinary files extracted from the checked-out commit).[5]


[3] Technically, what's in the index is not a copy, but rather a reference to the de-duplicated internal blob object. Git manages this invisibly and correctly so that you don't really have to care about this: you can just think of the index as holding a copy of each file. If you start using git ls-files --stage and git update-index, then you need to know about Git's internal blob objects, but for everyday Git use, the distinction is not important.

[4] Well, saved for as long as the commit itself lives, anyway.

[5] There are some safety checks here, to make sure you don't lose uncommitted work. This is where git checkout kind of falls down though: git switch has the safety checks, and the git switch style of checkout has the safety checks, but the git restore style of git checkout omits the safety checks. So with the old git checkout command, you can easily accidentally clobber unsaved work. The split into two commands was a good idea.


How Git makes a new commit

When you run git commit, Git simply packages up whatever is in the index right then, on the spot. Those are the files that will be in the new commit.[6] Git collects the metadata it needs: your name and email address, "now" for the date-and-time, and a log message. Git uses the current commit as the parent for the new commit, and creates the new commit. If we have:

...--G--H   <-- master (HEAD)

then the parent of the new commit is hash H, and the new commit—we'll call it I again here—will point back to H. Then Git will write I's hash ID into the name master and we'll have:

...--G--H--I   <-- master (HEAD)

Since the new commit contains exactly those files that were in the index, the index and the new commit now match. So the current commit is now I, and the index matches I. If the index also matches your work-tree—if you git added all your changed files—then all three match, as if you had just checked out commit I, too.
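A sketch showing that git commit takes the index copy and not the work-tree copy: the file is staged with one content, edited again without git add, and then committed. The names and contents are invented.

```shell
set -eu
# Throwaway repository in a temp directory (all names here are illustrative).
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo v1 > file.txt
git add file.txt
git commit -q -m "first"

echo v2 > file.txt
git add file.txt          # the index now holds v2
echo v3 > file.txt        # the work-tree moves on to v3, without git add
git commit -q -m "second"

committed="$(git show HEAD:file.txt)"   # what the new commit stored
on_disk="$(cat file.txt)"               # what is in the work-tree

echo "committed: $committed"
echo "on disk:   $on_disk"
```

The commit holds v2 (the staged content) while the work-tree still has v3, so here the index and the work-tree do not match after the commit.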


[6] You can alter this behavior somewhat with arguments to git commit, and if the index exactly matches the current commit, Git won't let you make the commit unless you add --allow-empty, but this suffices to cover most normal uses.


Tracked and untracked files and .gitignore

This leads us to the definition of a tracked file, too. A tracked file is simply a file that is in the index right now. An untracked file is a file that is in your work-tree, but is not in the index.[7]

The git status command—which is a very useful command—will whine about untracked files. To make it shut up about them, you can list them, or their file name patterns like *.obj or tempfiles/*, in a .gitignore file. This file doesn't actually force the file not to be tracked! If the file is tracked, the presence of its name in a .gitignore is irrelevant. Only if the file is not tracked does the .gitignore take effect.

The effect of the ignore entry is two-fold: it shuts up git status, and it prevents git add from copying the file into the index.[8] So in a sense, this should be called .git-do-not-complain-about-these-untracked-files-and-do-not-auto-add-them-either, or something along these lines. But that's a ridiculous name, so .gitignore it is.
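The following sketch illustrates that a .gitignore entry has no effect on a file that is already tracked, and that removing the file from the index is what lets the entry take over. The *.model pattern and file names are assumptions chosen to resemble the question.

```shell
set -eu
# Throwaway repository in a temp directory (all names here are illustrative).
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo data > big.model
git add big.model
git commit -q -m "oops, tracked the model"

# Listing the file in .gitignore does not untrack it:
echo "*.model" > .gitignore
echo more >> big.model
status_tracked="$(git status --porcelain big.model)"   # still reported as modified
echo "while tracked: '$status_tracked'"

# Take it out of the index (the work-tree copy stays), then commit that.
git rm -q --cached big.model
git add .gitignore
git commit -q -m "stop tracking model files"

status_after="$(git status --porcelain)"   # nothing left to report
echo "after untracking: '$status_after'"
```

Note that the earlier commit still contains big.model; this only stops new commits from including it.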


[7] What about a file that isn't in the index and isn't in your work-tree? For instance, if you've never had a file named gronk and you don't have one now, what is the status of that file? Think about this as a sort of philosophical question to start with, but then ask yourself about a file that you put in one commit, then removed in the next one. That file isn't in the current commit, and is not here now, but it is in that commit. If you check out that historical commit, where will the file be? If you then check out the latest commit, the file will be gone again. Where did the file go? Was it tracked? When is it tracked?

Note that when you switch from a commit that has the file gronk to one that doesn't, Git has to remove the file from both the index and your work-tree, so that it's gone in the current commit and won't be in the next commit. When you switch from a commit (and index and work-tree) that don't have gronk to one that does, Git has to create the file in the index and work-tree. If you grab gronk from an old commit, without putting it into the index—or with git rm --cached to take it out—it becomes untracked. So the tracked-ness of a file can change over time and from one commit to another.

[8] It actually has a third effect, which is not one you see very often: it gives Git permission to clobber (overwrite or remove) the file in certain cases where git checkout or git switch would otherwise stop and tell you that the file is in the way. It's hard to describe this corner case correctly though.


git reset

The git reset command is big and complicated, and I kind of wish the Git folks would split it up like they did with git checkout -> git switch + git restore. The particular subset of reset we will look at here, though, is the one that moves the current branch name.

Suppose we have this situation:

...--F--G--H   <-- master
            \
             I--J   <-- develop
                 \
                  K--L   <-- feature (HEAD)

Here, the "stable" master branch identifies commit H. The name develop, where we're working on some new version, identifies commit J, and the current branch feature identifies commit L.

We can, at any time, tell Git to move any of these branch names to any of the various existing commits. We probably shouldn't move master if it's meant to be stable, and maybe we don't want to move develop either for some reason, but if we just made commit L just now on feature, and it turns out that commit L is terrible, maybe we'd like to just get rid of it.

To move some branch name that isn't the current one, we can use git branch -f. But for a good reason—namely, the presence of the index and our work-tree—git branch won't move the branch we're using. To do that, we have to use git reset.
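A sketch of that distinction, using a simpler two-branch history than the diagram above (the commits and file name are invented): git branch -f moves a name that is not checked out, but refuses to move the current one.

```shell
set -eu
# Throwaway repository in a temp directory (all names here are illustrative).
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo 1 > file.txt; git add file.txt; git commit -q -m "A"
echo 2 > file.txt; git add file.txt; git commit -q -m "B"
git checkout -q -b feature
echo 3 > file.txt; git add file.txt; git commit -q -m "C"

# master is not the current branch, so git branch -f may move it (here, to C).
git branch -f master feature

# feature is the current branch, so git branch -f refuses to move it.
if git branch -f feature master~1 2>/dev/null; then
  moved_current=yes
else
  moved_current=no
fi
echo "moved the current branch with git branch -f: $moved_current"
```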

This kind of git reset takes two things:

  1. It needs the name for the commit we want to move to. This can be a raw hash ID, or a commit relative to the current commit, or another branch name. In fact, it can be any of the things listed in the gitrevisions documentation. Generally, though, you might run git log and cut and paste the raw hash ID of the commit you want.

  2. It takes an optional flag: one of --soft, --mixed, or --hard. This tells it how far to go.

What this kind of git reset does is find the commit you specify and force the current branch name to point to that particular commit. Then, optionally, it updates the index (only) or the index and your work-tree (both).

For instance, suppose we run:

git reset --hard HEAD~1

This ~1 suffix is a relative commit operation: it means go back to the parent of the current commit. So if we have:

...--F--G--H   <-- master
            \
             I--J   <-- develop
                 \
                  K--L   <-- feature (HEAD)

then HEAD~1 means commit K, as that's one step back from L. So this tells Git: make the name feature point to commit K. The result is:

...--F--G--H   <-- master
            \
             I--J   <-- develop
                 \
                  K   <-- feature (HEAD)
                   \
                    L   [abandoned]

Commit L still exists (it's still in the big database of "all Git commits"), but it no longer has a name by which to find it. If we can't find it, we don't see it, so it seems as though commit L is gone already.[9] The git log command will now start at commit K, then work backwards to J, then I, then H, and so on.

If we decide both commits K and L are bad, we can run:

git reset --hard develop

which will get us:

...--F--G--H   <-- master
            \
             I--J   <-- develop, feature (HEAD)
                 \
                  K--L   [abandoned]

We "lose" both commits this time because we made the name feature point to commit J, the same as the name develop.

The --hard here tells git reset just how far to go. When we do this git reset, Git executes one, two, or all three of the following steps:

  1. Move the current branch name. We pick a commit, like HEAD~2, the raw hash ID of commit J, or the name develop to pick commit J, and Git moves the current branch so that it identifies that commit.

    If we said --soft, Git stops here. The index remains untouched.

  2. Replace the index's contents with those of the selected commit. Remember, the index represents the next commit to make. It usually matches the current commit, unless we've run git add or git rm or both to update it. By making the index match the commit we just moved to, we restore this state: the index will match the current commit.

    If we use --soft and so skip this step, the index will be unchanged from whatever it had in it before the git reset. If it matched commit L before, it will still match commit L.

    If we use --mixed, git reset stops here: the branch name has moved and the index is reset, but our work-tree is untouched. This is also the default, so if we don't use --hard, git reset stops here.

  3. Modify the work-tree to match the selected commit. This will remove tracked files that need to be removed (because of the index resetting in step 2), and replace work-tree files that get replaced in the index. The untracked files will be left untouched (including any untracked-and-ignored files), but the rest of the files will be replaced if needed, so as to match the commit we just moved to.

    Git only does step 3 if we use --hard.
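The three steps can be compared side by side. This sketch builds the same two-commit scratch repository three times and resets it with each flag, then prints what the index and the work-tree hold afterwards. The helper functions make_repo and report are hypothetical names introduced only for this demonstration.

```shell
set -eu

# Hypothetical helper: create a scratch repository with two commits,
# where file.txt is "v1" in the first and "v2" in the second.
make_repo() {
  dir="$(mktemp -d)"
  git init -q "$dir"
  git -C "$dir" symbolic-ref HEAD refs/heads/master
  git -C "$dir" config user.name "Example User"
  git -C "$dir" config user.email "user@example.com"
  echo v1 > "$dir/file.txt"
  git -C "$dir" add file.txt
  git -C "$dir" commit -q -m "first"
  echo v2 > "$dir/file.txt"
  git -C "$dir" add file.txt
  git -C "$dir" commit -q -m "second"
  echo "$dir"
}

# Hypothetical helper: reset back one commit with the given flag and
# print "<index content> <work-tree content>" for file.txt.
report() {
  dir="$(make_repo)"
  git -C "$dir" reset -q "$1" HEAD~1
  echo "$(git -C "$dir" show :file.txt) $(cat "$dir/file.txt")"
}

soft="$(report --soft)"     # step 1 only
mixed="$(report --mixed)"   # steps 1 and 2
hard="$(report --hard)"     # steps 1, 2 and 3

echo "--soft:  index/work-tree = $soft"
echo "--mixed: index/work-tree = $mixed"
echo "--hard:  index/work-tree = $hard"
```

In all three cases the branch name moves back to the first commit. Only the index and work-tree columns differ: v2 v2 for --soft, v1 v2 for --mixed, and v1 v1 for --hard.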

Hence, in your particular case, you want to use git reset, probably with --mixed (or the default), to undo the commits you made but have not yet pushed. Some origin/ name—origin/master, origin/develop, or whatever—will identify the commit you want to move your own branch back to. By leaving your work-tree undisturbed, but resetting Git's index, you'll now be able to carefully git add each file, one at a time or en masse but without over-adding. Then you will be ready for your git commit.


[9] The way we can find it, if we want it back, is to use Git's reflogs. Reflog entries eventually expire, and the one that finds commit L will expire after 30 days, which is why, some time after those 30 days, commit L will go away, whenever git gc gets around to it. Running git reflog will dump out the reflog entries for HEAD, and git reflog name will dump out the reflog entries for the given name. Normally, though, we don't see these commits.
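A sketch of getting an abandoned commit back through the reflog, in a scratch repository with invented names. HEAD@{1} is the reflog notation for "where HEAD was one move ago", which right after the reset is the abandoned commit.

```shell
set -eu
# Throwaway repository in a temp directory (all names here are illustrative).
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo one > file.txt; git add file.txt; git commit -q -m "K"
echo two > file.txt; git add file.txt; git commit -q -m "L"
lost="$(git rev-parse HEAD)"

# Abandon commit L, including its work-tree changes.
git reset -q --hard HEAD~1

# The reflog still remembers where HEAD used to be.
git --no-pager reflog
recovered="$(git rev-parse 'HEAD@{1}')"

# Point the branch back at the abandoned commit.
git reset -q --hard "$recovered"
echo "file.txt is back to: $(cat file.txt)"
```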


What to know about git status

The git status command is very useful. Before you make your new commit, here's what to know:

  • First, git status prints stuff about your current branch (on branch master for instance, and an ahead and/or behind count based on its upstream, which we haven't covered here).

  • Then, git status does a quick git diff that compares the current commit—the HEAD one—against Git's index. For each file that is the same, Git says nothing at all, but for each file that is different, Git says that this file is staged for commit.

    This therefore tells you what will be in the new commit that isn't in the current commit: new files, files that are removed, and files that are modified. You can eyeball this list and make sure that the right set of files appears here. If a file is missing, you can git add it, copying it from the work-tree to the index. If a file is there that should not be, you can git reset it, using a different form of git reset (one we didn't cover above) that copies the HEAD copy of the file to the index. In a version of Git 2.23 or later, you can git restore it.[10] The git status output shows a correct command to use.

  • Then, git status does a second git diff to compare what's in the index—i.e., what you're proposing to commit—against what's in your work-tree. For the files that are the same, it again says nothing; but for files that are different, it says that these are not staged for commit. You can eyeball this section to see if you forgot to git add some file(s).

  • Last, git status tells you about any untracked file that is not also marked with a .gitignore. This section can be very long, so Git will summarize some of them sometimes; to make it list everything individually, use git status -uall, or git status -u. If this list is too verbose and is in the way, listing files in .gitignore is a good way to shorten it.

Ideally there should be few if any gripes here, so that any unstaged and/or untracked files indicate something you forgot to git add, or forgot to list in a .gitignore. If that is the case (if the files showing up here should just be added), adding everything you should add becomes very easy: git add . (from the top level) will do the trick, as will git add -u if only already-tracked files have changed. This makes using Git a lot more pleasant.
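The sections of git status described above can be seen in compact form with git status --porcelain, where the first column is the HEAD-vs-index comparison, the second is the index-vs-work-tree comparison, and ?? marks untracked files. The file names in this sketch are invented.

```shell
set -eu
# Throwaway repository in a temp directory (all names here are illustrative).
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo a > staged.txt
echo b > modified.txt
git add .
git commit -q -m "first"

echo a2 > staged.txt
git add staged.txt            # staged for commit
echo b2 >> modified.txt       # changed, but not staged
echo c > untracked.txt        # untracked
echo d > scratch.tmp          # untracked, but ignored below
echo "*.tmp" > .gitignore

status="$(git status --porcelain)"
printf '%s\n' "$status"

# git add -u stages changes to tracked files only; untracked files stay untracked.
git add -u
status_after="$(git status --porcelain)"
printf '%s\n' "$status_after"
```

The ignored scratch.tmp never shows up, and after git add -u the untracked files are still listed with ??, which is why git add . is the form to use when new files need to go in too.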


[10] The git restore command is more capable than the old git checkout sub-commands here: you can do this with git reset or git restore but not with git checkout.

torek

Easy.

git reset --soft <last commit hash you want to save>

This will reset and “delete” all commits after the hash you provide, but will leave the files alone (both the work-tree and the index, so everything stays staged). To make sure, run

git status

git log

to double check before doing a git push origin. If you never pushed those commits to your remote, you won’t need to force the push, but if you did push them at some point, you’ll need to use git push -f origin <branch name>.
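One thing worth checking with the --soft approach, sketched below in a scratch repository: because a soft reset leaves the index untouched, an unwanted file from the removed commit is still staged and would go straight into the next commit. The file names are invented, and the extra git reset -- <path> step is a suggested addition to this answer's recipe, not part of it.

```shell
set -eu
# Throwaway repository in a temp directory (all names here are illustrative).
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"
git config user.email "user@example.com"

echo cfg > server.cfg
git add server.cfg
git commit -q -m "last commit we want to save"
keep="$(git rev-parse HEAD)"

echo big > big.model
git add big.model
git commit -q -m "oops"

# Soft reset: the branch moves back, the index and work-tree are untouched.
git reset -q --soft "$keep"
staged="$(git diff --cached --name-only)"
echo "still staged after --soft: $staged"

# Unstage the unwanted file; the work-tree copy stays where it is.
git reset -q -- big.model
staged_after="$(git diff --cached --name-only)"
echo "staged after unstaging: '$staged_after'"
```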

yes-siz