Why does committing one file commit all files in RStudio?

Question

I have a repository on GitHub with some files, and a folder on the RStudio server with some files as well. I have changed several files, but I want to commit only one file (test.Rmd), using these commands:

git init
git add test.Rmd
git commit -m "Adding some plots" 
git push

Instead of committing only this file, it commits all files in the folder that contains test.Rmd. Why did that happen? I tried exactly the same thing for another file in a different folder, and the commit worked. Before this, I think I had already run something like

git init
git add .

Is that why it added all the files in the directory?

From the git status results, I now suspect that the real problem is how to cancel adding these large files. I think I committed the changes to all files without realizing it, and only noticed when I ran git push at the end.

How are you confirming that more than one file is in the commit? — TTT, Aug 21 '21 at 19:36
@TTT In the terminal, the output shows many objects and ends by saying that large files were detected, and that's why the commit failed. The file I commit here is just a simple script with no relation to those large files. I probably added all the files in this folder by accident before committing? But running `git reset` does not solve it — MK Huda, Aug 21 '21 at 20:41
If you think you may have done something different than the commands you wrote in the question, then it's possible you committed them all. Is it the first commit in the repo that has too many files? If yes, it may be easiest to just delete the entire .git directory, and start over. Side note, you said you already have a repo in GitHub. It's not clear if what you're doing locally is related to the existing repo, which may already have many more files... — TTT, Aug 21 '21 at 20:54
@TTT Where can I find the .git directory? In RStudio, in the directory that I connected to GitHub, I only have .gitignore and other file types such as .Rmd, .R, .h5ad, etc. Unfortunately, I could not check the history. But I remember I did `cd project` (the folder name), `git init`, `git add .`, and `git remote add origin https://github.com/username/repository.git` (copied from the HTTPS URL of the GitHub repository I had already made, where I want to push the commit). Do you mean I need to delete the .git from GitHub? — MK Huda, Aug 21 '21 at 21:15
Try using git status just before committing to see what will be included in the commit — David Plumpton, Aug 21 '21 at 22:21
@DavidPlumpton I tried it already and it showed that the other files I do not want to commit are still included. — MK Huda, Aug 22 '21 at 12:40
Strange! Is it possible you have a Template Directory defined as described here https://git-scm.com/docs/git-init — David Plumpton, Aug 22 '21 at 20:19
@DavidPlumpton Using Git commands from the terminal is new for me. I used to commit directly from the Git menu bar in RStudio. I think my mistake was using `git add .`, which added all files in the directory, and then `git push -u`. — MK Huda, Aug 23 '21 at 09:01

score 1 · Answer 1 · answered Aug 22 '21 at 03:06

Your question is ambiguous at best, and contains some bad assumptions, so this answer is long.

Some background about Git commits and `git init`

All commits in Git always contain all files. That's how Git itself works.

Running git init will either:

create a new, empty Git repository in the current working directory, or
re-initialize the existing Git repository wherever it is.

You get the second behavior—re-initializing the existing Git repository—if Git sees that you are in some existing Git repository. The output of git init tells you which one it did:

$ git init
Initialized empty Git repository in [path, redacted]
$ git init
Reinitialized existing Git repository in [path, redacted]

Except for some special cases that almost certainly don't apply to how you're using Git, the "reinitialization" variant doesn't really do anything at all: your existing repository remains unchanged.
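Before running git init, it can help to check whether the current directory already belongs to a repository, and where its .git directory is. The following is a minimal sketch, assuming a throwaway directory created with mktemp stands in for the project folder; the variable names are made up for illustration.

```shell
# Hypothetical scratch directory standing in for the project folder.
project=$(mktemp -d)
cd "$project"

# Outside any repository this command fails, so the fallback text is used.
before=$(git rev-parse --is-inside-work-tree 2>/dev/null || echo "not a repository")
echo "Before git init: $before"

git init -q

# Inside a repository's working tree the same command prints "true".
after=$(git rev-parse --is-inside-work-tree)
echo "After git init: $after"

# The repository itself lives in a hidden .git directory at the top level.
gitdir=$(git rev-parse --git-dir)
echo "Git directory: $gitdir"
ls -d .git
```

Because .git is a hidden directory, a plain ls does not show it; ls -a does.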

When git init creates a new, totally-empty repository, there are no commits and therefore no branches yet. The next commit you make is thus the first commit ever. This commit is a bit special: it is a root commit, with no history. It contains whatever files you tell Git to have it contain, using git add.

After this point, though, you have an existing Git repository with existing commits. This includes the case where you use git clone to copy some existing repository (e.g., from GitHub) to a new Git repository on your own machine (e.g., your laptop). You will tell Git to check out some particular commit—usually, the tip commit of some branch name—which means Git will fill in both its staging area and your working tree with all the files from that commit.

Subsequently, you'll edit some files and maybe even create some new ones. You then run git add on one or more of these files. If you're git add-ing a file that already exists in Git's staging area, Git tosses out the old copy from its staging area and overwrites the staging area copy with a new copy made from your working tree. Or, if you git add a totally new file, Git copies the file into its staging area, as a new file.

In all of these cases, all the existing files in the staging area remain there. Your next git commit takes all the files that are in Git's staging area, and makes a snapshot from them.

A concrete example

Suppose you have an existing repository where the main branch (whatever its name is: GitHub now encourages people to use main, while older repositories tend to use master) has ten files in its most recent commit. You git clone this repository to your laptop, so your laptop Git software ("your Git") checks out this last commit, extracting the ten files into Git's staging area and your working tree.

You now change five of the ten files in your working tree, but run git add on only one of the five updated files. This means that your Git's staging area has ten files in it: nine files match the ones from the current commit, and one matches the updated file in your working tree. Four staging-area files differ from their four working-tree counterparts; the remaining six staging-area files match their working-tree counterparts.

If you now run git commit -m haaaaaands, you get a new commit containing the ten files exactly as they appear in the staging area right now. You still have all the updated working-tree files in your working tree, but the staging-area copies still match the previous commit's copies, so the new commit's copies match the older commit's copies, except for the one file on which you ran git add.

The new commit you just made becomes the current commit, which is now the most recent commit in your laptop's repository on the current branch. You can now use git push to send this commit to the GitHub repository; if and when you eventually do that, the commit they receive will match, bit-for-bit, the commit your Git stored in your laptop repository. It will have the 9-files-that-match-one-file-that-doesn't situation; the commit they get will have the previous commit as its parent; and so on.
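The same situation can be reproduced on a smaller scale. This is only a sketch under stated assumptions: a throwaway repository with three made-up files instead of ten, and a placeholder identity set through git config so that git commit works anywhere.

```shell
# Hypothetical demo repository; file names and contents are made up.
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# First commit: three files.
echo one > a.txt
echo two > b.txt
echo three > c.txt
git add a.txt b.txt c.txt
git commit -q -m "Initial commit"

# Change two files, but stage only one of them.
echo changed >> a.txt
echo changed >> b.txt
git add a.txt
git commit -q -m "Update a.txt only"

# The snapshot stored in the new commit still lists every file...
snapshot=$(git ls-tree -r --name-only HEAD)
echo "Files in the snapshot:"
echo "$snapshot"

# ...but only a.txt differs from the parent commit.
changed=$(git diff --name-only HEAD~1 HEAD)
echo "Files changed by the commit:"
echo "$changed"

# b.txt is still modified in the working tree and was not committed.
unstaged=$(git diff --name-only)
echo "Modified but not committed:"
echo "$unstaged"
```

A commit that "contains all files" is therefore not the same as a commit that changes all files: comparing a commit with its parent shows what was actually altered.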

Things to know about `git status`

First, git status tells you things about your current branch. It will say something like On branch main. This is your Git telling you that your laptop repository has main as the current branch. Your Git may also tell you that you are "ahead" and/or "behind" some other name, such as origin/main: this uses information stored entirely locally, on your laptop. This information may be out of date, depending on how active the other Git repository, over on GitHub or wherever it is, may be.

Next, if you're not in the middle of a conflicted merge—if you are, the rest gets more complicated—the git status command runs two comparisons:

First, it compares the files in the current commit to the files in the staging area. Some of these files will usually match exactly, since you didn't do anything with them since the time they were extracted from some commit. For those files, your Git says nothing at all.

Other files in the staging area won't match your current commit, because you ran git add on them for instance. In this case, your Git will say that these files are staged for commit. That simply means that the staging area copy differs from the current commit's copy in some way.

Note that some files in the staging area may be new. That is, those files do not exist at all in the current commit. For these files, Git will say that these are "new files".

Having listed files "staged for commit", or not found any files to list, your Git now goes on to compare the files in the staging area to the files in your working tree. As before, some files may match. Other files might be different, and there might even be files in your working tree that have no counterpart at all in the staging area: files that are new, as before.

This time, though, your Git will only tell you about changed files, saying that such files are not staged for commit. It does collect up a list of each of the new files as well, but holds off on them until the next part.

Having listed any files "not staged for commit", your Git goes on to tell you about untracked files. These are any files in your working tree that aren't in Git's staging area. In other words, these are "new" files.
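Each of the three lists can also be produced on its own, which makes it easier to see which comparison a given file belongs to. The sketch below assumes a throwaway repository with made-up file names and a placeholder identity; it is meant as an illustration, not as a description of how git status is implemented internally.

```shell
# Hypothetical demo repository; file names are made up.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo v1 > staged.txt
echo v1 > unstaged.txt
git add staged.txt unstaged.txt
git commit -q -m "Initial commit"

echo v2 >> staged.txt
git add staged.txt            # changed and added
echo v2 >> unstaged.txt       # changed, but not added
echo new > untracked.txt      # never added at all

# Comparison 1: current commit vs. staging area ("Changes to be committed").
to_commit=$(git diff --cached --name-only)
# Comparison 2: staging area vs. working tree ("Changes not staged for commit").
not_staged=$(git diff --name-only)
# Working-tree files with no staging-area entry ("Untracked files").
untracked=$(git ls-files --others --exclude-standard)

echo "To be committed: $to_commit"
echo "Not staged:      $not_staged"
echo "Untracked:       $untracked"

# The combined view, in short format.
git status --short
```

Only the files in the first list end up changed in the next commit. The last list holds the untracked files.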

The thing that's weird about untracked files is how they're separated out, into "untracked", as a separate category. The reason for this is that the Git authors expect a very large number of untracked files that should not be reported here. Git in particular is built to work with compilers that create "object files" and other "build artifacts" that, while they may be important, should not be added to commits and thus saved forever.¹

To this end, Git has an exclusion facility, via .gitignore and other exclusion files. Here, you list files that Git should just shut the ____ up about. It should not complain that these untracked files are untracked. Moreover, when these files are untracked, you can use an en-masse git add operation, such as git add ., to add all untracked files ... except for those marked "ignore".

What's misleading about .gitignore is that it will not ignore any file that is tracked. The word tracked here is defined as the opposite of untracked. An untracked file is a file that exists in your working tree, but not in Git's index. A tracked file is one that is in Git's index, whether or not it exists in your working tree. A tracked file is never ignored.

Good maintenance of .gitignore files makes Git much more pleasant to use: git status tells you only useful things; git add . adds only the correct things.
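The point that a tracked file is never ignored can be checked directly. The following is a minimal sketch, assuming a throwaway repository in which a tiny stand-in file named data.h5ad plays the role of a large data file that was added by mistake; the identity and file contents are placeholders.

```shell
# Hypothetical demo repository; data.h5ad is a tiny stand-in for a large file.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "pretend this is huge" > data.h5ad
echo "plot code" > test.Rmd
git add .
git commit -q -m "Initial commit"

# Adding an ignore pattern afterwards does not untrack the file.
echo "*.h5ad" > .gitignore
tracked_before=$(git ls-files data.h5ad)
echo "Tracked after editing .gitignore: $tracked_before"

# Remove it from the index only; the working-tree copy stays on disk.
git rm -q --cached data.h5ad
tracked_after=$(git ls-files data.h5ad)
git add .gitignore
git commit -q -m "Stop tracking data.h5ad"

# Now that the file is untracked, the ignore rule applies to it.
ignored=$(git check-ignore data.h5ad)
echo "Ignored now: $ignored"
```

Note that this only affects new commits. The earlier commit in this sketch still contains data.h5ad, so removing a file from the index does not remove it from history that has already been committed.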

¹The reason for this is that the build artifacts are—at least, ideally—completely reproducible from the original sources. We want to save only the originals, not the derived work-products. That saves—at least potentially—enormous amounts of space and time and human work later. Note that there is a lot of "ideal" and "potential" here. These things don't always work out as planned, and sometimes it's actually reasonable to save everything ever. Git isn't so great at that, though, so you probably don't want to use Git for that purpose.

Possible sources for "all files always committed"

If you run git add ., you are telling Git: scan my current working directory, find all updated files and all new files and any removed files, and use git add on each one to update your staging area copies. The only exceptions here are files listed in .gitignore or other exclusion files, that are not already tracked.

If you run git add *, the behavior depends somewhat on your command line interpreter: Unix-style CLIs (such as bash or zsh) have the shell expand the *, while MS-DOS style CLIs (such as CMD.EXE) pass the literal asterisk * to Git, which then expands the *. I won't go into all the details of the difference here, but this tends to do an en-masse add of a lot, or all, files, depending on the many details.

If you run git add -u, you tell Git to find updated files (modified or deleted files that are already tracked) and add them. New, untracked files are left alone.

You can have a pre-commit hook. Hooks in Git are rather complicated, but some software installers will not only install Git for you, but also set up some sort of automatic hook creation. (This is the kind of setup where the reinitialization of a Git repository can have an effect, although for it to do so, the installer has to put those hooks into a Git "template", which seems to be used rarely if ever.) A pre-commit hook can, depending on how you run git commit, run git add for you, even if you don't want it to.

If you run git commit -a, you are in effect telling Git to run:

git add -u
git commit

There's an interaction here with pre-commit hooks, so the two-command sequence is not exactly the same, but this could be the source of your problem.
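If one of these did stage more than intended, the staging area can be inspected and corrected before committing. This is a sketch under assumptions: a throwaway repository with a made-up second file other.R next to test.Rmd, and a placeholder identity. It uses git reset with a path, which un-stages a file without touching the working-tree copy; newer Git versions offer git restore --staged for the same purpose.

```shell
# Hypothetical demo repository; other.R is a made-up second file.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "plots" > test.Rmd
echo "helpers" > other.R
git add .
git commit -q -m "Initial commit"

echo "more plots" >> test.Rmd
echo "more helpers" >> other.R

# Preview what "git add ." would stage, without staging anything.
git add --dry-run .

# Suppose everything was staged by accident.
git add .
staged_all=$(git diff --cached --name-only)
echo "Staged after git add .:"
echo "$staged_all"

# Un-stage other.R again; its working-tree edits are kept.
git reset -q -- other.R       # Git 2.23 or newer: git restore --staged other.R
staged_now=$(git diff --cached --name-only)
still_modified=$(git diff --name-only)
echo "Staged now: $staged_now"
echo "Modified but not staged: $still_modified"
```

Running git diff --cached --name-only right before git commit shows which files the new commit will change.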

This is really useful information. Brilliant. But I still do not know how to cancel adding files that I do not want to commit. I think I accidentally added all files in the directory, so Git detects all the changes, and when I try to commit one file, the other files are still included. — MK Huda, Aug 22 '21 at 12:39
The `git status` output tells you how to "un-add" a file (using `git reset` or `git restore`, depending on your Git version). — torek, Aug 22 '21 at 13:24
This is what I got: [git status](https://drive.google.com/file/d/1N-GLoOkWPkK8Q4PjB5Mee-jYfhtEOAka/view?usp=sharing) Also, the problem is that it cannot push because there are two large files, data.h5ad and data.h5Seurat (that I actually do not want to add, commit, or push to GitHub), so I want to cancel adding these large files so that I can commit the other files. Would using git rm data.h5ad data.h5Seurat work? This is the latest state of my git status: [git status](https://drive.google.com/file/d/114t9O0A5mjBpDmeRGujNM6zYR2AjJU6k/view?usp=sharing) — MK Huda, Aug 22 '21 at 15:28
I don't have permission to access your Google drive (nor do I want it) - you should cut-and-paste the `git status` output into your question, if it's relevant. *Also, the problem is that it cannot push because there are two large files, such as data.h5ad and data.h5Seurat (that I actually do not want to add or commit and push them to Github* If that is the actual problem, why did you ask about a *different* problem? (See [this](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem).) Note that there are existing SO questions about removing already-committed large files. — torek, Aug 23 '21 at 00:05
It is the same problem. Those large files were included because all files in the folder were unintentionally added. My question was how to cancel adding, especially for those large files, but it also applies to other files I do not want to commit. — MK Huda, Aug 23 '21 at 08:50
If they are *already committed*, you have a very different problem! Note that `git push` pushes *commits* (which, at this point, you must already have), hence the different problem. — torek, Aug 23 '21 at 08:50
Oh no. I did not know about that. So sad. I was just trying to learn how to commit from the terminal when this issue happened. I think I had already committed, and the failure happened when it detected the large files. Since I am using the RStudio server, I cannot install LFS for large file storage. I think it must be done by the admin. Is there any other way? — MK Huda, Aug 23 '21 at 08:56
You have some choices: build new commits that don't contain the large files (and drop the old commits that do contain them), with the negative side effect that you're not getting these files saved, if you want them saved; install LFS and use the LFS migrate tools (GitHub's LFS support exists but when we investigated this at a previous job it was rather thin until you paid for it); use something other than GitHub. — torek, Aug 23 '21 at 09:00
Git-LFS is a set of wrappers for Git. You don't *have* to be an admin to install them; being an admin just makes it easier since then it's always there for all Git repos (as a non-admin you have to do more work). We ended up not using Git-LFS though as we didn't see enough value for the price. — torek, Aug 23 '21 at 09:01
Could you give the link about how to build a new commit? I just do not want to google it and get the wrong information and will make another mistake again please. I followed this link [lfs](https://git-lfs.github.com/) but I could not run the `git lfs install` — MK Huda, Aug 23 '21 at 09:11
The usual way people do this seems to be to use BFG or `git filter-branch`: see https://stackoverflow.com/q/2100907/1256452 and also https://stackoverflow.com/q/68477232/1256452. I recommend doing this on a *copy* of a repo (keep the old ones around until you're comfortable with the new ones; treat them as completely separate projects). — torek, Aug 23 '21 at 09:21

score 0 · Accepted Answer · answered Aug 24 '21 at 14:37

I solved this issue by using Git LFS for the large files. Since I am using the RStudio server, I asked the admin to install Git LFS, and then I ran these commands:

git lfs install
git lfs track "*.h5ad" "*.h5Seurat"
git add .gitattributes
git lfs migrate info
git lfs migrate info --everything
git lfs migrate import --everything --include="*.h5ad,*.h5Seurat"
# When asked "override changes in your working copy? [Y/n]", answer Y

After that, the commits could be pushed to GitHub with git push. Note that *.h5ad and *.h5Seurat are the large-file extensions that I want Git LFS to handle. I followed this link: git lfs
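One way to confirm that both extensions are routed through LFS is to ask Git which filter applies to a given path. This is a sketch under assumptions: the .gitattributes lines are written by hand here to mimic what git lfs track records (one line per pattern), so the check runs even where Git LFS itself is not installed, and the file names are examples only.

```shell
# Hypothetical demo repository; no real data files are needed for this check.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Written by hand to mimic the entries "git lfs track" adds, one per pattern.
printf '%s\n' \
  '*.h5ad filter=lfs diff=lfs merge=lfs -text' \
  '*.h5Seurat filter=lfs diff=lfs merge=lfs -text' > .gitattributes

# Ask Git which filter would apply to each path.
attr_h5ad=$(git check-attr filter -- data.h5ad)
attr_seurat=$(git check-attr filter -- data.h5Seurat)
attr_rmd=$(git check-attr filter -- test.Rmd)

echo "$attr_h5ad"
echo "$attr_seurat"
echo "$attr_rmd"
```

Paths that report the lfs filter are handled by Git LFS; anything reported as unspecified is stored as an ordinary Git file.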
