First, I recommend avoiding git pull, at least for newbies. It does too many things, and by doing that, hides important information that you therefore won't be aware of. In fact, it really just does two things (it runs two different Git commands for you) and it's meant to be convenient. But sometimes it's just not convenient.
Before we go any further, though, we need to describe how Git really works. Git is not about files—although commits contain files—and Git is not about branches, although branch names are how we (and Git) find commits. Git is, in the end, all about commits.
It's also worth mentioning that, per comments, you may be running into a more advanced situation, in which Git is deliberately introducing differences between Git's index copy of files, and your work-tree copy of those same files. But before we can discuss that properly, you need a way to get to these more advanced topics.
Git is about commits, so let's define commits
Here's a laundry list of what you need to know about commits.
- Each commit is numbered. These numbers are not simple counting numbers (we don't have commit #1 followed by #2 followed by #3, for instance), but they are numbers, with each commit getting its own unique number. Each number is actually a cryptographic checksum of the full contents of that commit.
- As a result, it's impossible to change anything about a commit, once it's made: the commit's number depends on every bit of data inside the commit. That makes the contents of a commit completely read-only. They're also mostly permanent (and we won't go into how one gets rid of a commit, here). If you take an existing commit out, change anything in it, and put it back into Git, what you get is a new unique commit with a new unique number (hash ID). The old commit still exists.
- Each commit contains a full snapshot of every file that Git knew about, at the time you (or whoever) made the commit. This is the main data inside each commit: a frozen-for-all-time snapshot. The files in the commit are in a special, read-only (because of hash IDs), Git-only, compressed and de-duplicated form. This takes care of the fact that most commits mostly re-use files from previous commits: so the new commits don't really take much space after all, except for any changed files.
- Each commit also has some metadata, or information about the commit itself. This includes who made the commit (name and email address) and a date-and-time-stamp that includes the exact second at which you (or whoever) made the commit. This means we can't predict what a future commit's hash ID will be, unless we know exactly when someone will make it and exactly what they'll put inside it.
- Inside the metadata, Git stores the commit number of the previous commit. Merge commits are defined by having the commit number of more than one previous commit, though we won't go into any detail about how this works here.
- Git can find any commit (indeed, any internal Git object; there are 3 other kinds of object) by its unique hash ID. That is, if we know some commit's ID, Git can easily find it, if we have it at all.
What these last two points mean is that, if we have a string of ordinary commits, we can draw them like this:
... <-F <-G <-H
where H stands in for the actual hash ID of the last commit in the chain. Commit H contains, in its data, a full snapshot of all files, and in its metadata, the information about who made it and so on. But commit H also contains, in its metadata, the hash ID of earlier commit G. So Git can use this to read out G, which also has a snapshot and also has the hash ID of earlier commit F, which Git can use to find F, and so on.
Git calls the previous-commit hash ID stored in each commit the parent of the commit. So children know who their parents are. But parents don't know their children, because the children don't exist yet when the parents are born, and we have no idea what their hash IDs will be.
What Git needs, then, is a quick way to find commit H. This is where names, including branch names, come in.
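To make the snapshot-plus-metadata picture concrete, here is a minimal sketch that builds a throwaway repository with two commits and prints the raw contents of the newer one. The repository location, file name, and identity values are made up for illustration; the sketch assumes a Unix-like shell with Git installed.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # use the name master whatever the local default is
git config user.name "Example User"       # hypothetical identity, used only in this demo
git config user.email "user@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first commit"
echo two >> file.txt
git add file.txt
git commit -q -m "second commit"

# Print the raw commit object: a tree line (the snapshot), a parent line,
# author and committer lines with time stamps, then the log message.
git cat-file -p HEAD

# The parent hash ID stored in the metadata is what lets Git walk backwards.
parent_from_metadata=$(git cat-file -p HEAD | sed -n 's/^parent //p')
parent_from_git=$(git rev-parse HEAD^)
echo "parent recorded in commit: $parent_from_metadata"
echo "parent resolved by Git:    $parent_from_git"
```

The first commit has no parent line at all, which is how Git knows the chain stops there.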
Branch names let Git find the last commit
Given:
...--F--G--H
we still need a quick way to have Git find the hash ID of commit H. To do this, Git stores hash ID H in a branch name, such as master or develop:
...--F--G--H <-- master
If we have more than one name, the two names might identify the same commit:
...--F--G--H <-- develop, master
Now we need a way to know which name we are using. This is where the special name HEAD comes in: we have Git attach the name HEAD to one of the branch names. That's the name we're using:
...--F--G--H <-- develop, master (HEAD)
Here, we're using the name master to get commit H. If we use git checkout develop or git switch develop, we get:
...--F--G--H <-- develop (HEAD), master
We're still using commit H, but now we're using it through the name develop. Note that all commits are on both branches at this point. (This is common in Git: many commits are on many branches at the same time.)
To make a new commit, we do some stuff that you've seen but we're not describing here yet, then run git commit. Git makes a new commit I from the files it knows about. The parent of new commit I is existing commit H:
...--F--G--H
            \
             I
and the trick is that the last step of git commit is to write the new commit's hash ID into whichever name HEAD is attached to:
...--F--G--H <-- master
            \
             I <-- develop (HEAD)
New commit I is now only on develop, not on master. Commits up through H are on both branches.
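The drawings above can be reproduced in a scratch repository. This is a sketch under the assumption of a Unix-like shell; the file name and identity are placeholders. It shows two names pointing at one commit, HEAD moving between names, and a new commit advancing only the name HEAD is attached to.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # hypothetical identity for the demo
git config user.email "user@example.com"

echo h > file.txt
git add file.txt
git commit -q -m "commit H"

git branch develop                 # a second name for the same commit
git rev-parse master develop       # prints the same hash ID twice
git symbolic-ref --short HEAD      # HEAD is attached to master

git checkout -q develop            # same commit, different name
git symbolic-ref --short HEAD      # HEAD is now attached to develop

echo i >> file.txt
git add file.txt
git commit -q -m "commit I"        # writes the new hash ID into develop only

git rev-parse master develop       # now two different hash IDs
git --no-pager log --oneline master..develop   # commits on develop but not on master
```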
Git's index and work-tree, or, files Git knows about
We've just said that each commit has a full snapshot of every file that Git knows about, but these files are in a special Git-only format. Non-Git programs on your computer can't use these files, and nothing—not even Git itself—can change them. So they're quite useless for getting any new work done.
What Git needs, and therefore has, is an area where you can actually see and work with / on your files. Git calls this area your working tree or work-tree. These files are stored in the ordinary way, so that you can do anything you like with them. This of course means there are two copies of each file, but that's pretty much necessary for any version control system: there's the frozen, committed copy, and a useful one.
Git now diverges from the way most version control systems work. Git just provides your work-tree files (from a commit). Git doesn't actually use them. Other version control systems would use your work-tree: you would run their "commit" verb, whatever it is, and they would scan through your work-tree to see what you have changed and do whatever it takes to commit that. Git doesn't do that at all.
Instead, Git keeps a third copy—well, sort of—of each file. This third "copy" is kept in Git's de-duplicated format, compressed and Git-ified, ready to go into a new commit.
Git keeps this third "copy" in what Git calls, variously, the index, or the staging area, or—rarely these days—the cache. When you first extract some commit, you get all three of these copies: the frozen one in the commit, the index "copy", and the work-tree copy. Because the index copy is de-duplicated, the ones that match the current commit don't actually take any space: there is no actual copy. But, unlike the truly frozen copy in the current commit, you can have Git replace this index copy.
The main Git command for replacing the index copy of some file is git add. This command tells Git: Make your index copy of some file match the copy in my work-tree. You need to use this command if you have changed a file, so that the changed file gets copied back into Git's index, ready for the new commit.
Git makes new commits from Git's index
The index or staging area therefore acts as the proposed next commit. This ignores some deeper aspects of the index (for instance, during a conflicted merge, it takes on an expanded role), but it's a good way to think of it initially: Git's index holds your proposed next commit. This starts out matching your current commit.
Hence, when you use git checkout to pick some commit to work with, Git fills both its own index and your work-tree with the files from that commit. From this point on, your work-tree is yours to fiddle with, but at some point, you need to tell Git to update Git's index, so that Git knows about the updated files. The git commit command just takes whatever is in Git's index at that time, to make the new commit.
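A small experiment makes the point that git commit reads the index and not the work-tree. In this sketch (scratch repository, placeholder file name and identity), the file is staged at "version 2" and then edited again before committing; the new commit should contain the staged version.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # hypothetical identity for the demo
git config user.email "user@example.com"

printf 'version 1\n' > file.txt
git add file.txt
git commit -q -m "initial"

printf 'version 2\n' > file.txt
git add file.txt                   # index copy now holds "version 2"
printf 'version 3\n' > file.txt    # work-tree copy only; never added

git commit -q -m "second"          # snapshot is taken from the index

git show HEAD:file.txt             # the committed copy
cat file.txt                       # the work-tree copy, still different
```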
Comparing commits: git diff
Before we jump into git status below, let's pause briefly to consider how commits hold snapshots, yet Git shows you changes. How can this work?
The answer is simple enough. Given two snapshots (such as that in a parent commit and a child), Git just extracts, to a temporary area, both snapshots. Then, for each file in the two snapshots, Git compares the files' contents. If they are the same, Git need not say anything about this file (and in fact, given the way Git stores files inside commits with de-duplication, it knows in advance if they are the same and doesn't even need to extract them at all). If they are different, Git uses a difference engine or text comparison algorithm to figure out some sequence of operations that would change the old file into the new one. That's a diff and is what you see from git diff or git show (though git show adds information about the commit, too).
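As a rough illustration of comparing two snapshots, the sketch below commits two files, changes only one of them, and then asks Git to compare parent and child. The file names are invented for the demo. The unchanged file has the same internal hash ID in both commits, which is how Git can skip it without reading its contents.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # hypothetical identity for the demo
git config user.email "user@example.com"

echo alpha > a.txt
echo beta > b.txt
git add a.txt b.txt
git commit -q -m "first"

echo "alpha, changed" > a.txt
git add a.txt
git commit -q -m "second"

# Only the file whose content differs between the two snapshots is listed.
git --no-pager diff --name-only HEAD~1 HEAD

# The untouched file is stored once: both commits refer to the same blob.
git rev-parse HEAD~1:b.txt HEAD:b.txt

# The full diff: instructions for turning the old a.txt into the new one.
git --no-pager diff HEAD~1 HEAD
```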
git status
Dealing with these three-copies-of-each-file can be messy. But most of the time, most of the copies are the same. If Git would just run a diff ... well, that's what git status does:
- First, git status compares each file in the current (HEAD) commit to each file in Git's index. When the two files match, Git says nothing. When they're different, Git prints the name of the file as a file staged for commit.
- Then, git status compares each file in Git's index to that same file in your work-tree. When the two files match, Git says nothing. When they're different, Git prints the name of the file as a file not staged for commit.
This means that, for a not-staged-for-commit file, you could run git add to copy that file into Git's index. Now the index and work-tree copies will match and the file won't be listed in the second set of files, but of course, there's a good chance it will now be listed in the first set of files, the "staged for commit" ones.
This also means that anything in the first set of files, "staged for commit", would be different in a new commit. The index copy of that file doesn't match the HEAD copy. That's useful information.
Note that one file can appear in both listings. If you:
- modify a work-tree file, then
- copy the updated file to Git's index, and then
- modify the file some more,
you'll have all three copies different, and each of the two git diff steps will find a difference.
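The three-step sequence above can be tried directly. In this sketch (scratch repository, placeholder names), the short status format shows two status letters for the file: the first column is the HEAD-versus-index comparison and the second is the index-versus-work-tree comparison.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # hypothetical identity for the demo
git config user.email "user@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "initial"

echo two >> file.txt       # modify the work-tree file
git add file.txt           # copy the updated file into the index
echo three >> file.txt     # modify the work-tree file some more

git status --short                 # one line, with a letter in each column
git diff --cached --name-only      # HEAD vs index: "staged for commit"
git diff --name-only               # index vs work-tree: "not staged for commit"
```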
Linear development (no merges) and git merge
We showed above how we can make a new branch name like develop and start making new commits:
...--G--H <-- master
         \
          I <-- develop (HEAD)
Having made several new commits (by modifying work-tree files, using git add to copy them to Git's index, and running git commit) we might now have:
...--G--H <-- master
         \
          I--J <-- develop (HEAD)
If we now run:
git checkout master
we will adjust our work-tree, and Git's index, to make commit H the current commit, with HEAD attached to master:
...--G--H <-- master (HEAD)
         \
          I--J <-- develop
Commits I and J still exist, and Git can find them using the name develop, which locates commit J. Commit J stores commit I's hash ID as J's parent, and I stores H's hash ID in turn, so all commits up through H are on both branches, and then I and J are only on develop.
If we now run git merge develop, Git will notice that we don't have any of our own commits exclusively on master. Instead of doing a true merge (which we won't cover here) our Git will do a fast-forward operation. Essentially, our Git will now just check out commit J directly, while also dragging the current branch name forward like so:
...--G--H
         \
          I--J <-- develop, master (HEAD)
(There's no reason not to straighten out the drawing now, I just haven't bothered.)
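Here is a sketch of that fast-forward in a scratch repository, with invented file contents. It uses git merge --ff-only, which behaves like plain git merge in this situation but refuses to do anything if a fast-forward is not possible.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # hypothetical identity for the demo
git config user.email "user@example.com"

echo h > file.txt
git add file.txt
git commit -q -m "commit H"

git checkout -q -b develop         # new name develop, HEAD attached to it
echo i >> file.txt
git commit -q -a -m "commit I"
echo j >> file.txt
git commit -q -a -m "commit J"

git checkout -q master             # back to commit H
git merge -q --ff-only develop     # drag master forward to commit J

git rev-parse master develop       # both names now identify the same commit
git --no-pager log --oneline       # H, I, J in a straight line, no merge commit
```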
git fetch, remotes, and remote-tracking names
Git gets most of its real power through distributed repositories. (Well, that and merging, but these days merging is often driven via distributed repositories.) To distribute a repository, we basically make copies of it.
Each new copy shares the existing commits, with their big ugly hash IDs. That is, every Git everywhere uses the same cryptographic algorithm, so that two Gits that have a given commit (which contains the same data and metadata) will agree that that commit gets that hash ID. This means two Gits that haven't met at all before, or have been updated since they last met, can just exchange hash IDs, to see if they have the same commits.
Each new copy, though, gets its own branch names. Since Git just uses these names to find the last commits, that's OK! We generally don't remove any commits ever; Git is instead built to add new commits.
So, suppose two Gits were hooked together earlier, and have the same commits (and often the same branch names too), and one of the two just got some new commits. We can hook the other Git—the one that needs the new commits—back up to the one that got the new commits, and grab those commits from the more-advanced Git.
This is a git fetch operation, and it even handles the case where both you and they have made new commits. Suppose you both started out with:
...--G--H <-- master (HEAD)
Since then, you made two new commits, which for no obvious reason yet we'll draw like this:
          I--J <-- master (HEAD)
         /
...--G--H
Suppose that they too made two new commits. Because their name and email address are different and/or they made the new commits at a different time, their new commits have two more hash IDs, which we'll just call K and L. You now have your Git call up their Git and get, from them, their new commits:
git fetch origin
where origin remembers the URL for their Git. This short name origin is a remote, and remembering that URL is one of its main functions. (If you have more than one other Git you talk to, you can have more than one remote.)
This gives you:
          I--J <-- master (HEAD)
         /
...--G--H
         \
          K--L <-- ???
This drawing should be clear enough: they made commit K such that its parent is H, like your own commit I. They then made commit L such that its parent is K. Your Git saw that their master named commit L, and got commit L from them. Your Git saw that the parent of L was K and got commit K from them too. Your Git saw that the parent of K was H, and you already have H (and everything earlier), so your Git knew that was all it needed.
But: your Git uses your name master to find commit J. What name will your Git use to remember hash ID L, the latest commit on their master?
This is where remote-tracking names come in. Your Git takes their name master and renames it to your own origin/master, so that in your Git, you get:
          I--J <-- master (HEAD)
         /
...--G--H
         \
          K--L <-- origin/master
In this kind of situation, you would normally now need to use git merge or git rebase to combine your work (your new commits) and theirs. But let's assume now that you didn't make any new commits, so that instead of the above, you have the simpler:
...--G--H <-- master (HEAD)
         \
          I--J <-- origin/master
(we're using I-J for their commits now, since you didn't use up the letters this time).
Now we have that same situation in which your git merge can perform a fast-forward instead of a real merge. You can run:
git merge origin/master
and get:
...--G--H--I--J <-- master (HEAD), origin/master
Your Git and their Git are now back in sync: your Git and their Git have the same set of commits, ending at commit J, and your master and their master (which your Git remembers under the name origin/master) both identify commit J.
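The simpler case can be simulated on one machine by using two directories as "their" Git and "your" Git. This is only a sketch: the directory names theirs and mine are invented, and a local path stands in for the URL that origin would normally remember. It shows that git fetch moves origin/master but leaves your own master alone until you merge.

```shell
set -e
base=$(mktemp -d)
cd "$base"

# "Their" repository, with commit H on master.
git init -q theirs
cd theirs
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # hypothetical identity for the demo
git config user.email "user@example.com"
echo h > file.txt
git add file.txt
git commit -q -m "commit H"

# "Your" repository starts as a copy; origin remembers where theirs lives.
cd "$base"
git clone -q theirs mine

# They add commits I and J after you cloned.
cd "$base/theirs"
echo i >> file.txt
git commit -q -a -m "commit I"
echo j >> file.txt
git commit -q -a -m "commit J"

# You fetch: new commits arrive and origin/master moves, master does not.
cd "$base/mine"
git fetch -q origin
master_after_fetch=$(git rev-parse master)
tracking_after_fetch=$(git rev-parse origin/master)
echo "master:        $master_after_fetch"
echo "origin/master: $tracking_after_fetch"

# Fast-forward your master to catch up.
git merge -q --ff-only origin/master
git rev-parse master origin/master        # same hash ID twice
```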
git push is like git fetch, only different
Suppose you have new commits and some other repository doesn't. You can, in this case, run git push to send your commits to them, rather than going on to the other machine that has the other Git repository and running git fetch.
There are several key differences here though:
- Of course, git push sends commits rather than receiving them.
- At the end, though, git push does not use a remote-tracking name. Instead, your git push will ask them to set some name (or names) of theirs, typically one of their branch names.
This last point really messes with everything. In particular, it usually means that you cannot push to a Git repository that has a work-tree. You must push to what Git calls a bare repository. Without defining that yet, let's see why this is a problem.
Suppose you and they both start out with:
...--G--H <-- master (HEAD)
as usual. You make your new commits and run:
git push origin master
The origin here supplies the URL as usual, and the master here supplies two things:
- Your Git needs to know which commit(s) to offer to their Git. Your master here supplies the hash ID for your latest master commit.
- At the end, your Git needs to know which name to ask them to set. Your master here supplies the name your Git will give to their Git. (You can make your Git supply a different name using git push origin master:newname, for instance.)
So, you send some new commits I-J to their Git, and ask them to set their master to identify commit J now. But if they do, what happens to their Git's index and their work-tree?
If they have commit H checked out right now, and someone is over there working on it, it would be awfully rude, at the least, to replace their work-tree files while that person is editing them.
If their Git doesn't update their work-tree and their Git's index, though, they'll still be working with commit H. If their Git updates their master to identify commit J and they go to make a new commit, they'll put everything back to the way it was at the time commit H existed:
...--H--I--J--K <-- master (HEAD)
Their commit K will essentially undo everything you did in I-J.
Git's answer to this is to refuse to update their branch master. Their Git just rejects your attempt to push, saying that you cannot push to the checked-out branch.
There are options, in modern Git at least (not in old versions like Git 1.7 or 1.8 as found in some distributions), to change this sort of behavior, but you should not set them unless and until you understand the above. So usually, on a server that receives git push actions, we use these so-called bare repositories. A bare repository is simply a repository that has no work-tree. With no work-tree, no one can be working there, and the situation described above simply never occurs.
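The refusal, and the bare-repository alternative, can be observed with local directories. In this sketch the names nonbare, bare.git, mine, and server are all invented, and it assumes the receiving side uses Git's default setting for pushes to a checked-out branch (no configuration overriding it).

```shell
set -e
base=$(mktemp -d)
cd "$base"

# A non-bare repository with master checked out, plus an empty bare one.
git init -q nonbare
(
  cd nonbare
  git symbolic-ref HEAD refs/heads/master
  git config user.name "Example User"     # hypothetical identity for the demo
  git config user.email "user@example.com"
  echo h > file.txt
  git add file.txt
  git commit -q -m "commit H"
)
git init -q --bare bare.git

git clone -q nonbare mine
cd mine
git config user.name "Example User"
git config user.email "user@example.com"
echo i >> file.txt
git commit -q -a -m "commit I"

# Asking the non-bare repository to move its checked-out master is refused.
if git push -q origin master 2>/dev/null; then
  push_to_nonbare=accepted
else
  push_to_nonbare=refused
fi
echo "push to non-bare repository: $push_to_nonbare"

# The bare repository has no work-tree to disturb, so the same push works.
git remote add server "$base/bare.git"
git push -q server master
git push -q server master:newname   # same commits, but ask them to set newname
```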
push, fetch, and ... pull?
If the opposite of push is fetch (and in Git, that's as close as we get to opposites here), what is git pull, exactly? The answer is a little complicated because of history: git push and git pull actually predate remote-tracking names, and in those days, git fetch was of limited use.
Typically, right after a git fetch, if you've gotten some new commits on whatever branch or branches you care about, you will need to run a second Git command to actually incorporate those new commits. This second command is usually one of git merge or git rebase. So that's what git pull does: it runs git fetch, then it runs a second Git command.
There are several problems here though:
- Which second command is the right one? If you know in advance, that's OK, but what if you don't? (Then git pull is the wrong command to use!)
- What if you'd like to look at what came in first, before running any more commands? (Then git pull is the wrong command to use!)
- What if something goes wrong? Well, if you know all about git pull and the two commands it runs, you can see which part(s) went wrong, and know what to do to fix things. But if not ... well, git pull was the wrong command to use.
For these reasons, I tend to avoid git pull myself: I run git fetch, see what happened, then decide whether I want to run git merge --ff-only, git merge, git rebase, or something else entirely.[1] Still, if it does what you want, feel free to use it: just be aware that it's shorthand for multiple steps.
[1] I could probably use git pull --ff-only more regularly, but I tend to view git pull itself as a bad habit, and back in the Dim Time of early Git, I got burned by bugs in git pull, so I just find it overall inconvenient. It's supposed to be more convenient, but I find it less convenient, so I just don't use it.
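The fetch-then-look-then-decide routine might look like the following sketch. It sets up a deliberately diverged pair of repositories (directory and file names invented) so that the inspection step has something to report: one commit only you have, and one commit only they have. In that state git merge --ff-only refuses, and it's up to you to choose between a real merge and a rebase.

```shell
set -e
base=$(mktemp -d)
cd "$base"

git init -q theirs
(
  cd theirs
  git symbolic-ref HEAD refs/heads/master
  git config user.name "Example User"     # hypothetical identity for the demo
  git config user.email "user@example.com"
  echo h > file.txt
  git add file.txt
  git commit -q -m "commit H"
)
git clone -q theirs mine

# They add a commit; you add a different one.
(
  cd theirs
  echo theirs > theirs.txt
  git add theirs.txt
  git commit -q -m "commit K"
)
cd mine
git config user.name "Example User"
git config user.email "user@example.com"
echo mine > mine.txt
git add mine.txt
git commit -q -m "commit I"

# Step 1: fetch, which only updates origin/master.
git fetch -q origin

# Step 2: look at what came in before running anything else.
ahead=$(git rev-list --count origin/master..master)
behind=$(git rev-list --count master..origin/master)
echo "ahead by $ahead, behind by $behind"
git --no-pager log --oneline master..origin/master   # their new commits

# Step 3: decide. A fast-forward is not possible here, so --ff-only refuses.
if git merge -q --ff-only origin/master 2>/dev/null; then
  result=fast-forwarded
else
  result=diverged
fi
echo "result: $result"
```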
Your situation
You actually have three Git repositories involved. You wrote:
What I don't understand is why there are "local changes". I didn't make any local changes on the AWS git. Just on my local machine, which then pushed to Gitlab and pulled to AWS.
So, it sounds like you:
- have a repository on your local machine, where you make your commits
- have a bare repository on GitLab (no work-tree, safe to push)
- have a non-bare repository on AWS
The changes you are seeing on the AWS system must be "changes" (differences, really) in your work-tree and/or index. Use git status to find out what files these are in, and then figure out how they came about.
Whether or not these are coming about because of .gitattributes files and CRLF line endings (as in the question "`git` shows changed files after cloning, without any other actions"), you'll need to pay extra-close attention to the three copies of each file:
- The committed copy, in Git's de-duplicated internal format, comes from the index copy, or from some earlier commit. Once committed, these copies are frozen for all time, regardless of what actual data they contain.
- The index copy is also in Git's de-duplicated internal format. It came out of some commit at some point, and then if/when you used git add, Git replaced that with a new copy generated by reading the work-tree copy, compressing it, and Git-ifying / de-duplicating the file.
- Your work-tree copy on the AWS machine was produced by Git expanding the compressed, de-duplicated index copy, and/or was replaced by whatever you might be running on the AWS system that might overwrite your work-tree copy.
Each of the copy steps, from index to work-tree (expand) and vice versa (compress and Git-ify and de-duplicate), can insert arbitrary changes. Some of these are line ending transformations, as directed by .gitattributes and core.autocrlf and so on. Others are defined by filter drivers, which generally require coordination with your .git/config file.
It's hard to see what's actually in a committed file or the copy in Git's index, but you can use git cat-file -p to access the raw data. To guarantee that you're seeing the raw data untransformed (i.e., with no attributes or filter drivers getting in the way), you can use the raw hash ID of the internal blob object.[2] So, use:
git rev-parse HEAD:path/to/file
or:
git rev-parse :path/to/file
to find the blob hash ID of the given file. This hash ID is unique to the file's contents (which means that if multiple different file names have the same data, Git stores the data only once: that's the de-duplication in action). Using git cat-file -p on the hash ID will write that data to stdout. Be careful on Windows, where stdout itself is subject to mutation;[3] but on Unix-like systems, redirecting stdout to a (regular, ordinary, every-day) file means you can use whatever tools you like on the file to see the raw data.
[2] Note that git cat-file -p works with names too: git cat-file -p HEAD:path/to/file, for instance. The program shouldn't make any transformations to this data based on the file path unless you specifically ask for it, but by using the raw hash ID, we can guarantee that it can't do that, as it has no idea what the path name was.
[3] I'm not entirely sure how this works, but note that PowerShell, for instance, can affect the stdout encoding: see the question "Changing PowerShell's default output encoding to UTF-8".
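The inspection steps described above can be sketched as follows, on a Unix-like system. The file name crlf.txt is made up, and the demo switches off line-ending conversion inside its own scratch repository so that the CRLF bytes written to the work-tree are stored as-is; your real repository's settings may well differ, which is the whole point of looking.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # hypothetical identity for the demo
git config user.email "user@example.com"

# Keep the demo independent of outside settings: no line-ending conversion.
git config core.autocrlf false
echo '* -text' > .git/info/attributes

printf 'line one\r\nline two\r\n' > crlf.txt
git add crlf.txt
git commit -q -m "add a file with CRLF line endings"

committed=$(git rev-parse HEAD:crlf.txt)   # blob hash ID in the current commit
staged=$(git rev-parse :crlf.txt)          # blob hash ID in Git's index
echo "committed blob: $committed"
echo "index blob:     $staged"

git cat-file -t "$staged"                  # object type
git cat-file -s "$staged"                  # size of the stored data in bytes
git cat-file -p "$staged" | od -c          # raw bytes, with \r visible if stored
```

If the two hash IDs differ, the index copy and the committed copy hold different data, which is what git status reports as "staged for commit".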