
I started to work with Git. I am working with a remote repository, and I cloned this repo to my local machine. I read that before I make changes in the local repository, I need to update my local repository (to match the remote repository), because otherwise there may be conflicts.

I am trying to understand why I really need to do it, and how I can do it (with which Git command)?

QmaQ
  • I think there are already existing posts talking about this topic: https://stackoverflow.com/a/1783426/3237248 – OscarDOM Sep 05 '22 at 20:38

1 Answer


This isn't a great fit for StackOverflow, because it's really a question about using a distributed system (where things are happening both on your own computer, and on other computers) and/or asking for a tutorial on how to use Git. It therefore runs afoul of the "overbroad" issue and the "opinion" issue. Still, let's see what I can do here:

And I read that before I make changes in the local repository, I need to update my local repository (to match the remote repository), because otherwise there may be conflicts.

First, you don't need to update your local repository. You may well want to update your local repository. Second, updating your local repository does not, and cannot, avoid conflicts, and conflicts are not evil. The reason to do this is to see what others have already done and thereby take their already-written code into account as you write your own code. But you and they will be working at the same time and in some cases this will result in future clashes.

If everyone else is highly active, though, you run the risk here of perpetually reading what others have done and never writing anything of your own. If you are going to write something of your own, at some point you must stop chasing others and get to work. If the repository you've cloned is very inactive, this may well occur naturally: you'll go to pick up new work and there is none, and therefore there's nothing in your way. If not, it's your decision. There are middle grounds, such as discussing ongoing tasks with colleagues or coworkers (via some other communications channel: meetings, Slack, whatever). Use these as you see fit.

When it comes to actually obtaining updates, though, remember that Git's way of dealing with this is commits. A Git repository is mainly a storehouse, or database, of commits (and other supporting objects, but they exist to support the commits). Each commit has a unique number, and when I say unique, I mean unique: not sort-of-unique or mostly-unique. Given two separate Git repositories, they can look at each other's commit numbers. If repository R1 has commit a123456... and repository R2 also has commit a123456..., those are the same commit, because there is only one commit anywhere in the universe with that number. If R1 has some commit, then either R2 lacks a commit with that number, or R2 has that commit.

Git does this unique-number-assignment trick whenever you (or anyone) make a new commit, without talking with any other Git repository at the time. This is the true magic behind Git: that it can somehow assign a unique number to every new commit. In fact, this trick is mathematically impossible (see the pigeonhole principle) and Git will someday fail, but the sheer size of the commit-numbering space puts that day far enough into the future that we hope it will never happen in any of our own lifetimes. (It can be deliberately, maliciously broken; see How does the newly found SHA-1 collision affect Git? In reality this is still impractical.)

I [am] trying to understand why I really need to do it ...

See above.

and how can do it (with which command in git)?

The most basic command here is git fetch. This command instructs your Git software to call up some other Git-implementing software. Your Git software looks at your repository, which contains commits. Their Git software looks at their repository, which also contains commits. Any commits that they have, that you lack, that they're willing to hand over—normally that last part is "all of their commits", but you don't control them here; you're depending on cooperation—they do in fact hand over, and your Git stuffs those new commits into your repository. You now have everything you had before, plus all of the commits they handed over.

So, now you have everything they have (assuming they handed everything over) plus anything you had that they didn't. Your repository is now the Best Repository Ever! It has everything!

And that's how Git works. Everybody's repository is the best one. You just run git fetch and now your repository is the best one.

Later, you may find out that yours isn't the best after all—or at least, they don't think so. But there's nothing you can do about that now, as long as you used git fetch. So that's the thing you do.

But git fetch just obtains commits: what good is that?

Well, first, git fetch does in fact obtain commits, but it doesn't only do that. Having obtained their commits, your Git records their most recent commits, as recorded by their branch names, under your own remote-tracking names.

To understand this, you need to know what branch names are and do for you, and to understand that, you need to know what commits are and do for you. This gets us back to the fact that Git is really all about commits.

We'll get back to this in a moment, but let me add here one point about git clone.

Clone = make new empty repository, then fetch, then check out

... I clone this repo to my local.

The git clone command that you used is actually a kind of convenience command. It runs up to five other Git commands for you, plus one non-Git command to make a new empty folder:

  1. It runs whatever command on your system makes the new empty folder. The rest of the commands run in this folder.
  2. It runs git init, to create a new, totally-empty repository. This repository can hold commits and other objects, but at this point it has nothing at all in it.
  3. It runs git remote add origin url, where the URL is the one you gave on the command line. This creates one remote, using the usual standard name, origin. A remote is a short name under which Git stores a URL, so that Git can repeatedly git fetch from that URL for instance. We saw above that you'll want to git fetch reasonably often, so that's what this is about.
  4. It runs any other configuration operations you specified with your git clone command (usually, none).
  5. It runs git fetch origin. This is how you get all the commits the other repository has. Note that, as we'll describe in more detail later, you get one remote-tracking name for each of their branch names. You still have no branch names as yet.
  6. Last, using the -b option you gave on the command line, your Git software creates one branch name in your repository. If you didn't use a -b option, your Git software asks their Git software what name they recommend. Whatever name gets used here, that's the branch name your Git creates at this point, and that particular branch name specifies one particular commit and that's the commit you get checked out at this point.

So your new repository has all the same commits as their Git repository, but until step 6, your new repository has no branches (no branch names—this gets into the problem that Git badly abuses the word "branch", to the point where it loses all its meaning).
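The steps above can be sketched as a toy model in Python. This is not how Git is implemented: a "repository" here is just a dict, the commit letters are stand-ins for hash IDs, the URL is a made-up example, and the "recommended" key is a hypothetical stand-in for the name their Git suggests. Step 1 (making the folder) and step 4 (extra configuration) are left out.

```python
def init():
    # Step 2 (git init): a repository with no commits and no names at all.
    return {"commits": {}, "remotes": {}, "remote_tracking": {},
            "branches": {}, "HEAD": None}

def clone(their_repo, url, branch=None):
    repo = init()
    repo["remotes"]["origin"] = url                     # step 3: remember the URL
    repo["commits"].update(their_repo["commits"])       # step 5: copy their commits
    for name, tip in their_repo["branches"].items():    # step 5: record their tips
        repo["remote_tracking"]["origin/" + name] = tip
    # Step 6: create exactly one branch name, from -b or from their suggestion.
    name = branch or their_repo["recommended"]
    repo["branches"][name] = repo["remote_tracking"]["origin/" + name]
    repo["HEAD"] = name
    return repo

theirs = {
    "commits": {"G": None, "H": "G", "X": "G"},   # commit -> parent
    "branches": {"main": "H", "dev": "X"},
    "recommended": "main",
}
mine = clone(theirs, "https://example.com/repo.git")
print(mine["branches"])          # {'main': 'H'}
print(mine["remote_tracking"])   # {'origin/main': 'H', 'origin/dev': 'X'}
```

Note how the new repository ends up with every commit and one remote-tracking name per branch of theirs, but only a single branch name of its own.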

It is in fact possible to use Git without using branch names at all. It's just that doing so would make you miserable and/or extremely frustrated. This is not a good idea! So step 6 creates one name so that you can use Git while being slightly less miserable / frustrated. (Use of Git tends to create a lot of misery and/or frustration, at least initially.)

Commits

To use Git, you need to understand commits; to be less-frustrated, you need to understand how Git uses branch names. These tie into each other, but let's start with just the commits.

We already know that every commit is numbered, with a unique but random-looking number. These are expressed in hexadecimal, e.g., 9bf691b78cf906751e65d65ba0c6ffdcd9a5a12c. Git calls this a hash ID, or more formally, an object ID or OID. Commits are one specific type of object, each of which gets one of these IDs.[1] Git stores all its objects in a big key-value database, and therefore needs the hash ID of a commit, so that it can fish that commit out of its database.[2]

You might think, then, that you have to memorize hash IDs. Quick, what was the one in the last paragraph? It started with 9bf, what was the rest of it? Did you go back and look? What happens if you try to type it in? That's insane, right? You could grab it with the mouse, command-C or control-C to copy it, and paste it later (that's how I did it), but trying to memorize these things is just crazy. Luckily, you don't have to do it. We'll see how in a moment.
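If you're curious where these numbers come from, here is a small Python sketch. It assumes a repository using Git's default SHA-1 object format, where an object ID is the SHA-1 of a short header (type, size, NUL byte) followed by the object's content. The function name git_object_id is mine, not Git's.

```python
import hashlib

def git_object_id(obj_type: str, body: bytes) -> str:
    # Header is "<type> <size in bytes>" followed by a NUL byte.
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

# These match what `git hash-object` reports for the same content.
print(git_object_id("blob", b""))         # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_object_id("blob", b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

The ID depends only on the content, which is why two repositories that hold the same object always agree on its number without talking to each other.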

Now, inside each commit, there are two parts: there's some metadata, which is information about the commit itself, and there's a full snapshot of every file. The metadata include things like your name and email address (assuming you made this particular commit). The data—the full snapshot—is stored indirectly and uses a clever compression scheme, in which the files that make up the commit are stored compressed—sometimes highly so—and, crucially, with their content de-duplicated. What this means is that if you make six commits in a row, and change just one file each time, all but the one file get re-used in each of the commits. Only a changed file has to be stored again, and even then, if one of your changes is to put the file back the way it was, the updated "went back to how it was" file takes no space, because it's already stored.

What this means is that every commit stores every file, for all time (or more precisely, for as long as that commit continues to exist). But because files are de-duplicated (and everything gets compressed), commits are usually tiny.[3] Still, they act like permanent archives of every file, saved in the form it had when you (or whoever) made that commit.
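Here is a rough Python sketch of the de-duplication idea. It is a simplification: real Git also compresses objects and packs them, and hashes a header along with the content. The file names and contents are invented for the example.

```python
import hashlib

store = {}  # content-addressed storage: hash -> file content

def save_snapshot(files):
    """Record every file, but store each distinct content only once."""
    snapshot = {}
    for path, content in files.items():
        key = hashlib.sha1(content).hexdigest()
        store.setdefault(key, content)   # already-stored content costs nothing
        snapshot[path] = key
    return snapshot

files = {"README": b"docs\n", "main.py": b"print('v0')\n", "util.py": b"pass\n"}
snapshots = [save_snapshot(files)]
for version in range(1, 6):              # five more commits, one file changed each
    files["main.py"] = f"print('v{version}')\n".encode()
    snapshots.append(save_snapshot(files))

files["main.py"] = b"print('v0')\n"      # put the file back the way it was
snapshots.append(save_snapshot(files))

print(len(snapshots))  # 7
print(len(store))      # 8
```

Seven snapshots list three files each, yet only eight distinct contents are stored, and the "went back to how it was" snapshot added nothing new.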

Meanwhile, inside the metadata for any one given commit, Git stores the raw hash ID of a list of previous commits. Most commits—the ones Git calls "ordinary commits"—have exactly one previous-commit hash ID in this list. Commits that list two (or more, but "more" is mostly for showing off) previous commit hash IDs here are merge commits. And, in any non-empty repository, there's at least one—and usually only one—commit that has an empty list of previous commit hash IDs: that's the very first commit ever, which Git calls a root commit.

Most commits are ordinary commits, with one previous commit listed. We can draw these commits pretty easily. Imagine that we have the most recent commit, whose hash ID is some big ugly random-looking number, but we'll call it H for Hash. We will draw it like this:

            <-H

That arrow sticking out of commit H represents the stored hash ID. We say that this stored hash ID makes H point to the earlier commit. Git calls this earlier commit the parent commit of H. Let's draw it in, calling it G since G comes before H:

        <-G <-H

Hey, look at that: G is an ordinary commit too! It points back to some still-earlier commit. Let's draw that one in as F:

... <-F <-G <-H

This goes on forever, or rather, all the way back to that very-first-commit-ever, which presumably is commit A in our drawing, so that the full chain of commits in this repository is:

A--B--...--G--H

(where I've gotten lazy about drawing the arrows in the correct, i.e., backwards, direction, and just made lines to connect commits; we just have to remember that commits only link backwards).[4]

Well, so what? Actually there's a very important so what here: What this means is that to find every commit, we only need to memorize the last commit's hash ID. Since H is (currently) the last commit, we just write down H's hash ID somewhere, so that we can have Git find H quickly. Git can find all the earlier commits by working backwards, one hop at a time, from H.
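The backwards walk can be sketched in a few lines of Python. This toy version assumes only ordinary commits (one parent each) and uses the single letters from the drawing in place of real hash IDs.

```python
# Each commit remembers only its parent; the root commit A has none.
letters = "ABCDEFGH"
parents = {child: parent for parent, child in zip([None] + list(letters), letters)}

def history(tip):
    """Yield commits from the tip back to the root, one hop at a time."""
    commit = tip
    while commit is not None:
        yield commit
        commit = parents[commit]

print(list(history("H")))  # ['H', 'G', 'F', 'E', 'D', 'C', 'B', 'A']
```

Knowing just H is enough to reach everything; knowing just A would get us nowhere, because A has no link forwards.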


[1] The other three types of objects are blob, tree, and annotated tag. You will not normally use these directly so you don't need to remember this; you just need to remember that hash IDs or OIDs let you find specific commits, and git log will spill out hash IDs, or sometimes abbreviated versions of hash IDs (git log --oneline for instance). The commit hash IDs are the "true names" of the commits.

[2] Git has various maintenance commands, which ideally you'll never have to use, that can crawl through the entire database, but in a big repository this can take many minutes. You definitely don't want to be doing that all the time!

[3] At least, they're tiny if you use Git the way it's intended to be used. Binary files tend to be a poor fit; pre-compressed binary files, like video files or large JPG images, are right out. Avoid putting these into Git repositories if that's possible. (Small JPG images, a few hundred kilobytes or a few megabytes or whatever, are not really much of a problem, especially with multi-terabyte disk drives now, but a small laptop SSD with only 250 or 500 GB available might become problematic.)

[4] The reason for this backwards linkage is straightforward: all Git objects are immutable. Git needs this property to make the numbering system work. After all, let's say you got some commit like H from some other Git repository. Your Git software and their Git software agree that H is the right number for this commit. If they change the content of the commit, suddenly your Git and their Git have different commits with the same number, and that's not allowed. The commit hashing trick, that makes Git work, therefore forbids this.

When we—or whoever—made commit G, we knew what F's hash ID was, because F existed. But H doesn't exist yet, and H's future hash ID depends on everything about H, including the hash ID of G and the exact date-and-time-stamp for when we, or whoever, make H, and we don't know either of these in advance. So we can't put H's future hash ID into G. We can put F's past hash ID into G because F exists (and is immutable!). So G can point back to F. Later, once we make H and find out its hash ID, H can point back to G, which exists by then and we know its hash by then.

So the arrows have to point backwards. Children remember their parents' names, but the parents are necessarily born before the children are born, and the children's names don't exist yet and the parents are frozen for all time as soon as the parents are born.
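To make this footnote a bit more concrete, here is a rough Python sketch of a commit's ID being computed from its own text, parent hash and time stamp included. It assumes the SHA-1 object format and only approximates the layout of a real commit object; the identity and time stamps are made up, and the tree ID is the well-known ID of the empty tree.

```python
import hashlib

def commit_id(tree, parent, who, timestamp, message):
    lines = [f"tree {tree}"]
    if parent is not None:               # a root commit has no parent line
        lines.append(f"parent {parent}")
    lines.append(f"author {who} {timestamp} +0000")
    lines.append(f"committer {who} {timestamp} +0000")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    header = f"commit {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"   # the empty tree
who = "A U Thor <author@example.com>"               # made-up identity
f = commit_id(tree, None, who, 1662410000, "first")
g = commit_id(tree, f, who, 1662410060, "second")         # needs f's ID to exist
g_later = commit_id(tree, f, who, 1662410061, "second")   # one second later

print(g != g_later)  # True
```

Since g's ID comes out of hashing text that already contains f's ID, there is no way to go back and write g's ID into f afterwards.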


Branch names find commits

This brings us to (our) branch names. Let's say we have some latest commit H. We assign some branch name to hold that commit's hash ID. Rather than writing down the commit hash ID on paper, or on a whiteboard, or even in some file, we have Git write it down in a secondary key-value database, indexed by names like branch names.

Let's pick the branch name main. Technically, internally within Git, this is a full name, refs/heads/main. It stores one hash ID (in the names database) and that's commit H, the latest commit:

...--G--H   <-- main

To make a new commit, we:

  • have Git save a snapshot of all files;
  • have Git add some metadata, including our name and email address and the current date-and-time and a log message and so on, and—important for Git—the actual hash ID of commit H, which Git can find in the name main;
  • write all that out and obtain a new unique hash ID, which we'll call I.

New commit I will point backwards to H. Since I is now the last commit, our final step of git commit will be to have Git write I's hash ID into the name main, so that main points to I instead of H:

...--G--H--I   <-- main

And that's really all there is to it—except for the fact that the first step ("have Git save a snapshot of all files") is actually quite tricky. We won't cover that here though.

More than one branch name

Suppose that instead of just making a new commit I right away, we start by first creating a new branch name. That is, we have:

...--G--H   <-- main

We're using commit H, with Git having stored H's hash ID in the name main.

We now create a new branch name like feature or topic or develop or whatever; in the drawings here we'll call it test. Since a branch name must store a (single) commit hash ID, we have to pick one of the existing commits (any one will do) to put in the new branch name. Which one should we pick? Well, we probably want to start with the latest, which is H, so let's pick H:

...--G--H   <-- main, test

Now we have two names that both point to H.

One of the interesting things about Git is that now, all the commits are on both branches. They were on just one branch, but now they're on two branches. Nothing has changed about any of the commits except that somehow now they're on two branches instead of one.

In fact, if we were to delete both names—this is a little tricky to do, but it is possible—the commits would now be on no branches. They'd still exist! We can't change commits, and we can't even remove commits. But with no name to find the last one, we'd have to memorize the hash ID. (This is an insane way to try to use Git and we won't actually do that.)

This kind of thing is what I mean when I say that branch names don't matter in Git. The only things that actually matter are the commits. We use branch names to help us find the commits. That's important, for sure, but it's not actually crucial. But we can add and delete names whenever we like, as long as we are careful to make sure we have at least one name by which we can find the commits—well, if we still want to find them.

Sometimes, we might make some commits, then decide they are terrible. We can't actually delete them, but we can drop all the names that find them, and then we won't see them any more. They may, or may not, eventually get removed entirely. They generally won't get copied into other repositories, because only the findable commits get copied. We can just think of them as deleted, as long as we remember that anyone who has the raw hash ID might still be able to get them back.

Anyway, now that we have two names for commit H, we need a way to tell Git which name we're using. To do that, we have Git attach the special (to Git) name HEAD to exactly one branch name, like this:

...--G--H   <-- main (HEAD), test

This means we're using commit H through the name main.

If we now run:

git switch test

to switch to branch test—or git checkout test, which is an older command that does the same thing in this case—then we get:

...--G--H   <-- main, test (HEAD)

which means we're using commit H through the name test now.

If we do this before we make commit I, then commit I will look like this:

...--G--H   <-- main
         \
          I   <-- test (HEAD)

New commit I still points back to existing commit H, just as before. What's different is Git updates the name test, because that's the branch name we're using.

In other words, when Git makes a new commit, Git stores the new commit's hash ID into the current branch name. That's how we keep track of our latest commits.
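As a toy Python model of that rule (the class and method names are mine, and the letters stand in for hash IDs):

```python
class Repo:
    def __init__(self):
        self.parents = {"G": "F", "H": "G"}          # commit -> parent
        self.branches = {"main": "H", "test": "H"}   # name -> latest commit
        self.head = "main"                           # HEAD is attached to a name

    def switch(self, name):
        if name not in self.branches:
            raise KeyError(f"no branch named {name}")
        self.head = name

    def commit(self, new_id):
        # The new commit points back at whatever the current name holds...
        self.parents[new_id] = self.branches[self.head]
        # ...and then the current name, and only that name, moves forward.
        self.branches[self.head] = new_id

repo = Repo()
repo.switch("test")
repo.commit("I")
print(repo.branches)      # {'main': 'H', 'test': 'I'}
print(repo.parents["I"])  # H
```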

Note that name main still points to commit H and commits up through H are still on both branches. New commit I is, however, only on branch test at this time. The latest main commit is commit H and the latest test commit is commit I.

If we now git switch main, Git will remove the files that go with commit I (they're safely stored forever in commit I) and put in place the files that go with commit H instead. So commits not only store our files forever, they also let us revisit our files.

Moreover, suppose we create two branches br1 and br2 while we're on main at commit H:

...--G--H   <-- br1, br2, main (HEAD)

and then switch to br1 and create a few commits:

          I--J   <-- br1 (HEAD)
         /
...--G--H   <-- br2, main

Then, we switch to br2—which swaps out all our files for the commit-H ones—and make two other commits:

          I--J   <-- br1
         /
...--G--H   <-- main
         \
          K--L   <-- br2 (HEAD)

The permanent snapshots in commits I-J hold any changes we make while we do some work "on" branch br1. The permanent snapshots in commits K-L hold any changes we make while we do work on branch br2. The name main continues to point to the latest main commit.
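What it means for a commit to be "on" a branch can be sketched as a reachability check. This Python toy uses the letters from the drawing and, as a simplification, treats G as if it were the first commit; parents are lists so that merge commits would fit too.

```python
parents = {"G": [], "H": ["G"], "I": ["H"], "J": ["I"], "K": ["H"], "L": ["K"]}
branches = {"main": "H", "br1": "J", "br2": "L"}

def reachable(tip):
    """All commits found by starting at tip and following parent links."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen

def branches_containing(commit):
    return sorted(name for name, tip in branches.items()
                  if commit in reachable(tip))

print(branches_containing("H"))   # ['br1', 'br2', 'main']
print(branches_containing("I"))   # ['br1']
print(sorted(reachable(branches["br2"]) - reachable(branches["main"])))  # ['K', 'L']
```

Commit H is on all three branches at once, while K and L are the commits that br2 has and main lacks.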

Remote-tracking names

Now, while we're doing all this work—two different features, for instance, for branches br1 and br2—it's possible that someone else has been doing work directly on (their) main.

If they send their commits to some shared server repository on GitHub, for instance, they may add commits on to the GitHub repository. So the GitHub repository, which used to have:

...--G--H   <-- main

now has:

...--G--H--N   <-- main

If you now run git fetch origin, your Git will call up some other Git software on GitHub and discover this new commit N and bring it over to your own Git repository:

          I--J   <-- br1
         /
...--G--H__  <-- main
         \ --N   <-- ???
          K--L   <-- br2 (HEAD)

Note that your Git software needs a name by which to find new commit N. The name your Git software uses here is origin/main, which your Git builds by taking their name, main, and sticking origin/ in front.[5] So we'll fix up our drawing like this:

          I--J   <-- br1
         /
...--G--H   <-- main
         \__
          \ `--N   <-- origin/main
           \
            K--L   <-- br2 (HEAD)

Note that earlier, you had an origin/main pointing to H, because git clone took their main—which at that time selected commit H—and made your own origin/main to select the same commit.

In fact, this is how you got your main originally. The git clone command turned their main into your origin/main, selecting commit H. You had no branch names at all and then your git clone created one, namely main, from your origin/main which remembers their (GitHub's) main. This is kind of a long way around, but it shows how your branch names are separate from theirs.
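Here is a toy Python model of what git fetch origin did in the drawings above. It is only a sketch: repositories are plain dicts, the letters stand in for hash IDs, and the real protocol negotiates which commits to send rather than comparing whole sets.

```python
def fetch(local, remote, remote_name="origin"):
    """Copy commits we lack, then update our remote-tracking names."""
    new_commits = remote["commits"] - local["commits"]
    local["commits"] |= new_commits                 # we keep everything we had
    for branch, tip in remote["branches"].items():
        # Their branch name "main" becomes our "origin/main".
        local["remote_tracking"][f"{remote_name}/{branch}"] = tip
    return new_commits

theirs = {"commits": {"G", "H", "N"}, "branches": {"main": "N"}}
ours = {
    "commits": {"G", "H", "I", "J", "K", "L"},
    "branches": {"main": "H", "br1": "J", "br2": "L"},
    "remote_tracking": {"origin/main": "H"},        # left over from git clone
}

fetched = fetch(ours, theirs)
print(sorted(fetched))                # ['N']
print(ours["remote_tracking"])        # {'origin/main': 'N'}
print(ours["branches"]["main"])       # H
```

Our own main, br1, and br2 did not move: fetch touches only the commit store and the origin/ names.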


[5] This name is technically in a different namespace, under refs/remotes/origin/main. However, git branch -r strips off the refs/remotes/ part when displaying these, just as git branch strips off refs/heads/ when displaying your own branch names. Oddly, git branch -a strips off refs/heads/ as usual, but only refs/ from the remote-tracking names, so that you'll see remotes/origin/main instead of just origin/main. There's no apparent reason for this: it's just a historical oddity.
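The shortening rules in this footnote can be written out as a small Python function. This only mirrors the behavior described here; the function name and the all_branches parameter are my own, not anything from Git's source.

```python
def display_name(ref, all_branches=False):
    """Shorten a full ref name the way the footnote describes."""
    if ref.startswith("refs/heads/"):
        return ref[len("refs/heads/"):]
    if ref.startswith("refs/remotes/"):
        # With -a, only "refs/" is stripped from remote-tracking names.
        prefix = "refs/" if all_branches else "refs/remotes/"
        return ref[len(prefix):]
    return ref

print(display_name("refs/heads/main"))                              # main
print(display_name("refs/remotes/origin/main"))                     # origin/main
print(display_name("refs/remotes/origin/main", all_branches=True))  # remotes/origin/main
```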


Merging and fast-forwarding and git pull

I'm only going to touch on this topic briefly—there's a lot more to know here. But once you get an updated origin/main in your repository, you may want to update your own main to match.

To do this, you'll generally use git merge, perhaps git merge --ff-only.

The merge command in Git is quite large and complicated. It does a lot of things. But sometimes, it has a very easy job: sometimes there's literally nothing to merge. In this case—if Git can, and if you don't stop it on purpose—Git will "cheat". That is, suppose we didn't bother making any new branches or commits yet, and we have:

...--G--H   <-- main (HEAD), origin/main

We now run git fetch origin and discover that they made a couple of new commits:

...--G--H   <-- main (HEAD)
         \
          I--J   <-- origin/main

(The git fetch command tells us this, but git log --all --decorate --oneline --graph, or "git log with a dog", is actually really useful here; see Pretty Git branch graphs.) We'd probably like, at this point, to just move up to their latest and greatest main commit.

To do this, we run:

git merge --ff-only origin/main

or even just:

git merge

(which assumes that we haven't disabled fast-forwarding and that our main has origin/main set as its upstream, both of which are the defaults at this point). Git discovers that it can in fact make our name main point to commit J, their latest origin/main commit, and check out that commit, without doing any other work, so it does that:

...--G--H--I--J   <-- main (HEAD), origin/main

We're now using commit J, through our name main, which once again matches our origin/main: both our branch name main and our remote-tracking name origin/main select commit J now, and we now have all the files extracted from commit J, ready to be used.
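The "cheat" boils down to one check and one name update, which this toy Python sketch tries to show. It handles only single-parent history, the letters stand in for hash IDs, and the error text is my own wording rather than Git's.

```python
parents = {"G": None, "H": "G", "I": "H", "J": "I"}
names = {"main": "H", "origin/main": "J"}

def is_ancestor(older, newer):
    """True if walking back from newer eventually reaches older."""
    commit = newer
    while commit is not None:
        if commit == older:
            return True
        commit = parents[commit]
    return False

def merge_ff_only(branch, upstream):
    # Fast-forward is possible only when our commit is already in their history.
    if not is_ancestor(names[branch], names[upstream]):
        raise RuntimeError("cannot fast-forward: the histories have diverged")
    names[branch] = names[upstream]     # no new commit, the name just slides

merge_ff_only("main", "origin/main")
print(names)  # {'main': 'J', 'origin/main': 'J'}
```

If we had made our own commit on main before fetching, the check would fail and a real merge (or some other choice) would be needed instead.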

Note that the two commands we ran here were:

  1. git fetch (or git fetch origin, but we can leave out the origin part since that's the standard name anyway), then
  2. git merge (with no arguments, which uses the defaults).

Many tutorials will introduce you to the git pull command. This command simply runs git fetch followed by a second Git command of your choice. If you don't set it up otherwise, that second command defaults to git merge. So git pull literally means git fetch and then git merge—and if you're 100% sure that you intend to run those two commands, in that order, without stopping between them to check things first, you can use git pull.

Personally, I avoid git pull. A lot of this is historical: back in 2006, git pull had some nasty bugs in it, and could destroy all your work, and I had this happen to me at least twice. Those bugs are long fixed. But some of it isn't: I like to run git fetch and see what git fetch did. That "see what git fetch did" part often includes running git log and/or git diff and/or other Git commands. And then, after I've seen what came in, I may choose not to merge after all. Doing this requires avoiding git pull. Since I'm already preconditioned to avoid it (due to PTSD from early Git bugs perhaps), that's what I usually do.

In teaching Git to other newbies, I've found that avoiding git pull, at least initially, helps a lot: they start understanding how commits work and what the individual commands do a lot faster. So I recommend avoiding git pull. But it's your choice.

torek