
I have a parent directory called "Project" with two subfolders "Data" and "Writing". I have a git directory on "Project" and would like to push the "Data" folder to a private GitHub repo. I have an SSH connection to GitHub.

  1. How do I git push just the "Data" folder on to GitHub (while keeping the "Writing" folder on my local computer)?
  2. Should I create another git directory for the "Data" folder and git remote it to the GitHub repo? Or are submodules better?

I'd be grateful for the individual steps. Thank you.

SP1

2 Answers


Git does not push directories / folders; it does not even push files. What Git pushes are commits, because this is what Git contains. A Git repository is, in effect, a big database of commits. Each commit is numbered, with a big ugly hash ID that is unique to that one particular commit,¹ and your repository either has some particular commit (which it finds by its number) or it does not have that commit at all. So all Git needs to know is the number; that's how Git achieves the "distributed" part of its Distributed Version Control System existence. You connect two Gits, and they compare their numbers (just the numbers!) to figure out who has what.
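To see one of these hash IDs for yourself, here is a minimal sketch in a throwaway repository. The temporary directory, the file name, and the user name and email are placeholders I made up for illustration.

```shell
set -e
# Throwaway repository in a temp directory (all names here are placeholders).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "hello" > file.txt
git add file.txt
git commit -q -m "first commit"

# The full hash ID of the current commit (40 hex digits with the SHA-1 format).
commit_id=$(git rev-parse HEAD)
echo "$commit_id"

# Ask Git's object database what kind of object that ID names.
object_type=$(git cat-file -t "$commit_id")
echo "$object_type"   # prints "commit"
```

The hash you get will differ from anyone else's, since it depends on the commit's contents and metadata, including the timestamp.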

Of course, commits themselves do contain files. These files have full paths—e.g., Data/some/name.ext, or Writing/another/file or whatever. Note that the slashes are part of the file's name: there are no folders here. Folders are an artifact of your OS, which requires that Git break these up at the slashes, and create folders.
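As a rough illustration, you can ask Git to list the files in a commit and see that each one is stored under its full path. The repository, file contents, and identity below are made up for the example; the two path names are the ones used above.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Create the two example files; the OS needs real folders to hold them.
mkdir -p Data/some Writing/another
echo "1,2,3" > Data/some/name.ext
echo "draft" > Writing/another/file
git add .
git commit -q -m "snapshot with two files"

# List every file in the commit by its full path, slashes included.
paths=$(git ls-tree -r --name-only HEAD)
echo "$paths"
# Data/some/name.ext
# Writing/another/file
```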

When you work with files that you got through a Git repository, the files you work on / with are not the files that are in the commits in the repository! (Those files are stored in a Git-only format, read-only, compressed, and de-duplicated across all the commits. So the fact that each commit has a complete copy of each file doesn't matter, since two or more commits that share the same version of some file literally share a single copy.) Git just copies the commit version of some file—well, the entire commit snapshot of versions of files—out into your working space, for you to work on / with. Those are files, in folders, in your working tree. But they're not in Git: they were copied out of Git, and if you make a new commit, Git will make new copies of them—or share existing copies via its de-duplication process—in a new commit.
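The de-duplication can be observed directly. In this sketch (file names and contents are invented), two commits contain the same version of one file and different versions of another; Git reports the same internal object ID for the unchanged file in both commits.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

mkdir Data
echo "unchanged" > Data/stable.txt
echo "v1" > Data/changing.txt
git add .
git commit -q -m "first"

# Change only one of the two files, then commit again.
echo "v2" > Data/changing.txt
git add Data/changing.txt
git commit -q -m "second"

# Object IDs of each file as stored in the previous and current commits.
old_stable=$(git rev-parse HEAD~1:Data/stable.txt)
new_stable=$(git rev-parse HEAD:Data/stable.txt)
old_changing=$(git rev-parse HEAD~1:Data/changing.txt)
new_changing=$(git rev-parse HEAD:Data/changing.txt)

echo "stable:   $old_stable $new_stable"       # same ID twice: one shared copy
echo "changing: $old_changing $new_changing"   # two different IDs
```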

When you use git push, you choose commits to send to some other Git. The entire commit goes. If you want commits that only contain files named Data/some/name.ext, and not any files named Writing/another/file, you need to make such commits.

You can do that in one repository, or you can do that by making multiple different repositories (each of which will have its own separate working tree). It's quite tricky to do this in a single repository if that repository will also have commits that have Writing/another/file in them, but it can be done.
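Here is a minimal sketch of the multiple-repositories route, where Data gets a repository of its own. It runs in a temporary directory, and a local bare repository stands in for the private GitHub repo so that the sketch works without network access. The file names, the branch name main, and the identity are assumptions for illustration.

```shell
set -e
base=$(mktemp -d)

# Stand-in for the private GitHub repository (an assumption for this sketch).
# With real GitHub you would use your own SSH URL instead, of the form
# git@github.com:<user>/<repo>.git (placeholder).
git init -q --bare "$base/github-standin.git"

# A Project folder with the two subfolders from the question.
mkdir -p "$base/Project/Data" "$base/Project/Writing"
echo "1,2,3" > "$base/Project/Data/results.csv"
echo "draft" > "$base/Project/Writing/chapter1.txt"

# Make Data its own repository, so its commits contain only Data files.
cd "$base/Project/Data"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
git add .
git commit -q -m "Add data files"
git branch -M main

# Point this repository at the remote and send the commits.
git remote add origin "$base/github-standin.git"
git push -q -u origin main

# What the remote received: only files from Data, nothing from Writing.
pushed=$(git --git-dir="$base/github-standin.git" ls-tree -r --name-only main)
echo "$pushed"   # results.csv
```

This sketch does not cover how the new repository inside Data should coexist with the existing repository in Project; that is a separate decision.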

All of this is what leads to the organizations people actually use, when using Git. In general, one repository has one working tree; that one working tree has in it the files that came out of commits in that repository. All those files will go into future commits in that repository.² Each commit has a full and complete snapshot of every file, even if the file hasn't changed. (That's why they're de-duplicated, in the special Git-only, not-normal-files-at-all storage format inside the commits. This is also how Git can store files that literally can't be extracted on your computer, if you run Windows.³)

You mention submodules. The thing to know about submodules is: they're just another Git repository. That is, you can have a Git repository—let's call it outer—that you clone:

git clone <url-of-outer>
cd outer                  # get into the clone we just made

You can now git checkout any commit that is in the repository you created with git clone, which copied all the commits⁴ from some other Git that answers the phone call / text messages / whatever analogy you like, at the given URL.

Meanwhile, some or all of the commits in the clone you just made say, in effect: Now, in sub/, get me commit a123456 from some other Git repository. The URL for that other Git repository is in a file in each commit, named .gitmodules, so that your Git, on your system, can do:

(mkdir -p sub && cd sub && git clone <url> .)

or equivalent if needed, to create the sub clone. Once the sub/ directory exists and has the other Git repository cloned, your Git, working in outer, can run:

(cd sub && git checkout a123456)

to get that commit checked out. Note that we're now up to 4 Git repositories: one at the URL you listed, yours in outer, one at the URL listed in .gitmodules, and one in outer/sub/.
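For the creating side, here is a rough sketch of how an outer repository comes to record "in sub/, use this commit". Both repositories are throwaway local ones; the names sub-source and outer, the file names, and the identity are placeholders. The protocol.file.allow setting is there only because the stand-in URL is a local path, which newer Git versions refuse for submodules by default.

```shell
set -e
base=$(mktemp -d)

# A small repository playing the role of the one named in .gitmodules.
git init -q "$base/sub-source"
cd "$base/sub-source"
git config user.name "Example User"
git config user.email "user@example.com"
echo "library code" > lib.txt
git add lib.txt
git commit -q -m "sub: first commit"
sub_commit=$(git rev-parse HEAD)

# The outer repository, with one commit of its own.
git init -q "$base/outer"
cd "$base/outer"
git config user.name "Example User"
git config user.email "user@example.com"
echo "outer project" > README
git add README
git commit -q -m "outer: first commit"

# Clone sub-source into sub/ and record it as a submodule.
# (protocol.file.allow is only needed for the local-path stand-in URL.)
git -c protocol.file.allow=always submodule --quiet add "$base/sub-source" sub
git commit -q -m "outer: add sub as a submodule"

# .gitmodules holds the path and URL; the commit itself pins the hash ID.
cat .gitmodules
recorded=$(git ls-tree HEAD sub)
echo "$recorded"   # mode 160000, type "commit", then the pinned hash ID
```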

Submodules can work pretty well, but there are several gotchas. In short, they're more work than repositories that don't have any submodules. They used to be so painful that people referred to them as sob-modules. The situation is better now, but still not great.

The big thing here is to wrap your head around the Git model: that a repository holds commits:

  • You check out a commit by its hash ID—usually, though, using a branch name to find the hash ID—and now you have the files from the snapshot stored in the commit. Each commit has a full snapshot of every file.

  • You work with your copies of those files in your working tree.

  • You use git add to have Git update its proposed next commit, which started out matching the commit you checked out. The git status command works by comparing your proposed next commit, in Git's index aka staging-area, to what's in the current commit you checked out with git checkout or (in Git 2.23 or later) git switch.⁵ It also, separately, compares what's in the proposed next commit to what's in your working tree.

  • Eventually, you run git commit to create a new commit from the proposed next commit. That commit uses the files in the index / staging-area (which holds every file; git status only tells you about differing files to keep the output small) to make the new snapshot. It also adds various bits of extra information, such as your name and email address.⁶

  • A new commit changes which hash ID is stored in your branch names. Your branch names are yours, not someone else's.

  • Once you do have new commits, you use git push to send those commits to some other Git repository. This will add your commits to their branch directly. That is, push asks them to change their branch names.

  • Whether or not you have new commits, you use git fetch to get new commits from some other Git repository—usually, the one you cloned, called origin. This doesn't affect your branches! This only updates your remote-tracking names: your Git's memory of their branches.

  • Once you get new commits from someone else, you will probably want to incorporate them into some of your branches. This requires a second command, e.g., git merge or git rebase.

  • This two-command sequence (fetch, then merge-or-rebase) is very common, so Git wraps it up into a single git pull command. I recommend avoiding this command if you're a beginner, because there are things that go wrong. It's hard to know what happened when it ran two commands you don't really understand yet. If you know which actual command you ran, things may still go wrong, but now you know what to ask about. (But people do still like git pull, so if you do, just remember: it runs two commands.)
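The whole cycle in the list above can be sketched end to end with throwaway repositories. A local bare repository stands in for the "other Git", and the directory names mine and theirs, the file name, the branch name main, and the identities are all placeholders for illustration.

```shell
set -e
base=$(mktemp -d)

# Stand-in for the other Git repository (the one you would call origin).
git init -q --bare "$base/remote.git"
git --git-dir="$base/remote.git" symbolic-ref HEAD refs/heads/main

# Your clone: work, add, commit, push.
git clone -q "$base/remote.git" "$base/mine"
cd "$base/mine"
git config user.name "Example User"
git config user.email "user@example.com"
echo "v1" > notes.txt
git add notes.txt                 # update the proposed next commit
git commit -q -m "first commit"   # make the new commit from the index
git branch -M main
git push -q -u origin main        # ask the other Git to update its main

# Someone else clones, commits, and pushes.
git clone -q "$base/remote.git" "$base/theirs"
cd "$base/theirs"
git config user.name "Other User"
git config user.email "other@example.com"
echo "v2" > notes.txt
git add notes.txt
git commit -q -m "their change"
git push -q origin main

# Back in your clone: fetch only updates origin/main, not your main.
cd "$base/mine"
git fetch -q origin
mine_after_fetch=$(git rev-parse main)
theirs_after_fetch=$(git rev-parse origin/main)

# The second command incorporates their commit into your branch.
git merge -q --ff-only origin/main
mine_after_merge=$(git rev-parse main)
```

Running fetch and merge as two separate steps makes it visible that your own main does not move until the second command.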


¹ This ID is unique across all Git repositories. There are some unusual conditions in which one hash-ID number might be used for two different commits in two different Git repositories, but if and when that's the case, those two Git repositories can no longer talk to each other, because the commit IDs need to be unique. The chance of having an ID re-used incorrectly is 1 in 2^160 at the current time (and eventually the denominator will be even larger), which is small enough to be ignored.

² These future commits are made from something Git calls, variously, the index, the staging area, or the cache, but I won't go into details here. Just be aware of the fact that Git doesn't actually make the new commits from the working-tree files. The index is fairly well paired-up with the working tree in some ways, but you must explicitly git add files to copy them back from the working tree, into the index, before committing.

If you choose to use git commit -a, note that this just means run a version of git add before making the commit, so you're still really using git add. It's tempting to use commit -a to attempt to ignore the existence of this index / staging-area, but I find that this is unwise: Git will occasionally manage to slap you in the face with its index, and if you don't know what it is all about, you'll hate working with Git.
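One place where the index shows through: git commit -a only picks up changes to files Git already tracks. A small sketch, with made-up file names and identity:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "tracked v1" > tracked.txt
git add tracked.txt
git commit -q -m "start tracking one file"

# Modify the tracked file and create a brand-new, never-added file.
echo "tracked v2" > tracked.txt
echo "brand new" > untracked.txt

# -a stages the modified tracked file, but not the new file.
git commit -q -a -m "commit -a picks up tracked changes only"

committed=$(git ls-tree -r --name-only HEAD)
leftover=$(git status --porcelain)
echo "$committed"   # tracked.txt
echo "$leftover"    # ?? untracked.txt
```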

³ For instance, Git can store a file named aux.h, but Windows can't. If you have a commit that contains an aux.h, and check it out, you'll get a complaint about the file name. Git can still work with these commits, and even make new commits containing an updated aux.h, though it's tricky to do that: the normal way of working doesn't work, because Windows can't store an aux.h file. But Git can, because committed files aren't files: they're special Git objects, in a frozen and de-duplicated format.

⁴ This is a bit tricky, but it's also important to realize: When you clone a repository, you get all of their commits, but none of their branches.

More precisely, you get one branch name in your clone. The branch name you get is up to you: you supply it, at git clone time, with the -b option. For instance, git clone -b develop <url> has your Git create and check out a branch named develop. If you don't give a -b option, your Git asks their Git which branch name they recommend, and uses that name. The usual default is for your Git to use master, or main if they've switched their main branch name to main (e.g., newer GitHub repositories).

Meanwhile, your Git does know about their branch names. Your Git saw them all during the git clone operation. What your Git does with their branch names, though, is turn them into your remote-tracking names. These are names like origin/master (or origin/main), origin/develop, and so on. That is, we just stick the word origin/ in front of their branch names, to make your remote-tracking names. These names remember their branch names for you.

These remote-tracking names allow your Git to create your own branch names from their names. But when you do that, that branch name is now yours. Their Git can't mess with your branches because your Git keeps changing their names to your remote-tracking names. There is a lot more to know here, but this is plenty to get started with.
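A small sketch of this, using a throwaway local repository as "their" Git. The directory names, the branch names main and develop, and the identity are placeholders chosen for the example.

```shell
set -e
base=$(mktemp -d)

# "Their" repository, with two branches: main and develop.
git init -q "$base/theirs"
cd "$base/theirs"
git config user.name "Example User"
git config user.email "user@example.com"
echo "one" > file.txt
git add file.txt
git commit -q -m "first commit"
git branch -M main
git branch develop

# Your clone gets all the commits, but only one branch name of its own.
git clone -q "$base/theirs" "$base/mine"
cd "$base/mine"
local_before=$(git for-each-ref --format='%(refname:short)' refs/heads)
echo "$local_before"   # main

# Their branch names show up as your remote-tracking names.
remote_tracking=$(git branch -r)
echo "$remote_tracking"

# Creating your own develop from origin/develop; it is now your branch.
git checkout -q develop
local_after=$(git for-each-ref --format='%(refname:short)' refs/heads)
echo "$local_after"
```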

⁵ The git checkout command has two modes: a "safe" one that won't wreck unsaved work in your working tree, and an "unsafe" one that, if you ask Git to use it, will. Unfortunately some of the unsafe kinds of git checkout look exactly the same as the safe kinds. This is a bad situation, and hence in Git 2.23, the Git folks split the checkout command into git switch, which is only the "safer" part, and git restore, which runs the "unsafe" please wipe out my work, I've decided to throw it out set of operations. If you have Git 2.23 or later, it's wise to teach your fingers the new git switch habit. (I'm still working on this myself.)
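A minimal sketch of the two newer commands, assuming Git 2.23 or later. The file name, the branch name feature, and the identity are made up for the example.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "committed text" > file.txt
git add file.txt
git commit -q -m "first commit"

# The "unsafe" operation: throw away an uncommitted edit on purpose.
echo "scratch edit" > file.txt
git restore file.txt
restored=$(cat file.txt)
echo "$restored"   # committed text

# The "safer" operation: create and move to a new branch.
git switch -q -c feature
current=$(git rev-parse --abbrev-ref HEAD)
echo "$current"    # feature
```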

⁶ This extra information is the commit's metadata. You don't need to learn this technical term, but you will eventually need to understand at least some of what goes in this metadata. We'll leave that for later too though.

torek

Should I create another git directory for the "Data" folder and git remote it to the GitHub repo?
Or are submodules better?

Actually, even if you reference Data as a submodule, that still does mean Data is its own repository.

So yes:

VonC