There is nothing built in to Git for this, so you will have to write code.
There's an enormous problem with attempting to do this for any particular file right after running git clone, but you added this remark:
total history will be better, but I can settle with only mine and one branch. I want to be able to lookup total history without having to interact with Git.
in which case there's an obvious path forward. I will outline one idea for you, but you will have to write the code. If you know a lot about Git, jump down to the bottom section about using the post-commit hook. If not, read through the rest first. You'll learn a lot about Git by writing the post-commit hook, but you will probably need the other sections too.
First, keep in mind what untracked files are
If you are going to use Git at all, Git forces you to learn about its three parts:
- The work-tree. This is pretty simple: it's where you do your work. Files in the work-tree are stored in the usual form, where you can see them and work with them.
- The index, which has two other names because it's so important in Git: it's also called the staging area and sometimes the cache. Files in the index are in the special Git-only format. The key here is that you can replace files that are in the index, so they're write-able.
- Commits. Commits are permanent, read-only, and incorruptible.1 Commits in Git are the history: there's no such thing as "file history"; each commit is a complete snapshot, with its contents independent of every other snapshot. Git makes new snapshots by saving (committing) the contents of the index.
An untracked file is one that is not in the index. This is a rare case of Git being simple and clear. :-) If you have a file in the work-tree that's not in the index, it's untracked. All your landing.html.suffix files will be untracked.
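If you want to check this from a script, here is a minimal sketch, assuming it runs inside the work-tree. The helper name is_untracked is made up for illustration; git ls-files --error-unmatch is the standard way to ask whether a path is in the index.

```shell
# is_untracked PATH: succeed if PATH exists in the work-tree but is not in the index.
# (Hypothetical helper name; only the git command itself is standard.)
is_untracked() {
    # The file has to exist in the work-tree to be "untracked" at all.
    [ -e "$1" ] || return 1
    # git ls-files --error-unmatch exits nonzero when the path is not in the index.
    ! git ls-files --error-unmatch -- "$1" >/dev/null 2>&1
}
```

Note that ignored files count as untracked here too, since they are also not in the index.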
1The permanence of commits depends on their reachability. As noted in the section below on commits, Git finds commits by starting from a branch name (or any other name that identifies a commit). Those commits identify their parents, by their hash IDs, so the parents are reachable from the branch tips. The parents identify yet more parents, so those are also reachable. Git will, rarely (because it takes a long time), compute the transitive closure over the set of reachable commits—really, reachable objects—and compare this to the entire contents of the object database. Unreachable objects may, depending on additional criteria, be garbage-collected (discarded) at this point.
The incorruptibility depends on the fact that they are read-only and hashed. If something somehow changes inside an object, it will cease to match its (cryptographic) hash ID, and Git will know it is damaged.
Some notes about commits
(None of this is directly relevant but it's useful to keep it all in mind.)
Commits, like all of Git's internal objects, are identified (named) by their hash ID. The hash ID of an object, including each commit, is a cryptographic checksum of its contents. The actual contents of each commit is pretty small, because the stored snapshot is done through a separate Git object called a tree: Git turns the index into a tree, then saves the tree's hash ID, plus your commit metadata (your name and email address, some time stamps, your log message, and the commit's parent hash ID) as the commit object.
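You can see the "hash ID is a checksum of the contents" claim for yourself. The sketch below assumes you are inside a repository with at least one commit, and the function name commit_hash_matches is hypothetical. It re-hashes the raw text of a commit and compares the result with the commit's name. (Running git cat-file -p HEAD by hand shows the same text in readable form.)

```shell
# commit_hash_matches REV: succeed if re-hashing the raw contents of the
# commit REV reproduces that commit's hash ID. (Hypothetical helper name.)
commit_hash_matches() {
    # Resolve REV to a full commit hash ID.
    expected=$(git rev-parse --verify "$1^{commit}") || return 1
    # Dump the commit's raw text and hash it as a commit object, without storing it.
    actual=$(git cat-file commit "$expected" | git hash-object -t commit --stdin) || return 1
    [ "$expected" = "$actual" ]
}
```

For example, commit_hash_matches HEAD should succeed in any healthy repository.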
Branches, and thus the history in a repository, exist because commits store parent IDs. A branch name like master simply holds one (1) commit hash ID. Git calls this the tip commit, and it is by definition the last commit on the branch, i.e., the newest. To find a history, Git looks at the tip commit's parent commit, which is the second-to-last. Then Git looks at the parent's parent, which is the third-to-last; and so on. The resulting chain-of-commits is thus the branch, as found by the branch name, which identifies only the tip-most commit:
            D--E   <-- master
           /
    A--B--C
           \
            F--G   <-- develop
Commits A through E are all on branch master, and commits A through C plus F and G are all on branch develop. Note that some commits are on more than one branch. The history stored in the repository is simply the sum of all the commits stored in the repository. Note that the names, master and develop here, identify only one commit each.
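To make "a branch is whatever you reach from its tip" concrete, here is a small sketch using git rev-list, which does exactly this parent-following walk. The function names commits_on and commits_on_both are hypothetical, and the branch names you pass in are whatever exists in your repository.

```shell
# commits_on NAME: list, oldest first, every commit reachable from the
# tip commit that NAME identifies. (Hypothetical helper name.)
commits_on() {
    git rev-list --reverse "$1"
}

# commits_on_both NAME1 NAME2: list the commits reachable from both tips,
# i.e., everything reachable from their merge base(s). Prints nothing if
# the two histories are unrelated. (Hypothetical helper name.)
commits_on_both() {
    bases=$(git merge-base --all "$1" "$2") || return 0
    # $bases is deliberately unquoted: there may be more than one merge base.
    git rev-list --reverse $bases
}
```

With the drawing above, commits_on master would print five hash IDs (A through E) and commits_on_both master develop would print three (A through C).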
You could, if you wanted, make a repository with a single linear branch in which every commit is completely unrelated to the previous commit. More usefully (but still deliberately perverse), you could make a repository where every other commit has a different project in it, so that if you check out the first commit, you get Project A's initial attempt. If you check out the second commit, you get Project B's initial attempt. The third commit is the second commit of A; the fourth commit is the second commit of B; and so on. In other words, an even-numbered commit N is Project B, commit N/2; an odd-numbered commit N is Project A, commit (N+1)/2.
The key point here is that commits are not change-sets. If the same file appears many times in a row in many commits in a row, each commit has its own independent copy of that file. It's true that somewhere, deep down in Git's underbelly, they all share a single "true copy" of the file, but that is purely a storage detail (for identical objects this turns out to be really easy for Git to do; for slight variations, Git has to put the objects into what it calls a pack file to delta-compress them).
What this really means is that in order to talk about things that have happened to a file, or to some set of files, you must pick some commits to compare, one pair of commits at a time. The obvious thing to do is to compare each parent/child pair. This works as long as the commits are linear:
... G--H--I--J <-- develop
Here, the G-H pair, the H-I pair, and the I-J pair make for useful comparisons. But suppose this is part of:
            D--E
           /    \
    A--B--C      M   <-- master
           \    /
            F--G--H--I--J   <-- develop
where commit M is a merge commit on master, where someone merged develop into master at that point. Commit M has two parents, not just one: will you compare M to E, or to G? Meanwhile, the branches forked apart at C, so C has (at the moment; we could add more any time!) two children. Will you compare C to D, or C to F? These are the really sticky parts, which you can avoid by "settl[ing] with only mine and one branch".
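One way to sidestep the merge question when you only care about one branch is to follow first parents only, which turns any branch into a simple line. This is a sketch under that assumption, not the only possible answer; the function name linear_pairs is hypothetical, while --first-parent is a standard git rev-list option.

```shell
# linear_pairs NAME: print "parent child" hash ID pairs, oldest first,
# walking only the first-parent chain from the tip of NAME.
# (Hypothetical helper name.)
linear_pairs() {
    git rev-list --first-parent --reverse "$1" | {
        prev=
        while read -r commit; do
            # Skip the very first commit: it has nothing before it to pair with.
            [ -n "$prev" ] && echo "$prev $commit"
            prev=$commit
        done
    }
}
```

For the drawing above, linear_pairs master would give A-B, B-C, C-D, D-E, and E-M, so M is compared only with E, never with G.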
Making commits
As you no doubt already know, the process of making a commit consists of doing the following steps:
- Check out some branch name: this makes its tip commit be the current commit. There are some important facts about this: in particular, how this affects the index and work-tree. We'll get back to this in a moment.
- Make changes in the work-tree. The files in the work-tree have their ordinary read/write form, so this is pretty easy.
- Run git add. What this really does is to copy the updated files from the work-tree into the index, replacing the un-edited index files.
- Run git commit. This collects your commit log message, then makes the actual commit object.
The tricky part of making the commit is turning the index into a tree object (for which there's a separate command, git write-tree, that you can run if you want to do it all manually). Once Git has the tree object, it can write out the text of the commit:
    tree <hash>
    parent <hash>
    author <name> <email> <timestamp>
    committer <name> <email> <timestamp>

    <log message>
and then turn this into a commit object (you can do this part manually too, if you like, using git hash-object -w -t commit). Creating the object creates the hash ID for the object, by computing the cryptographic checksum of the text. As long as this commit is different from every other commit (and the timestamps plus the rest of the contents ensure that it is, since the time is always increasing),2 it gets a new, different-from-every-other-commit hash ID. Note that the parent <hash> line uses the hash ID of the current commit, the one you checked out in step 1.
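If you want to try the manual route, here is a minimal sketch. It swaps in git commit-tree (a standard plumbing command that builds and stores the commit object for you) in place of composing the text by hand and feeding it to git hash-object. The function name manual_commit is hypothetical, and unlike git commit, this runs no hooks at all.

```shell
# manual_commit MESSAGE: commit the current index by hand and print the
# new commit's hash ID. (Hypothetical helper name; runs no hooks.)
manual_commit() {
    # Turn the index into a tree object.
    tree=$(git write-tree) || return 1
    if parent=$(git rev-parse --verify -q HEAD); then
        # Normal case: the current commit becomes the parent.
        new=$(git commit-tree "$tree" -p "$parent" -m "$1") || return 1
    else
        # Very first commit in the repository: no parent line.
        new=$(git commit-tree "$tree" -m "$1") || return 1
    fi
    # Write the new hash ID into the branch name that HEAD is attached to.
    git update-ref HEAD "$new" && echo "$new"
}
```

After git add, running manual_commit "some message" should leave git log looking just as if you had run git commit.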
Git then simply writes the new commit's hash ID into the branch name, so that the current branch (the one you checked out in step 1) now identifies the new commit as its tip. Last, and this is where you will be able to do what you want, git commit runs a post-commit hook.
The above can be confusing, so let's draw an example, with a simple three-commit repository becoming a four-commit repository:
A--B--C <-- master (HEAD)
The name master points to commit C. You git checkout master, make some change, git add and git commit and create new commit D. The new commit points back to C as its parent:
    A--B--C   <-- master (HEAD)
           \
            D
and then Git quickly slides the name master down-and-right, as it were, so that it points to the new commit D:
    A--B--C
           \
            D   <-- master (HEAD)
after which we generally straighten out the drawing so that it looks like a simple line again.
Note that you can run git commit --amend, which makes the new commit have the current commit's parent as its parent. That is, instead of having D point back to C, we can have D point back to B:
    A--B--C
        \
         D   <-- master (HEAD)
This makes the history go D -> B -> A, skipping C (which has become unreachable and will eventually be garbage-collected). In other words, we haven't actually changed history (C is still in there, it's just no longer in our history linkage), but it looks like we have. If you ever use git commit --amend, keep this in mind in your Git hooks later.
(Git's git rebase has a similar effect, but considerably more drastic: it copies multiple commits to new commits, abandoning the originals.)
2If, by trickery and subterfuge (or by just running git filter-branch, which uses trickery and subterfuge), you manage to make a new commit that is bit-for-bit identical to an existing commit (it has the same author and committer, the same timestamps, the same parent, the same source snapshot, and the same log message), then you will re-use the old commit's hash ID. But so what? You just made a new commit that's exactly the same as the old commit. It has the same author, was made at the same time, has the same history, and has the same log message. It is the old commit.
There's an oddball case here with making two identical commits very fast (within one second) on two different branch-name checkouts when both branch names point to the same tip commit. This causes the branch names to wind up pointing to a single, shared new commit, even though you expected them to point to two different commits, and they would have if the process had spanned a clock-tick. The result is correct, in a graph-theoretical sense, and works; but it is surprising.
Filling in blanks, or rather, filling in the index and work-tree
I mentioned that step 1 above (the git checkout branch-name step) has an important effect on the index and work-tree. Note that when Git made the new commit above, it started by writing out the index to make a tree object, using git write-tree. This means that the index must start out matching the current commit.3
The git checkout command achieves this by comparing the current (pre-checkout) commit to the target (post-checkout) commit. The current commit has some set of files, and the target commit has another set of files, presumably at least a little different. Checkout will remove, from the current index and work-tree, those files that must be removed. It will add into the current index and work-tree any files that must be added. It will replace, in the index and work-tree, any files that must be swapped out, to go from the old commit to the new one.
As a result, after git checkout, the index and work-tree will (except for untracked files, which aren't in the index at all) match the target commit, which has just become the current commit.
Note, too, that when you run git commit, this makes the new commit using the current index. The result is that once the new commit is done, the current commit and the index match again. So we get a basic (although slightly flexible, see footnote 3) truth about Git: The index normally matches the current commit, up until you start git add-ing to copy files from the work-tree.
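You can test this "index matches the current commit" condition directly. A minimal sketch, assuming the repository already has at least one commit (so that HEAD resolves); the function name index_matches_head is hypothetical, while git diff-index --cached --quiet is a standard scripting idiom.

```shell
# index_matches_head: succeed if the index holds exactly the files of the
# current commit, i.e., nothing has been git add-ed since the last commit
# or checkout. (Hypothetical helper name.)
index_matches_head() {
    # --cached compares the index (not the work-tree) against HEAD;
    # --quiet makes the exit status report whether anything differs.
    git diff-index --cached --quiet HEAD --
}
```

Editing a work-tree file without running git add does not change the answer, because the index copy is still the committed one.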
3Actually, some difference is allowed to carry over across checkouts. See Checkout another branch when there are uncommitted changes on the current branch for details.
Using a post-commit hook to get what you want
Git runs your post-commit hook right after git commit finishes successfully. This git commit has made a new commit, such as commit D in our example of turning a three-commit repository into a four-commit repository.
The new commit has a parent, such as C. Now you have a chance to compare parent to child:
git diff --name-status HEAD^ HEAD
for instance. (HEAD is the current, i.e., child, commit, and HEAD^ means look at the first parent of HEAD. Keep merge commits, which have multiple parents, in mind here: you can use HEAD^2 to look at the second parent of a merge, for instance. The githooks documentation lists post-commit as invoked by git commit and post-merge as invoked by git merge, so check whether a merge commit that git merge makes on its own triggers your post-commit hook; you may need a post-merge hook as well.) The output from git diff --name-status tells you what happened to each file that it prints; see the git diff documentation for details.4
At this time, if some file such as landing.html has changed (status M), or a new file has been created (status A), you can make a copy of the file, named using the next version number and the commit log message subject (git log -1 --pretty=format:%s HEAD). If the file hasn't changed, you get no output (git diff says nothing because there's nothing to say), so you make no copy.
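Here is one way the hook body might look. Treat it as a sketch under several assumptions, not a finished hook: the function name snapshot_changed_files and the naming scheme path.N.subject are made up for illustration, the very first commit (which has no parent) is skipped, and file names that Git would quote (containing tabs, newlines, and the like) are not handled. It uses git diff-tree -r rather than git diff, and copies the committed content rather than whatever is currently in the work-tree.

```shell
# snapshot_changed_files: for each file added (A) or modified (M) by the
# newest commit, write an untracked copy named <path>.<N>.<subject>.
# (Hypothetical function name and naming scheme.)
snapshot_changed_files() {
    # A root commit has no HEAD^ to compare against; this sketch skips it.
    git rev-parse --verify -q HEAD^ >/dev/null || return 0

    # Commit subject, with anything awkward in a file name turned into "_".
    subject=$(git log -1 --pretty=format:%s HEAD)
    safe=$(printf '%s' "$subject" | tr -c 'A-Za-z0-9._-' '_')

    # One "status<TAB>path" line per changed file, parent versus child.
    git diff-tree -r --no-renames --name-status HEAD^ HEAD |
    while IFS="$(printf '\t')" read -r status path; do
        case $status in
            A|M) ;;          # added or modified: snapshot it
            *) continue ;;   # deleted, type-changed, etc.: nothing to copy
        esac
        # Find the first version number with no existing copy.
        n=1
        while ls "$path.$n."* >/dev/null 2>&1; do n=$((n + 1)); done
        # Save the committed content under the versioned name.
        git show "HEAD:$path" > "$path.$n.$safe"
    done
}
```

To use it, you would put the function in .git/hooks/post-commit (which must be executable) and call it at the end of that script; Git runs the hook from the top level of the work-tree, which is what the relative paths here rely on.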
The result, over time, is that you will build up, in your work-tree, the untracked files that you want as your history, numbered by the order in which you make these commits. To make the numbering mean something, you can even check which branch you're on (if any: in "detached HEAD" mode, such as when you are looking at historic commits, HEAD is not attached to a branch name at all). Note that you can use git rev-parse --abbrev-ref HEAD or git symbolic-ref --short HEAD to get a branch name.5
4For scripting, you should really use git diff-tree, which is more predictable. It doesn't obey per-user configuration controls, for instance, so it behaves the same for everyone. git diff will look at your diff.renames setting, your diff.renameLimit, and so on, as well as diff-output coloring options, all of which can mess with scripting. Remember that git diff-tree needs -r to recurse into subdirectories.
5The difference between the two is that git symbolic-ref will fail (exit nonzero), and produce no standard output (but will write to stderr by default), if HEAD is detached. git rev-parse will just print HEAD for this case.
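In a hook, that difference can be put to use like this. It is a small sketch: the function name branch_label and the fallback word detached are made up for illustration, and -q is the standard git symbolic-ref option that suppresses the stderr message.

```shell
# branch_label: print the current branch name, or the word "detached"
# when HEAD is not attached to any branch. (Hypothetical helper name and
# fallback label.)
branch_label() {
    # With -q, git symbolic-ref exits nonzero silently on a detached HEAD.
    git symbolic-ref --short -q HEAD || echo detached
}
```

Your hook could then skip the snapshot step, or number the copies separately, whenever branch_label does not print the one branch you care about.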