I'm not going to edit your question since I cannot be sure that this is your intent, but it sounds like your real question comes down to this:
Suppose I have a directory full of source code. Suppose further that I believe this directory-of-source-code was created by running a series of git commands, such as:
git checkout [some hash ID]
git cherry-pick [another hash ID]
but Evil Spirits, or fluoridation, or some such, has lost the .git directory. So all I have now is this tree of code. Let's call this "my lost tree".
Meanwhile, on some other machine or in some other directory, I do have a full Git repository. I am curious as to what commit or commits (there may be more than one) would, if I ran git checkout <hash>, get me a work-tree that is identical to my lost tree.
Now, it's not clear what the point of all this is, but it is possible to do, with some caveats. The easy way to do it is to add the lost tree to the full Git repository, or, if you are concerned about precious bodily fluids :-) (see the "fluoridation" link above), to a git clone --mirror of it. (The clone is "as good as" the original, but can be thrown away after this process.)
Things to know before steps 1 and 2
In any normal Git repository, there are three things of interest while you are working on making a new commit:
- the current commit, which is your HEAD;
- the index, which is where you build the new commit; and
- the work-tree, which is where you keep files in a form the rest of the computer can deal with, as the files stored in the current commit and in the index are in a form only Git can deal with.
As I mentioned in a comment, the repository database itself has four object types: blob (a file); tag (an annotated tag: a human-readable name for some commit object, plus some other metadata); commit (metadata including a log message, an author and committer, and the hash IDs of the commit's parent commit(s); it's these parent IDs that produce the history, when commits are viewed according to parent/child relationships); and tree. A tree object contains the names and hash IDs of blobs and of additional sub-trees, and hence could represent a tree of files identical to your lost tree. Each commit has, as part of its metadata, the hash ID of the commit's stored snapshot, i.e., the tree. So if your lost tree is in fact in the repository, its hash ID is stored in some commit(s).
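To see how a commit points at its tree, here is a small sketch in a throwaway repository. The file names, the commit message, and the Example identity are made up for illustration; only the git commands themselves matter.

```shell
#!/bin/sh
# Throwaway repository so the example touches nothing real.
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo" || exit 1

# Hypothetical content: one top-level file and one sub-directory.
mkdir sub
echo hello > top.txt
echo world > sub/inner.txt
git add .
git -c user.name=Example -c user.email=example@example.com \
    commit --quiet -m "first commit"

# The commit object's metadata: its first line is "tree <hash ID>".
git cat-file -p HEAD

# The same tree hash ID, extracted directly from the commit.
tree=$(git rev-parse 'HEAD^{tree}')
echo "tree: $tree"

# The top-level tree lists blobs and sub-trees by name and hash ID.
git cat-file -p "$tree"
```

The listing printed by the last command should show top.txt as a blob and sub as a tree, which is the name-plus-hash-ID structure described above.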
We will make use of the fact that, once some object is in the repository, any attempt to put a bit-for-bit-identical new object into the repository simply re-uses the existing object. Making one grand assumption,[1] this re-use is fundamentally safe: if the new object is bit-for-bit identical to the old object, why would you care which object (new or old) Git pulls out later when you ask Git to retrieve the object by its hash ID?
[1] The assumption is that no two different objects produce the same hash ID, ever. The pigeonhole principle tells us that this assumption is false in theory, but in practice, it's actually true. It is possible, but currently very expensive, to break the assumption deliberately. A longer hash ID can (again, in theory at least) be less breakable, although cryptography is always getting weirder. :-)
In fact, if two different Git objects do produce the same hash, Git breaks ... well, sort of; it breaks, or should break, in a "safe" manner. The example PDFs that break SHA-1 do not actually break Git, though. Other files could—but meanwhile, some minor coding glitches in existing versions of Git apparently cause it to fail to alert the user to the fact that it will fail to store some new version in your repository.
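The re-use of identical objects is easy to observe. This is only a sketch in a scratch repository, with made-up file contents: writing the same bytes twice should give the same hash ID and leave a single stored object.

```shell
#!/bin/sh
# Scratch repository; nothing here depends on an existing project.
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo" || exit 1

# Store the same content twice, and some different content once.
first=$(printf 'same bytes\n' | git hash-object -w --stdin)
second=$(printf 'same bytes\n' | git hash-object -w --stdin)
other=$(printf 'different bytes\n' | git hash-object -w --stdin)

echo "first:  $first"
echo "second: $second"
echo "other:  $other"

# Count the loose objects now in the database: the identical
# content was stored only once, so there are two, not three.
git count-objects
```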
Step 1: Finding the hash ID of your lost tree
The work-tree corresponds to your lost tree, but Git won't make a tree object out of a work-tree. Git will only make a tree object out of the index. This means that in order to find the hash ID of your lost tree, you must copy it into the index.
The index already has stuff in it, so your first step is to remove all of it. In the top level of your Git repository, tell Git to remove everything, both from the work-tree and from the index:
git rm -rf .
You should now have an empty work-tree ... unless (this is one of the caveats) there are untracked files.[2] If there are some untracked files, you will have to find out (or guess) whether those are also in your lost tree, and whether they will also be untracked in the commit(s) in the full repository that use that same tree. I leave it to you to find solutions to this problem, should you actually have this problem. (It's possible that even if there are untracked files in the repository, they were and are not present in your lost tree.)
In any case, you probably want to discard any untracked files at this point. If there are any, you can use git clean -fdx to discard them. (This can be a good reason to work on a fresh mirror clone: it won't have any such untracked files in the first place, and removing such files from a real work-tree may force you to rebuild them later, which for a large project, might be many CPU-hours of computation.)
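Since git clean -fdx is destructive, it may be worth previewing what it would remove first. The sketch below is my own precaution rather than a required step: it uses a scratch repository with made-up file names (kept.txt, build.log) and the -n (dry-run) flag of git clean.

```shell
#!/bin/sh
# Scratch repository with one tracked and one untracked file.
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo" || exit 1
echo tracked > kept.txt
git add kept.txt
git -c user.name=Example -c user.email=example@example.com \
    commit --quiet -m "initial"
echo scratch > build.log        # hypothetical untracked build product

# List untracked (??) and ignored (!!) files without changing anything.
git status --porcelain --ignored

# Dry run: report what git clean -fdx would delete, but delete nothing.
preview=$(git clean -ndx)
echo "$preview"

# Only once the preview looks right, actually remove the files.
git clean -fdx
```

Tracked files such as kept.txt are left alone by git clean; only the untracked file is removed.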
Now that your Git index and work-tree are empty, we will re-fill them from the lost tree:
(cd /path/to/lost/tree; tar cf - .) | tar xf -
or:
cp -R /path/to/lost/tree/. .
or whatever, so that the work-tree is now a copy of the lost tree.
(At this point, you must throw out, from the copy, any files that should be untracked. Since we removed everything, we also removed any .gitignore files that we had before, so files that would be ignored, if this were a normal setup, won't be, unless those .gitignore files are in your lost tree. Again, how you do this, if you need to do it at all, is up to you.)
Second-to-last, we now want to populate our index from this work-tree. This part is simple:
git add .
does the trick. We now have a full index and can produce a tree object and find its hash.
The "normal" way to do this is to make a new commit. If we make a new commit now, it will have as its parent our current (HEAD) commit. It will be added to our current branch. There's nothing wrong with this, but that's not our goal at this point, so we can use a lower-level Git command, one of the so-called plumbing commands:
git write-tree
What this does is turn the index into a series of tree objects, one for each sub-directory (and that sub-directory's files) stored in the index, and one final top-level tree, for the files and sub-directories at the top level, i.e., for the directory named . (dot). The output is the hash ID of the top-level tree object just added to (or reused from) the existing Git object database:
$ git write-tree
b3bb4696cf8dcb93c1f09a447f6b7356bccb24d2
This tree hash is what we are looking for, but it's not a commit hash. We simply believe that there may be one, two, or many existing commits that have this hash as their tree.[3]
[2] An untracked file is simply one that is not in the index. This simple definition is not a problem for us unless and until it becomes a very big problem: if your lost tree contains untracked files, you don't know which ones are untracked because the index that made them untracked was part of the Git repository you lost when you acquired the lost tree in the first place!
[3] If we use git commit to make a new commit, the new commit we just made will have this hash as its tree object. That's not the commit we're looking for, of course, but if you use git commit instead of git write-tree, this is something to keep in mind.
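Putting step 1 together, here is a rough end-to-end sketch. Everything in it is a stand-in: the "full" repository, its two commits, and the lost directory are built on the spot in a temporary directory so that the result can be checked. It also uses a plain git clone as the disposable copy rather than --mirror, which is my choice for the sketch: a mirror clone is bare and has no work-tree to copy files into.

```shell
#!/bin/sh
work=$(mktemp -d)
ident="-c user.name=Example -c user.email=example@example.com"

# A stand-in "full" repository with two commits.
git init --quiet "$work/full"
cd "$work/full" || exit 1
mkdir dir
echo one > a.txt
echo two > dir/b.txt
git add .
git $ident commit --quiet -m "first"
wanted=$(git rev-parse HEAD)      # the commit we hope to rediscover
echo three > a.txt
git $ident commit --quiet -am "second"

# A stand-in "lost tree": the first commit's files, but no .git directory.
mkdir -p "$work/lost/dir"
echo one > "$work/lost/a.txt"
echo two > "$work/lost/dir/b.txt"

# Work in a disposable clone so the full repository stays untouched.
git clone --quiet "$work/full" "$work/scratch"
cd "$work/scratch" || exit 1

# Empty the index and work-tree, copy the lost tree in, and index it.
git rm -rf --quiet .
(cd "$work/lost" && tar cf - .) | tar xf -
git add .

# Write the index out as tree objects and print the top-level hash ID.
lookfor=$(git write-tree)
echo "tree of the lost copy: $lookfor"
```

In this constructed case the printed hash ID should equal the tree of the first commit, which can be confirmed with git rev-parse on that commit's hash followed by ^{tree}.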
Step 2: Finding commits that have this tree
The remaining caveat, of course, is that it's quite possible that no commit has the tree you just made; or it may be in two or more commits. The latter occurs from time to time due to git revert or trivial merges (merges that could be, but on purpose are not, fast-forwards). The way to deal with this is to find all such commits, then decide which one(s) you want.
To find these commits, our first sub-step is to enumerate every commit in the repository. We need their hash IDs, so that we can use another Git plumbing command to find their tree ID (remember, each commit has exactly one tree). The command to find every commit or other object reachable from some name is git rev-list; the option to use all names is --all; so:
git rev-list --all
does the trick. This prints each hash ID to its standard output stream, so we will now collect all those IDs and turn them into their corresponding tree hashes.
One slight wrinkle here is in the phrasing above: as run here, without --objects, git rev-list prints only commit hash IDs. An annotated tag is a name for another Git object, usually a commit object, and rev-list follows the tag to that commit rather than printing the tag object's own hash ID. So if annotated tag v1.3 and commit 1234567... both name your lost tree, we'll see just the commit's hash ID here. That's probably what we want; if you also want to learn which tags name the tree, you will have to supply the tag objects' hash IDs yourself, for instance from git for-each-ref.
In any case, to turn the rev-list ID into a tree, we will want to use git rev-parse. It's possible that an ID cannot be turned into a tree: an annotated tag object, for instance, might tag a blob object rather than a commit. So for a fully robust solution we should check, using git rev-parse --verify --quiet and checking its return value:
lookfor='...'   # put in the tree hash ID you are searching for
git rev-list --all |
while read -r hash; do
    # Skip anything that cannot be peeled to a tree.
    tree=$(git rev-parse --verify --quiet "${hash}^{tree}") || continue
    if [ "$tree" = "$lookfor" ]; then
        echo "found: $hash (type $(git cat-file -t "$hash")) names tree $lookfor"
    fi
done
(the above is untested but it's too simple to be wrong).
If this finds any objects that refer to your lost tree, you now have the hashes (of commits and/or tags) for it.
If this finds no objects, that means that either you put in the wrong tree—see the caveat about untracked files—or the tree you have does not have a corresponding commit. That does not mean it never had one: perhaps your lost tree was part of an experimental branch that was deleted and had all its commits thrown out during garbage collection, for instance. It just means that no commit has that tree now, in your full repository (or its mirror-clone).
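As an alternative to the loop, git log can print each commit's tree hash directly through the %T format placeholder, so the search reduces to a filter. This is a sketch of a different technique, not the method above, and the repository is built on the spot: it contains a commit, a change, and a git revert of that change, so two commits should share one tree.

```shell
#!/bin/sh
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo" || exit 1

# Helper (for this example only) that runs git with a throwaway identity.
g() { git -c user.name=Example -c user.email=example@example.com "$@"; }

echo one > a.txt
git add a.txt
g commit --quiet -m "first"
echo two > a.txt
g commit --quiet -am "second"
# Reverting "second" restores exactly the files of "first".
g revert --no-edit HEAD >/dev/null

# Pretend the lost tree turned out to be the first commit's tree.
lookfor=$(git rev-parse 'HEAD~2^{tree}')

# %H is the commit hash ID, %T the hash ID of that commit's tree.
matches=$(git log --all --format='%H %T' |
    awk -v want="$lookfor" '$2 == want { print $1 }')
echo "$matches"
```

Here the filter should report both the first commit and the revert commit, which is the "two or more commits" case mentioned earlier. Like git rev-list --all, this only looks at commits reachable from some name.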