How to find the file a specific blob is associated with?

Question

I accidentally deleted some files using git clean -- while trying to remove them from the working directory after creating a .gitignore file.

I remember that I had staged them at one point, so I ran git fsck --dangling, and now have a list of blobs, trees, and commits. If I look at the blobs, I only see the content, not which file the content came from.

The files I deleted were data files (8-3-21.csv, 8-4-21.csv, etc.). Many of these data files are similar, with only minor changes, so I cannot tell them apart by content alone.

So I want to see if these dangling blobs are associated with the data files I deleted accidentally (2 files).

Does this answer your question? [Recover dangling blobs in git](https://stackoverflow.com/questions/9560184/recover-dangling-blobs-in-git) Basically that information is lost. A dangling blob is dangling. — matt, Aug 06 '21 at 03:28
@matt No the information isn't lost as it's a dangling blob. I can see contents of the blobs and it seems to be in there, but there are so many blobs. I just need to find the filenames associated with the blobs. — MasayoMusic, Aug 06 '21 at 03:37
*Trees* record the association between blobs and filenames (well, paths, but they include filenames obviously), not the blobs themselves, which are technically unnamed (and can be referenced under different trees with different paths). Did you try to search dangling trees for your filename? — Romain Valeri, Aug 06 '21 at 07:22

LeGEC · Answer 1 · 2021-08-07T06:01:16.463

2

If you only staged then unstaged your files, git has written their content (in blobs) but hasn't stored their names (for some reason, git currently doesn't create a tree object when the index is updated).

If your files are part of these dangling blobs, you are left with identifying them through their content only.

You can use git grep to grep through a set of git objects.
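As a rough sketch of that idea, the snippet below builds a throwaway repository, stages and unstages a file so that its blob becomes dangling, and then greps the dangling blobs for a known string. The file name, file contents, and search string are made up for illustration; in a real repository only the last two commands would be needed.

```shell
# Throwaway repository so the sketch is self-contained (hypothetical file name and data).
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Stage a file, then unstage it: the blob stays in the object store, unnamed.
printf 'date,value\n2021-08-03,42\n' > 8-3-21.csv
git add 8-3-21.csv
git rm -q --cached 8-3-21.csv

# Collect the hash IDs of all dangling blobs reported by git fsck.
blobs=$(git fsck --dangling 2>/dev/null |
    awk '$1 == "dangling" && $2 == "blob" { print $3 }')

# Search those blobs for a string expected in the lost file.
# $blobs is left unquoted on purpose so each hash becomes its own argument.
matches=$(git grep -e '2021-08-03' $blobs)
echo "$matches"
```

Each matching line is printed together with the hash ID of the blob it was found in, which narrows down the blobs worth inspecting with git show.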

You can use a trick to find the date when a dangling blob was created: look at the creation date of the file .git/objects/da/nglingblobhash

You may look at the script in this other answer for a way to check the dates of a complete set of blobs.
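This is not the script from that other answer, just a minimal sketch of the same idea. It assumes the dangling blobs are still loose objects (one file per object under .git/objects) and uses a throwaway repository with a made-up file name so it can run on its own.

```shell
# Throwaway repository with one dangling blob (hypothetical file name).
repo=$(mktemp -d)
cd "$repo"
git init -q .
printf 'sample data\n' > sample.csv
git add sample.csv
git rm -q --cached sample.csv

# For every dangling blob, list the loose object file with its timestamp.
listing=$(
    git fsck --dangling 2>/dev/null |
    awk '$1 == "dangling" && $2 == "blob" { print $3 }' |
    while read -r hash; do
        # Loose objects live at .git/objects/<first 2 hex digits>/<remaining digits>.
        dir=$(printf '%s' "$hash" | cut -c1-2)
        rest=$(printf '%s' "$hash" | cut -c3-)
        obj=".git/objects/$dir/$rest"
        # Packed objects have no file of their own, so only list files that exist.
        if [ -f "$obj" ]; then
            ls -l "$obj"
        fi
    done
)
echo "$listing"
```

The timestamp shown by ls -l is the file's modification time, which for a loose object is normally the moment git add wrote it.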

To answer your comment:

# print each hash followed by the first 5 lines of that blob
while read -r hash; do
    echo "===== $hash"
    git show "$hash" | head -5
done < list_of_blobs.txt

edited Aug 07 '21 at 06:01

answered Aug 06 '21 at 09:37

LeGEC


The "some reason" is: there's no point as the index might be updated again; we need to create multiple tree objects, and the original overall design deferred this until `git commit` time. The index does (now) have the ability to cache tree object hash IDs to speed up the re-use of existing subtrees, but it still seems wiser to defer making *new* trees. – torek Aug 06 '21 at 13:46
The file creation date might help, as only one file is created daily. I will try this out. – MasayoMusic Aug 06 '21 at 21:54
I've checked out your other post. I don't see blobs for the days when the data is supposed to be there, probably because I left the files untracked for a few days and then ran git add on one day for multiple files. I still haven't really studied bash (I have only used Windows cmd), but is there a way to also get the first 5 lines of each blob along with the file metadata that's being printed? Thank you. – MasayoMusic Aug 06 '21 at 22:57

score 2 · Accepted Answer · answered Aug 06 '21 at 13:42

Blob objects store only the file's data, not the path name (nor the mode).

What this means is that if we make one commit, or many commits, containing the same data, we get the same blob hash ID:

$ echo test data > file
$ git add file
$ git commit -m "add some test data"
[commit message here]
$ git rm file
$ git commit -m "remove the test data"
[commit message here]
$ echo test data > different-name
$ git add different-name
$ git commit -m "add the same data under another name"
[commit message here]

If we inspect these commits, we will find that both files, file and different-name, have the same blob hash ID, even though they have different file names and do not coexist in adjacent commits. In fact, the blob hash ID of test data\n is:

$ echo test data | git hash-object -t blob --stdin
082b3465b6ac4b857f930b655c1cdb398aa6c465

This is the hash ID of any blob holding exactly that string. The hash ID of a blob holding hello world\n is equally predictable:

$ echo hello world | git hash-object -t blob --stdin
3b18e512dba79e4c8300dd08aeb37f8e728b8dad

What all of this means is that the file contents alone, not the file's name, determine the hash ID; if the contents themselves are not unique to that one path-name, there are multiple file names for that blob. This is how Git de-duplicates file content across commits (or even within commits).
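A small sketch of that point, using a throwaway directory: two files with different names but identical contents hash to the same blob ID.

```shell
# Throwaway repository; the two file names are arbitrary.
work=$(mktemp -d)
cd "$work"
git init -q .

echo hello world > file
echo hello world > different-name

# hash-object without -w only computes the ID; nothing is written.
id_a=$(git hash-object file)
id_b=$(git hash-object different-name)
echo "$id_a"
echo "$id_b"
```

The same mechanism can help with recovery: if a copy of a lost file still exists somewhere (a backup, an export, an email attachment), running git hash-object on that copy gives a hash ID that can be compared directly against the list of dangling blobs.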

As Romain Valeri noted in a comment, the names are stored in tree objects. Technically, a tree object stores a (sorted) list of 3-tuples: mode, name-component, hash-ID. The git add command prepares a file for committing by using the equivalent of git hash-object -w on the file's contents, to store the blob object into the repository database or find any existing blob object with that hash ID, and then writing the corresponding hash ID into Git's index. Git does not yet create any tree object for this.
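The two steps can be imitated by hand with plumbing commands. This is only an approximation of what git add does, run in a throwaway repository, but it shows that afterwards the object database holds a blob and no tree.

```shell
# Throwaway repository with a single file.
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo test data > file

# Step 1: store the contents as a blob and capture its hash ID.
blob=$(git hash-object -w file)

# Step 2: record mode, hash ID, and path in the index.
git update-index --add --cacheinfo 100644,"$blob",file

# The index now knows the path name...
git ls-files --stage

# ...but the object database contains only the blob, no tree object.
types=$(git cat-file --batch-all-objects --batch-check='%(objecttype)')
echo "$types"
```

Once the index entry is removed or overwritten, the blob is all that remains, which is why the path name cannot be recovered from the object database.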

Later, if and when you run git commit, the commit code uses the equivalent of git write-tree to turn Git's index contents into one or more tree objects, re-using or creating new tree objects as needed. The index contains the file's path name, including (forward) slashes, such as path/to/file.ext; git write-tree reads this and figures out that, in order to store the file, we'll need at least three internal tree objects:

One tree object will contain path, with mode 040000 (though leading zeros are actually suppressed in the internal format), and a hash ID. That will be the hash ID of the next tree object:
One tree object will contain to, with mode 040000 (again with leading zeros suppressed), and another hash ID:
The last—or first, in some sense—tree object will contain file.ext, with mode 100644 or mode 100755 as seen in Git's index, and the hash ID as seen in Git's index.

By using these three tree objects, Git will later be able to re-create, in a new index file, the path/to/file.ext string with the mode 100644 or mode 100755 part and the correct blob hash ID. From there, Git will create or update the file path/to/file.ext, perhaps by creating a folder path, then a folder path\to, and finally a file file.ext in the to folder in the path folder.
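The three tree objects can be observed directly by calling git write-tree by hand in a throwaway repository. This is a sketch; the path mirrors the path/to/file.ext example above.

```shell
# Throwaway repository with one nested file.
repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir -p path/to
echo test data > path/to/file.ext
git add path/to/file.ext

# Turn the index into tree objects; prints the hash ID of the top-level tree.
root=$(git write-tree)

git ls-tree "${root}"            # one entry: tree "path"
git ls-tree "${root}:path"       # one entry: tree "to"
git ls-tree "${root}:path/to"    # one entry: blob "file.ext"

# Count the tree objects now present in the object database.
tree_count=$(git cat-file --batch-all-objects --batch-check='%(objecttype)' |
    grep -c '^tree$')
echo "$tree_count"
```

Until write-tree (or git commit) runs, none of these trees exist, so a file that was only staged leaves no record of its name once the index changes.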

So, as noted in comments, if the contents are unique, you'll be able to find this dangling blob (using git fsck as you did), but Git never got around to storing the file's name anywhere except its own index, which it has since overwritten. While it seems to be partly broken in current Git releases, git fsck --lost-found followed by "grep"-ing for contents in the resurrected dangling blobs is usually the way to go here.
