
Question

Is it possible to compute a git hash of a file or directory outside of and independent of any git repository?

Motivation

I want to use this method to identify differences in generated artifacts (e.g. css generated from sass).

The benefit of doing this with git would be that the hashes can be compared against existing file hashes in a git history, to see if they look familiar.

Background

In "How to compute the git hash-object of a directory?" we learn how to compute the git hash of a directory. That method only works if the directory is inside a git repository:

git ls-files -s somedirectory | git hash-object --stdin

From my understanding of git, the git hash of a file or directory depends only on the file or directory contents, and perhaps on file permissions, but not on anything else in the repository.

Known methods

Yes, we could create a temporary repo, but why take that extra step?

donquixote

1 Answer


Files are easy; directories are hard. Read the Python code for directories. For a file, the hash is just the checksum (SHA-1 for now, SHA-256 in the future) of the file's contents preceded by a blob header: the word `blob`, a space, the size of the contents in bytes written in ASCII decimal, and a NUL byte that separates the header from the data. That is, for a twelve-byte file containing `hello world` and a newline, the input to sha1sum (or whatever your local command or method of computing a SHA-1 checksum may be) is `blob 12\0hello world\n`.

(For plain files you can also simply use `git hash-object`. Directories remain hard.)
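
The file case can be sketched in a few lines of Python. This is a minimal illustration, assuming SHA-1 object IDs and raw file bytes with no line-ending or filter conversion; the function name `git_blob_hash` is made up for this example.

```python
import hashlib


def git_blob_hash(data: bytes) -> str:
    """Return the SHA-1 blob ID git would assign to these raw bytes."""
    # Header: the word "blob", a space, the size in ASCII decimal, then a NUL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# The twelve-byte example from the answer: "hello world" plus a newline.
print(git_blob_hash(b"hello world\n"))
# → 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

To hash a file on disk, read it in binary mode (`open(path, "rb").read()`) and pass the bytes in; the result should agree with `git hash-object` on the same file as long as no conversion filters apply.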

torek
  • Thanks for the quick reply. I think it might help to split the answer into sections (with headlines) for files vs directories, and also add some structure. And then some detail why dirs are hard. `git hash-object` is fine for me for files, but the additional info you gave is useful too. (don't get me wrong I am giving +1 and might also accept it after some wait time) – donquixote Oct 26 '21 at 22:56
  • Could there be a scenario where two dirs with identical content have different hashes? E.g. if one of them is the repo root dir, and the other is a subdir in a bigger repo? I was able to confirm that two distinct and independent repos with same content do have the same tree hash on the toplevel dir. But not sure how to analyse subdirs. `git show --format=raw TREE_HASH` does not show me the hashes of files and subdirs, which is sad. – donquixote Oct 26 '21 at 23:17
  • Actually I can confirm that subdir gets same hash as root dir if same content (as to be expected), using `find .git/objects/*/*` and some detective work. Only `git ls-files -s PATH | git hash-object --stdin` seems to be totally unreliable. – donquixote Oct 26 '21 at 23:27
  • The hash of a subdirectory depends on the file names and modes *in* that directory plus the blob hash IDs of those files. So if you have two identical sets of names and hash IDs, you get identical tree hashes. To view the hash ID of some tree in a commit, use `git rev-parse`, e.g., `git rev-parse master:Documentation/RelNotes` (in the Git repo for Git I currently get `3fdf1a9d5e5b11d53f4d4a991f81dda446ded0f4`). – torek Oct 27 '21 at 00:26
  • When using `git ls-files -s`, you're getting the result of scanning the index. Turning the index into a tree (`git write-tree`) gets you the hash ID for the complete snapshot. – torek Oct 27 '21 at 00:27
  • For files you can read [https://git-scm.com/book/en/v2/Git-Internals-Git-Objects](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects) chapter "Object Storage" or the full document, @donquixote – Alexey Burdin Jul 07 '23 at 11:58
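
The comments above say that a tree hash depends only on the names, modes, and hash IDs of the entries in the directory. Here is a rough Python sketch of that idea, not a definitive implementation. It assumes SHA-1 object IDs and the usual tree object layout (each entry is the mode, a space, the name, a NUL byte, and the 20-byte binary hash, sorted by name with directories compared as if their name ended in `/`). The helper names `git_object_hash`, `git_tree_hash`, and `hash_path` are invented for this example. The sketch ignores `.gitignore`, submodules, and line-ending or filter conversion, and it would also walk into a `.git` directory if one is present.

```python
import hashlib
import os
import stat


def git_object_hash(kind: bytes, body: bytes) -> bytes:
    """SHA-1 over '<kind> <size>\\0' followed by the body; returns 20 raw bytes."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).digest()


def git_tree_hash(entries) -> bytes:
    """entries: iterable of (mode, name, raw_hash), e.g. (b"100644", b"a.txt", ...)."""
    def sort_key(entry):
        mode, name, _ = entry
        # Directories sort as if their name had a trailing slash.
        return name + b"/" if mode == b"40000" else name

    body = b"".join(
        mode + b" " + name + b"\0" + raw_hash
        for mode, name, raw_hash in sorted(entries, key=sort_key)
    )
    return git_object_hash(b"tree", body)


def hash_path(path):
    """Return (mode, raw_hash) for a file, symlink, or directory.

    Returns None for a directory with no hashable content, since git
    does not record empty directories.
    """
    st = os.lstat(path)
    if stat.S_ISLNK(st.st_mode):
        # A symlink is stored as a blob whose content is the link target.
        return b"120000", git_object_hash(b"blob", os.fsencode(os.readlink(path)))
    if stat.S_ISDIR(st.st_mode):
        entries = []
        for name in os.listdir(path):
            result = hash_path(os.path.join(path, name))
            if result is not None:
                entries.append((result[0], os.fsencode(name), result[1]))
        if not entries:
            return None
        return b"40000", git_tree_hash(entries)
    # Only the owner's executable bit distinguishes the two regular-file modes.
    mode = b"100755" if st.st_mode & stat.S_IXUSR else b"100644"
    with open(path, "rb") as f:
        return mode, git_object_hash(b"blob", f.read())


# Sanity check: the tree with no entries has a well-known ID.
print(git_tree_hash([]).hex())
# → 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

For a directory on disk, `hash_path("somedirectory")[1].hex()` gives the candidate tree ID, which can be compared with the output of `git rev-parse COMMIT:PATH` in a repository that contains the same content.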