Add an immutable unique id to each file

Question

I want to add a unique ID to each file that is committed to Git. This ID should never change, regardless of any other changes made to the file.

There is a similar question here: Unique identifier for file in Git Repository, where it is noted that by default Git does not add a unique ID to files that are committed. However, that does not fully explain how you might achieve this.

Can anyone advise if it is possible to add a unique immutable id to each committed file?

the sha1 + the file path could give you such value; if you need a sha1 or md5, just pass them to md5 or sha1sum — OznOg, Oct 01 '18 at 19:40
Why do you want such ID? The best unique ID for a file is its path from the repository root (well, it's unique until the file's renamed or moved). Do you need anything better than that? — phd, Oct 01 '18 at 19:47
I am referencing the file in a database table so I need an ID that links the file to the record in the database table. — SSS, Oct 01 '18 at 19:49

score 1 · Answer 1 · answered Oct 01 '18 at 20:19

Given a path name within a repository, and optionally a commit specifier, you can examine Git's unique ID for the file's content:

$ git hash-object -t blob Makefile
5a969f5830a4105d3e3e6236eaa51e19880cc873
$ git rev-parse :Makefile
5a969f5830a4105d3e3e6236eaa51e19880cc873
$ git rev-parse HEAD:Makefile
5a969f5830a4105d3e3e6236eaa51e19880cc873

(These three copies of the file are all identical, in this case. Makefile is in the work-tree, :Makefile is in the index, and HEAD:Makefile is in the current commit.)

$ git rev-parse v2.1.0:Makefile
2320de592e6dbc545866e6bfef09a05f660c2c14

(The version of Makefile in the commit tagged v2.1.0 is not the same as the three above.)

Note that although Git still uses SHA-1, the blob ID is not the same as the SHA-1 of the file's raw content:

$ sha1sum Makefile
857f75d0f314501dfdfcc5b6a4306eba1faddd31  Makefile
$ python
[python startup messages]
>>> import hashlib
>>> hashlib.sha1(open('Makefile', 'rb').read()).hexdigest()
'857f75d0f314501dfdfcc5b6a4306eba1faddd31'

This is because Git checksums the data after prepending a header (the word blob, a space, the size in bytes, and a NUL byte):

>>> data = open('Makefile', 'rb').read()
>>> hashlib.sha1('blob {}\0'.format(len(data)).encode('ascii') + data).hexdigest()
'5a969f5830a4105d3e3e6236eaa51e19880cc873'
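The same calculation can be wrapped in a small function. This is only a sketch: git_blob_id is a name made up for illustration, and it assumes a repository using the default SHA-1 object format. The two inputs are the empty file and the six bytes "hello\n".

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 blob ID Git would compute for these bytes."""
    # Git hashes "blob <size>\0" followed by the raw content.
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()

print(git_blob_id(b""))         # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

Because the ID depends on every byte of the content, any edit to the file produces a different ID. A blob ID therefore identifies one version of the content, not the file over its lifetime.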

Note, however, that if you add a header to a file and then checksum the resulting file, you get a new and different checksum, because you are now checksumming the header plus the file data. If you store that new checksum in the file and checksum the result, you get yet a third checksum. To avoid this problem of ever-changing checksums, you need either a weaker checksum, one where you can compute the right input to get a desired output (e.g., an IP-header-style checksum), or a checksum computed over the data excluding the checksum itself. Or, of course, you can store the checksum outside the file, as Git does.
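As a rough sketch of the "checksum the data excluding the checksum itself" option: the marker line "# checksum: " and both function names below are invented for this example and are not a Git convention. The checksum is stored in a marker line, and that line is skipped whenever the checksum is computed, so re-stamping an unchanged file does not change it.

```python
import hashlib

MARKER = "# checksum: "  # hypothetical marker line, not something Git knows about

def strip_marker(text: str) -> str:
    """Return the text with any marker line removed."""
    return "".join(
        line for line in text.splitlines(keepends=True)
        if not line.startswith(MARKER)
    )

def stamp(text: str) -> str:
    """Return the text with one up-to-date marker line at the top."""
    body = strip_marker(text)
    digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
    return MARKER + digest + "\n" + body

original = "all:\n\techo hi\n"
stamped = stamp(original)
print(stamp(stamped) == stamped)  # True: stamping again changes nothing
```

This only stops the checksum from chasing itself. It still changes whenever the rest of the file changes, so on its own it is not an immutable ID.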

If you have some other source for unique identifiers, you can just generate them, rather than linking them to the file's content. How to do that is up to you.
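For example, one possible approach (an assumption on my part, not something Git provides) is to generate a random UUID per file and keep the path-to-ID mapping outside the file, such as in the database table itself or in a small committed index. The names assign_id and rename are made up for this sketch, and the mapping is just an in-memory dict here.

```python
import uuid

def assign_id(ids: dict, path: str) -> str:
    """Return the ID recorded for path, generating one on first use."""
    if path not in ids:
        ids[path] = str(uuid.uuid4())  # random ID, unrelated to file content
    return ids[path]

def rename(ids: dict, old_path: str, new_path: str) -> None:
    """Carry the ID over when a file is moved or renamed."""
    ids[new_path] = ids.pop(old_path)

ids = {}
first = assign_id(ids, "Makefile")
print(assign_id(ids, "Makefile") == first)  # True: same ID on later lookups
rename(ids, "Makefile", "GNUmakefile")
print(ids["GNUmakefile"] == first)          # True: ID survives the rename
```

The ID never depends on the content, so edits do not affect it. The cost is that renames and deletions have to be recorded in the mapping by whatever tooling you use, since Git will not do that for you.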
