
I want a plain-text pointer to a file in a git repository, at a particular revision -- specifically, a revision that modified the file in question.

One way that I can do this is to use a tuple of (path/to/file, revision) where the revision is something I can get from git log --format=oneline path/to/file.

Is there a more efficient way to achieve this?

I know that git's object database stores filenames separately from file data, but is there a way to convert from a filename to that object ID and back again?

Ian

2 Answers

git rev-parse makes the apparently difficult part easy:

$ git rev-parse ace6325:Documentation/RelNotes/2.5.0.txt
994b113178d966f7044ebe7b17d981df26ecd022

You don't need a raw SHA-1; any parseable revision will do:

$ git rev-parse HEAD~5:Documentation/RelNotes/2.5.0.txt
994b113178d966f7044ebe7b17d981df26ecd022

Given the SHA-1, git cat-file -p will print the blob's contents. git show should work as well, but it may try to apply smudge filters; I'm not entirely sure about that.

Note that this SHA-1 is a checksum of the file's actual contents, so it does not depend on the path or the commit: if the file is changed in one commit and then changed back in a later commit, the SHA-1 goes back to the old value.
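
A minimal sketch of that last point, run in a throwaway repository. The file name notes.txt, its contents, and the identity settings are made up for illustration; only git rev-parse and git cat-file -p come from the answer above.

```shell
#!/bin/sh
set -e

# Throwaway repository so nothing real is touched (hypothetical setup).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Commit 1: original contents.
printf 'version 1\n' > notes.txt
git add notes.txt && git commit -q -m "add notes"
first=$(git rev-parse HEAD:notes.txt)

# Commit 2: different contents.
printf 'version two\n' > notes.txt
git add notes.txt && git commit -q -m "change notes"
second=$(git rev-parse HEAD:notes.txt)

# Commit 3: contents changed back to what commit 1 had.
printf 'version 1\n' > notes.txt
git add notes.txt && git commit -q -m "change notes back"
third=$(git rev-parse HEAD:notes.txt)

echo "first:  $first"
echo "second: $second"
echo "third:  $third"   # same ID as first, different from second

# The blob ID alone is enough to get the contents back.
content=$(git cat-file -p "$first")
echo "$content"
```

Three different commits, but only two distinct blob IDs: the ID follows the contents, not the history.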

torek
  • So can you convert `994b113178d966f7044ebe7b17d981df26ecd022` back into a revision and path? – Ian Oct 13 '15 at 19:49
  • No, there's no guaranteed unique inversion (that file may appear in millions of commits). Moreover, as I noted above, it may disappear and then re-appear. – torek Oct 13 '15 at 20:08

Yes. Well, sort of. In git, files are stored internally as something called a blob. Blobs are identified by a SHA-1 in a manner very similar to that of a commit. You can find the SHA-1 of any file based on its contents using git hash-object:

git hash-object -- <path>

If you want to find the hash for a file at a specific revision, git hash-object also accepts file contents from standard input:

git show <revision>:<path> | git hash-object --stdin

Then you can retrieve the text of that file with git show:

git show <blob sha-1>
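
As a rough check of the round trip, the sketch below computes the blob ID three ways and then reads the contents back. The repository, the file README.md, and its contents are invented for the example, and it assumes no clean/smudge filters or line-ending conversion are configured. git rev-parse <revision>:<path> is included for comparison; it should print the same ID without piping the contents through git hash-object.

```shell
#!/bin/sh
set -e

# Throwaway repository (hypothetical setup).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

printf 'hello blob\n' > README.md
git add README.md && git commit -q -m "add README"

# ID from the working-tree file.
from_worktree=$(git hash-object -- README.md)
# ID from the contents of the file at a specific revision.
from_stdin=$(git show HEAD:README.md | git hash-object --stdin)
# ID looked up directly in the commit's tree.
from_revparse=$(git rev-parse HEAD:README.md)

echo "$from_worktree"
echo "$from_stdin"
echo "$from_revparse"

# Going back from the ID to the text.
text=$(git show "$from_stdin")
echo "$text"
```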

Going from a blob ID to a path and revision is a lot harder. The same blob can be stored at multiple paths across multiple revisions, so there's really no single definitive path/revision combination for it. It's possible to find a list of which revisions and paths contain a given blob (see Which commit has this blob?), but that's fairly complicated and doesn't really sound like what you want anyway.
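
To illustrate why there is no single answer, here is a brute-force sketch that walks every commit and lists each (commit, path) pair holding a given blob. The helper name find_blob and the files a.txt and b.txt are hypothetical, and this is not necessarily the approach from the linked question. It does not handle paths containing unusual characters and will be slow on large histories.

```shell
#!/bin/sh
set -e

# Throwaway repository with the same contents at two paths (hypothetical setup).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

printf 'shared content\n' > a.txt
git add a.txt && git commit -q -m "add a.txt"
cp a.txt b.txt
git add b.txt && git commit -q -m "add b.txt as a copy"

blob=$(git rev-parse HEAD:a.txt)

# find_blob <blob-id>: print "<commit> <path>" for every place the blob appears.
find_blob() {
    for commit in $(git rev-list --all); do
        # git ls-tree -r prints: <mode> <type> <object-id><TAB><path>
        git ls-tree -r "$commit" | while read -r mode type id path; do
            if [ "$id" = "$1" ]; then
                echo "$commit $path"
            fi
        done
    done
}

matches=$(find_blob "$blob")
echo "$matches"
```

With only two commits and one distinct blob, this already reports three locations: a.txt in both commits and b.txt in the second.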

If you really want a textual representation of a file at a specific revision, then your original idea of (<path>, <revision>) seems perfectly reasonable to me. <revision>:<path> would also work well, as that's the format accepted by git show (as demonstrated in the example above).
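
If having several valid revisions for the same file contents is a concern, one option is to normalize the revision to the most recent commit that touched the path before writing down the pair. The sketch below uses git rev-list -1 <revision> -- <path> for that; the repository and file names are made up, and it assumes a simple linear history. With merges, "the commit that changed the file" is less well defined, so treat this as a convention rather than a unique inverse.

```shell
#!/bin/sh
set -e

# Throwaway repository (hypothetical setup).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# One commit that changes README.md ...
printf 'readme v1\n' > README.md
git add README.md && git commit -q -m "add README"
changed=$(git rev-parse HEAD)

# ... followed by two commits that leave it alone.
printf 'other\n' > other.txt
git add other.txt && git commit -q -m "unrelated change 1"
printf 'more\n' >> other.txt
git add other.txt && git commit -q -m "unrelated change 2"

# Starting from different revisions, find the latest commit that touched README.md.
canonical_head=$(git rev-list -1 HEAD -- README.md)
canonical_old=$(git rev-list -1 HEAD~1 -- README.md)

# Both collapse to the same <revision>:<path> pointer.
echo "$canonical_head:README.md"
echo "$canonical_old:README.md"
```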

Ajedi32
  • My only hesitation with using revision is that for a sufficiently old file, `git show HEAD:README.md | git hash-object --stdin` is the same as `git show HEAD^^^:README.md | git hash-object --stdin`. In other words, I'm wary of having multiple valid identifiers for the same thing. – Ian Oct 13 '15 at 19:39
  • @Ian Then don't use `HEAD` as the revision. Use a commit SHA-1 or something else that doesn't change, like a tag name. – Ajedi32 Oct 13 '15 at 19:41
  • @Ian Also, you can convert a revision like `HEAD` to something more permanent using `git rev-parse`. E.g. `git rev-parse HEAD` will give you the SHA-1 of the current commit, and unlike `HEAD`, that SHA-1 won't ever change. – Ajedi32 Oct 13 '15 at 19:47
  • Sure, I meant more that using the SHA1 corresponding to `HEAD` and the SHA1 corresponding to `HEAD^^^` in the context of those commands will return the same result. – Ian Oct 13 '15 at 19:47
  • @Ian Right, `HEAD` isn't suitable to use as the revision if you want a revision that retains the same content permanently. Use a SHA1 or tag name instead, and you'll have a more permanent identifier. – Ajedi32 Oct 13 '15 at 19:49
  • Let me say it another way. If `README.md` wasn't modified between commits `abc123` and `def456`, then `git show abc123:README.md` and `git show def456:README.md` both show the same underlying data even though two different IDs were supplied. Is there a way to collapse `abc123` or `def456` to the actual revision where `README.md` was actually modified? – Ian Oct 13 '15 at 20:03
  • @Ian I think your requirements are contradictory. In order for your hypothetical function, `magically-get-file-content-from-identifier()` to always return a different value for a different identifier, the reverse function `magically-get-identifier-from-file-content()` will have to be based on file content, and nothing else. It can't return a different value based on the path or revision you got the file from, because that would result in different identifiers returning a file with the same content. (Violating the original constraint.) – Ajedi32 Oct 13 '15 at 20:31
  • @Ian And if `magically-get-identifier-from-file-content`'s return value is based solely on file content (and we've just proven it _must_ be), then assuming there are files at different paths or revisions in your repo with the same content, then by definition there will be files at multiple revisions and paths in your repo which, when passed to `magically-get-identifier-from-file-content`, return the same identifier. See what I'm saying? – Ajedi32 Oct 13 '15 at 20:31
  • @Ian Oh, and just FYI, in the above example `git hash-object` is analogous to `magically-get-identifier-from-file-content`, and `git show` is analogous to `magically-get-file-content-from-identifier`. `git hash-object` will take in file content and give you a unique identifier, and `git show` will take in an identifier and give you file content. So either I've misunderstood what you want, or what you're asking for is logically and mathematically impossible. – Ajedi32 Oct 13 '15 at 20:37
  • @Ian: it's worth considering that each commit stores whole files, not changes. Given a single commit C you can't say that C "changed file F" unless/until you identify a second commit, call it O: "C changed F with respect to what's in O". For some (most?) commits there's an obvious "O": C's (single) parent commit. For merge commits there's at least two possible "O"s and, e.g., when rebasing, there's no obvious "O" at all for some cases. In any case I wonder if you're trying to solve some other problem you haven't really described, here, because this particular solution is all about content. – torek Oct 13 '15 at 21:20
  • I am operating under some assumptions I based on the link I posted in the question. In that example they show a git object containing a filename, within which is a reference to the git object containing the file contents. It sounds like I'm making a faulty assumption -- that the git object containing the filename is limited to only one such reference. – Ian Oct 13 '15 at 22:36
  • @Ian Huh? You mean the tree object? Yes, that does contain the filename, but if say, you had two files in a directory with the same contents, that tree object would list two names pointing to the same blob. Also, every commit can have a different tree object (though it doesn't necessarily have to), so there can also be multiple different tree objects pointing to the same blob. You can think of a tree object like a directory: it can contain other directories (tree objects), or files (blobs). The only real difference is that tree objects are immutable (like most things in git). – Ajedi32 Oct 15 '15 at 13:39