
If a tarball (a .tgz file) is tracked in a Git repo, how does Git know if it has changed between commits?

I am looking to copy that behavior/functionality, so I can determine if there are changes between two different tarballs.

To restate what I am trying to do: I want to create a script that can diff tarballs without having to use Git.

Vadim Kotov
Alexander Mills
  • What exactly are you asking? Git does not treat a tar/tgz file any different from any other file. If it has changed, it has changed. Simplest way to find out: compare byte by byte. Git stores a hash for each file (a "blob" in Git lingo) and only needs to compare hashes to detect changes in files – knittl Aug 05 '18 at 18:13
  • Fine then, show me how to do that with a few bash commands, and I can pretty much guarantee some upvotes – Alexander Mills Aug 05 '18 at 18:25
  • Do you want to implement a program that diffs tarballs or create a script that uses `git diff` on committed tarballs? – kelvin Aug 05 '18 at 18:28
  • I still don't fully understand what you want to do or where you face problems. Compare two tarballs: `git diff --no-index file1.tgz file2.tgz` or just normal `diff`. Alternatively, compute the checksums and compare the checksum: `test $(sha1sum – knittl Aug 05 '18 at 18:31
  • How do you not understand what I am asking? You just answered the question. Can you add that as an answer and then explain why they are correct? For example, I don't know what `sha1sum x.tgz` actually does. Does it go through all the files in the tarball or..? – Alexander Mills Aug 05 '18 at 18:48
  • @kelvin yep I want to create a script that can diff tarballs, without having to use git – Alexander Mills Aug 05 '18 at 19:20
  • @AlexanderMills To be honest, you are confusing the issue by bringing Git into the question at all. – chepner Aug 05 '18 at 19:23
  • You might be right chepner, but I have tried to get an answer to this question in multiple ways, and formulating the question this way seemed to be the only option remaining. @knittl's comment is honestly the closest I have come to getting an answer to the OP. – Alexander Mills Aug 05 '18 at 19:27
  • It's a bit funny that the original post was already tagged with `sha1sum`, which is one possible answer – knittl Aug 05 '18 at 19:29
  • If I knew the answer I honestly wouldn't be asking the question, I think this is an art as much as it is a science because ultimately I just want to diff the contents of two tarballs, whilst ignoring file modification dates, but not ignoring file permissions, etc etc. I just don't know enough about how sha1sum or checksums work to come to an answer myself. – Alexander Mills Aug 05 '18 at 19:31

1 Answer


Git knows if a tar file has changed the same way it detects changes in any other file: it compares the contents. This can be done naïvely, by comparing the files byte by byte, or by computing a hash of each file and comparing the hashes. Since Git internally stores all known files with their hash, it can compare the hashes instead of doing the expensive byte-by-byte comparison.
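As a rough illustration of the hash Git stores for a file: the blob ID is the SHA-1 of a short header (`blob`, a space, the size in bytes, a NUL byte) followed by the file's contents, which is why it differs from plain `sha1sum` output. The sketch below assumes GNU coreutils; the function name `git_blob_sha1` is made up for this example and is not a Git command.

```shell
#!/bin/sh
# git_blob_sha1 FILE: print the SHA-1 that Git would assign to FILE's
# contents as a blob (hypothetical helper, equivalent in intent to
# `git hash-object FILE`).
git_blob_sha1() {
    # Size of the contents in bytes; fails if FILE cannot be read.
    size=$(wc -c <"$1") || return 2
    # Hash the header "blob <size>\0" followed by the raw contents.
    { printf 'blob %d\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}
```

If Git is installed, the result can be checked against `git hash-object file1.tgz`.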

To make use of this functionality, you could simply use Git itself to compare any two files on your file system:

git diff --no-index file1.tgz file2.tgz

Or, if you don't have Git available, you could use the plain `diff` command instead (or `cmp`, which compares two files byte by byte).

Another option would be to manually compute checksums of the two files and compare the checksums instead. If the checksums are different, then the files are guaranteed to be different. If the checksums are identical, it is very likely that the file contents are also identical, but there is still the possibility of a hash collision. To be certain, you would then have to compare the files byte by byte.

A simple way to compute and compare checksums of two files would be the following:

test "$(sha1sum <file1)" = "$(sha1sum <file2)"

Note the input redirection (`<`): when sha1sum reads from standard input it prints `-` in place of the file name, so the two outputs are the same even if the files have different names.
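The checksum-then-verify idea described above can be sketched as a small function. This is only an illustration, assuming GNU coreutils (`sha1sum`, `cmp`); the name `same_file` is hypothetical.

```shell
#!/bin/sh
# same_file A B: exit 0 if A and B contain identical bytes, 1 if they
# differ, 2 if either file cannot be read. (Hypothetical helper name.)
same_file() {
    # Reading from stdin keeps the file name out of sha1sum's output.
    sum_a=$(sha1sum <"$1") || return 2
    sum_b=$(sha1sum <"$2") || return 2
    # Different checksums: the files are definitely different.
    [ "$sum_a" = "$sum_b" ] || return 1
    # Equal checksums: confirm byte by byte to rule out a collision.
    cmp -s "$1" "$2"
}
```

Usage would look like `same_file file1.tgz file2.tgz && echo identical`. Like `sha1sum` itself, this compares the raw bytes of the archives, including any metadata stored inside them.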

You can of course use any other hashing algorithm, such as sha256sum.

knittl
  • how does `sha1sum x.tgz` differ from `tar -xOzf x.tgz | sort | sha1sum`? They result in different values for a given tarball. – Alexander Mills Aug 05 '18 at 19:29
  • `sha1sum file` computes the SHA1 hash of the (potentially compressed) file, your other command extracts all files of the archive to stdout, sorts all lines in the output, and then computes the hash of the sorted output. Why would that give identical results? – knittl Aug 05 '18 at 19:31
  • Well, that is the question. If we want to check to see if the contents of two tarballs is the same, it seems "unfair" to just run `sha1sum x.tgz`. The reason is I assume that that command will include information about file modification times, or extraneous info that I don't want to take into account, whatever that might be. I ultimately just care about the contents of the files, and the file permissions. – Alexander Mills Aug 05 '18 at 19:35
  • Then you probably should have specified that in your question. Git doesn't do anything magically and cannot know what parts of the file you consider relevant for the file to be considered _changed_. If a single bit in the tar file has a different value, then the file is a different file. – knittl Aug 05 '18 at 19:37
  • That makes sense. Do you happen to know if the filenames matter when running `test "$(sha1sum – Alexander Mills Aug 05 '18 at 19:40
  • Yes, it matters. If a single bit is different in the files, then the files are different files. Even if it weren't, I would consider two tarballs different tarballs if they have files with different names in them. – knittl Aug 05 '18 at 19:41
  • Yep, I was just making sure that different filenames would mean a different hash. I need to read more about how files are actually stored on the fs, but I assume it's different when they are tarballed, that's what makes this a little confusing. – Alexander Mills Aug 05 '18 at 19:44
  • thanks, I will probably accept this answer tomorrow, I doubt I will get a better one – Alexander Mills Aug 05 '18 at 19:46
  • the one thing more I would ask is - do you think there is a better way to compare two tarballs than `test "$(sha1sum – Alexander Mills Aug 05 '18 at 19:49
  • Again, computing the sha1 sum will give different output for files that differ in a single bit. Whatever you consider "meaningless" is subjective. You cannot teach the `sha1sum` tool to ignore certain aspects of its input. If you want more control over the comparison, write your own program which performs exactly the steps that you require and ignores everything else. If you want to know if `sha1sum` suits your needs, run it against some sample files of yours which you consider identical and some others which you consider different and verify the expected vs the actual behavior. – knittl Aug 05 '18 at 19:53
  • Ok so sha1sum just reads in bits from start to finish and the order matters obviously and that's how it generates the hash (somehow). I guess that goes back to my question about how `tar -xOzf x.tgz | sort | sha1sum` differs from `sha1sum x.tgz`. Understanding that difference would help understand more about how tarballs are structured or something, to find out if there is any info in the tarball that I can safely ignore. – Alexander Mills Aug 05 '18 at 20:03
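For the kind of comparison discussed in the comments above (file contents and permissions, but not modification times), one possible approach is to extract both archives and compare a sorted manifest of `mode sha1 path` lines. This is a minimal sketch, not something from the answer: it assumes GNU tar and coreutils (`stat -c`), gzip-compressed archives, and file names without newlines; it only looks at regular files (directories and symlinks are ignored); and the function names are hypothetical.

```shell
#!/bin/sh
# tar_manifest ARCHIVE: print one "mode sha1 path" line per regular file
# in a gzip-compressed tarball, sorted by byte order. Modification times
# and ownership are deliberately left out.
tar_manifest() {
    dir=$(mktemp -d) || return 2
    # -p keeps the permissions recorded in the archive.
    tar -xpzf "$1" -C "$dir" || { rm -rf "$dir"; return 2; }
    ( cd "$dir" && find . -type f -exec sh -c '
        for f do
            printf "%s %s %s\n" "$(stat -c %a "$f")" \
                "$(sha1sum <"$f" | cut -d" " -f1)" "$f"
        done' sh {} + | LC_ALL=C sort )
    rm -rf "$dir"
}

# same_tar_contents A B: exit 0 if both manifests match, 1 if they
# differ, 2 if either archive could not be extracted.
same_tar_contents() {
    m1=$(tar_manifest "$1") || return 2
    m2=$(tar_manifest "$2") || return 2
    [ "$m1" = "$m2" ]
}
```

Running `tar_manifest x.tgz` on its own also shows which parts of the archive this comparison takes into account, which may help when deciding what can safely be ignored.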