I'm trying to export all files with differences between two commits, those differences being:

  • New files (Added)
  • Modified files
  • Renamed files
  • If possible, information on any deleted files

Detecting renames may be tricky: I will be doing the export on Windows 7, where somefile.php and SomeFile.php are the same file, but I will be uploading to a *nix environment, which treats them as different files. Case-only renames therefore need to be recognized and exported if possible.

I was using the below command:

git diff-tree -r --no-commit-id --name-only --diff-filter=ACMRT $head_commit_id $older_commit_id | xargs tar -cf project.tar -T -

However I noticed it was not exporting new/added files and also was not exporting renamed files; I then found out that git diff-tree doesn't do rename detection by default, so from what I can see I would need to add --find-renames to the command?

David Jones
Brett
  • You are not getting added files because, as I told you earlier, you are doing your diffs *backwards*. Files that were actually added will, in this backwards diff, show up as *deleted*. Incidentally (something I failed to mention in your earlier question), you should be able to direct Git to pretend, for the duration of your diffing, that the OS *is* case sensitive by using `git -c core.ignorecase=false diff-tree ...`. – torek Feb 25 '17 at 18:39
  • @torek Could you explain how I am doing it _backwards_? Is it because of the position of the `SHA1 SHA2` id's? Meaning, this part: `$head_commit_id $older_commit_id`? Also, thanks for the tip about case. – Brett Feb 26 '17 at 20:35
  • Yes: `git diff` takes (in this case) two arguments `<from>` and `<to>` and produces its best guess at the shortest set of instructions that, if followed, will transform an existing `<from>` snapshot into a `<to>` snapshot. "Shortest" is of course modifiable (via allowing options like rename or copy, or through selection of `--diff-algorithm=` and the new `--compaction-heuristic`). So if `HEAD` is newer than `$old`, `git diff HEAD $old` tells you how to turn the newer HEAD commit into the old one. – torek Feb 26 '17 at 21:28
  • To summarize all these comments: Git *doesn't tell you what you actually did*. Its various diffs produce *instructions on how to change one commit to another*. You choose the time-direction of these instructions by which commit you list first, and which you list second. – torek Feb 26 '17 at 21:44
  • @torek Ok great - thanks again! – Brett Feb 27 '17 at 16:02
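
The direction point made in the comments above can be checked in a scratch repository. This is only a minimal sketch and not part of the original thread: the temporary directory, the demo commit identity, and the file names base.txt and new.txt are all made up for illustration.

```shell
set -e
# Hypothetical throwaway repository, used only for this illustration.
work=$(mktemp -d)
cd "$work"
git init -q .
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

# Older commit: contains only base.txt.
echo base > base.txt
git add base.txt
commit older
older=$(git rev-parse HEAD)

# Newer commit: adds new.txt.
echo new > new.txt
git add new.txt
commit newer
newer=$(git rev-parse HEAD)

# Older commit first: "how do I turn older into newer?" -> new.txt is Added.
forward=$(git diff-tree -r --name-status "$older" "$newer")
# Newer commit first: "how do I turn newer into older?" -> new.txt is Deleted.
backward=$(git diff-tree -r --name-status "$newer" "$older")

printf 'older -> newer: %s\n' "$forward"
printf 'newer -> older: %s\n' "$backward"
```

With the arguments in the second order, a filter such as `--diff-filter=ACMRT` drops new.txt, because in that direction it is a deletion.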

2 Answers

Why not use this simple command:

git diff --name-only SHA1 SHA2
# or

# --name-status will display the name and the status of the files
git diff --name-status SHA1 SHA2

# To display untracked files use the -u
git status -u

And in git you should rename files only with the git mv command.

CodeWizard
  • Something that simple will do everything I want? :) ....also, when I rename files I do it within my IDE with the _refactor_ command, which I believe runs it through the VCS since I have it integrated with Git as the VCS. – Brett Feb 25 '17 at 15:59
  • Ok I tried it........ worked well except for one problem. It included the renamed files, but it _also_ included the former files that got renamed; so for example; it included _myFile.php_ as well as _MyFile.php_ when it should have only included the new one that it was renamed to, that being _MyFile.php_. – Brett Feb 25 '17 at 16:46
  • It should have them both, to show you what the currently committed file name was. If you did not do git mv (which I assume your IDE does), you should not see a moved file, but you will see a deleted file and a new file. – CodeWizard Feb 25 '17 at 16:48
  • Hmmmm ok....... no way to get rid of them, as I export them to a tar file and don't really need them both - other than some kind of report in a file if possible. – Brett Feb 25 '17 at 16:52
  • I'll take that as a no, there is no way to get rid of them? ...and yes, the answer is helpful, but it still contains unwanted data so I will wait a little while to see if any other answers come in before accepting :) – Brett Feb 25 '17 at 17:31
  • Sure. Thank you very much – CodeWizard Feb 25 '17 at 17:34
  • See the `--diff-filter` option, you can exclude deleted files with `--diff-filter=d`. – jthill Feb 25 '17 at 18:40
  • @CodeWizard: it does not matter whether you use `git mv`, as Git does rename *detection* at `git diff` time, rather than directly recording modifications to pathnames. – torek Feb 25 '17 at 18:46
  • Correct me if I'm wrong, but when you run git status, behind the scenes there are 4 `git diff`s running. When I rename a file via `mv` instead of `git mv`, Git marks it as deleted + new file. This is what I mean. – CodeWizard Feb 25 '17 at 18:49
  • `git status` runs *two* `git diff`s, one from `HEAD` to index, and one from index to work-tree. Once you record the addition of the new name, and deletion of the old name, in your index, though, the output of `git status` will change. What `git mv` does is do all that for you at once, conveniently. (And, since Brett is running diffs of two commits, there is no index and work-tree complication.) – torek Feb 25 '17 at 18:51

As in CodeWizard's answer, you can use the "user-friendly" (or porcelain) command git diff instead of git diff-tree, which is what Git calls a plumbing command, meant for use in scripts. You should, however, be aware of what this means.

Since porcelain commands are meant for humans, they try to present things in human-readable fashion. This means they obey any setting that the one human in particular has set for himself/herself, in the various configuration files. That includes the diff.renames and diff.renameLimit configurations. They may also modify their output to make it easier for eyeballs, yet harder for computer programs, to deal with. Worst, they may change their output from one Git version to another, if people seem to prefer some default.

Since plumbing commands are meant for scripts rather than for humans, they behave in predictable ways, with output that does not change and does not depend on configuration items. That way, whatever you request, you get: you will get reliable output in a reliable form, so that if you write your own reliable code, it will not just work today, for one case; it will keep working in the future, for all cases where it can. [1]

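In the end, what this means is that if you use git diff-tree and set the right flags, you will get more reliable output. If you use git diff, your rename detection depends on:

  • the user's diff.renames and diff.renameLimit settings, and
  • the Git version (rename detection has been enabled by default in git diff since Git 2.9).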

As you discovered, the output from rename-detection is two pathnames, which is not something you can just pipe to an archiver. Archivers in general have issues with file deletion (this is, perhaps, one classic difference between archives and backups / snapshots; note that both of these are related to version control as well).

If your goal is a sort of union of all files—i.e., if the diff says that a file named A was added, one named D was deleted, and file R was created by renaming the old name O (and perhaps also modifying it: note Git's similarity index number that comes after the letter R), then you wish to collect file A, ignore file D, and collect file R while ignoring file O—well, then, what you want is to not detect renames in the first place! If you do not detect renames—which git diff-tree does not by default—this same diff will be presented as: add file A, delete file D, delete file O, and add file R. So a git diff-tree with a diff-filter that includes AM and excludes D suffices. It is less clear what to do with T, which is for a type-change: from ordinary file to symbolic link, for instance, or from file to sub-repository commit hash (what Git calls a gitlink entry, for a submodule).

Similarly, you don't want to enable copy detection: a C status, like R, presents a similarity index and a pair of pathnames. If you leave it disabled, you simply get the new pathname as an Added file.
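
The approach described above (no rename or copy detection, keep only added and modified paths) might look like the following. It is a minimal sketch, not a drop-in script: the scratch repository, the demo identity, the single-letter file names, and the archive name export.tar are all assumptions made for the example.

```shell
set -e
# Hypothetical throwaway repository, used only for this illustration.
work=$(mktemp -d)
mkdir "$work/repo"
cd "$work/repo"
git init -q .
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

# Older commit: D (to be deleted), M (to be modified), O (to be renamed).
echo delete-me > D
echo version-1 > M
echo rename-me > O
git add D M O
commit old
old=$(git rev-parse HEAD)

# Newer commit: add A, delete D, modify M, rename O to R.
echo added > A
echo version-2 > M
git rm -q D
git mv O R
git add A M
commit new
new=$(git rev-parse HEAD)

# Older commit first, newer second; no -M or -C, so the rename shows up
# as "delete O" plus "add R", and the AM filter keeps only R.
changed=$(git diff-tree -r --no-commit-id --name-only --diff-filter=AM "$old" "$new")
printf '%s\n' "$changed"

# Archive those paths from the work-tree, which has the newer commit checked out.
printf '%s\n' "$changed" | tar -cf "$work/export.tar" -T -
```

The name list is fed to tar directly on standard input via `-T -`, so xargs is not needed. Paths containing newlines or other unusual characters would need extra care that this sketch does not attempt.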

Even if you do all this, you are still set up for a pitfall. Suppose that commit hash C1 has a file named problem, and a (presumably later) commit hash C2 has instead two files named problem/A and problem/B. This implies that the original file problem was deleted between these two points, because most systems (including Git itself) forbid having both a file named problem and a directory named problem holding various files. Given that each tar-archive itself is not a complete snapshot—you omit files that are unmodified between C1 and C2—your procedure for extracting these snapshots must necessarily be additive: extract earlier snapshot, then extract later snapshot atop earlier snapshot. This process will fail at the point where file problem is in the way of creating directory problem. Obviously, you can check for such problems and remove the problematic file (you can see now why I named the file problem :-) ), but more generally, since you are not storing "delete" directives in the first place, you won't know, in a future case where you are using these archives to rebuild a snapshot, that some files don't belong in that snapshot at all.

(The classic solution to this problem is to prefix update-archives with some kind of manifest or directive. If you decide to use such a solution, then, depending on the kind of detail you want in the manifest-or-directive, you might want to do a first pass to detect exact renames and/or exact copies.)
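
One way such a manifest could be produced is sketched below. The manifest format here is simply the `--name-status` output; the file names manifest.txt and update.tar, the scratch repository, and the demo identity are all hypothetical. `-M100%` asks for exact renames only.

```shell
set -e
# Hypothetical throwaway repository, used only for this illustration.
work=$(mktemp -d)
mkdir "$work/repo"
cd "$work/repo"
git init -q .
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

# Older commit: D (to be deleted) and O (to be renamed, content unchanged).
echo delete-me > D
echo rename-me > O
git add D O
commit old
old=$(git rev-parse HEAD)

# Newer commit: delete D, rename O to R.
git rm -q D
git mv O R
commit new
new=$(git rev-parse HEAD)

# Pass 1: the manifest records deletions and exact renames (-M100%).
git diff-tree -r --name-status -M100% "$old" "$new" > "$work/manifest.txt"

# Pass 2: the archive holds added/modified paths, with no rename detection.
git diff-tree -r --name-only --diff-filter=AM "$old" "$new" |
    tar -cf "$work/update.tar" -T -

cat "$work/manifest.txt"
```

Whoever unpacks update.tar would read manifest.txt first and remove the paths marked D, along with the old name of each R entry, before extracting. That consumer is not shown here.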


[1] Obviously, newly added features can present problems for everyone, not just scripts and not just humans, but the Git folks do work hard on not creating unnecessary problems for scripts that rely on plumbing commands. Consider, for instance, the new impetus to push Git toward using some flavor of SHA-256 instead of, or in addition to, SHA-1. Since SHA-1 produces 160-bit hashes, and SHA-256 produces 256-bit hashes, these must be represented as 40 and 64 hexadecimal digits respectively. Linus suggested abbreviating 256-bit hashes to 40 characters by default, to help out existing scripts that assume 40 characters, but I foresee some problems... :-)

Community
torek
  • That's certainly a lot of information to digest - but definitely an informative read, thanks! – Brett Feb 26 '17 at 20:48
  • However, when you say I would be alright with using `--diff-filter` with `AM` and excluding `D`, I assume you mean to use the lowercase `d` as per your comment on CodeWizard's answer? – Brett Feb 26 '17 at 20:52
  • No: the point here is that `tar` cannot archive a deletion. It has no way to carry out that instruction. (And when taking files from the work-tree, as you are, the "to" commit in the diff must be the one checked-out in the work-tree as HEAD, so deleted files will just produce an error message that tar cannot find them.) Uppercase letters for `--diff-filter` are additive: `AM` means "give me added or modified" (but nothing else), or `AMT` = "added, modified, or type-changed." Lowercase letters were introduced in Git 1.8.5 and are "all-but"s, e.g., `--diff-filter=d` means all *except* deletes. – torek Feb 26 '17 at 21:37
  • (Also, that was jthill's comment. :-) ) I generally prefer the additive ones, but if you use `git diff-tree` so that you know you will never see Rename and Copy, lowercase `d` instead of uppercase `AMT` would suffice as ADMT should be the only possible outputs. If you do allow Renames, though, an earlier diff's "delete oldname, add newname" instruction pair becomes, in the new diff instructions, "rename from oldname to newname". – torek Feb 26 '17 at 21:41
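
The additive and "all-but" forms of `--diff-filter` described in these comments can be compared side by side. This is a small sketch under assumptions: the scratch repository, the demo identity, and the file names are invented, and the lowercase form needs a Git new enough to support it (1.8.5 or later, per the comment above).

```shell
set -e
# Hypothetical throwaway repository, used only for this illustration.
work=$(mktemp -d)
cd "$work"
git init -q .
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

# Older commit: one file to delete, one to modify.
echo gone > deleted.txt
echo v1 > modified.txt
git add deleted.txt modified.txt
commit old
old=$(git rev-parse HEAD)

# Newer commit: add one file, delete one, modify one.
echo fresh > added.txt
echo v2 > modified.txt
git rm -q deleted.txt
git add added.txt modified.txt
commit new
new=$(git rev-parse HEAD)

# Uppercase letters select: only Added, Modified, or Type-changed entries.
additive=$(git diff-tree -r --name-only --diff-filter=AMT "$old" "$new")
# Lowercase letters exclude: everything except Deleted entries.
all_but=$(git diff-tree -r --name-only --diff-filter=d "$old" "$new")
# The deletions themselves, for a report rather than for the archive.
deleted=$(git diff-tree -r --name-only --diff-filter=D "$old" "$new")

printf 'AMT:\n%s\n' "$additive"
printf 'd:\n%s\n' "$all_but"
printf 'D:\n%s\n' "$deleted"
```

Because git diff-tree is not asked to detect renames or copies here, the two filters select the same paths; with `-M` or `-C` enabled they could differ.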