38

I am trying to produce a list of the files that were changed in a specific commit. The problem is that every file has the version number in a comment at the top of the file, and since this commit introduces a new version, every file has changed.

I don't care about the changed comments, so I would like to have git diff ignore all lines that match ^\s*\*.*$, as these are all comments (part of /* */).

I cannot find any way to tell git diff to ignore specific lines.

I have already tried setting a textconv attribute so that Git passes the files through sed before diffing them, letting sed strip out the offending lines. The problem is that git diff --name-status does not actually diff the files; it just compares the hashes, and of course all the hashes have changed.

Is there a way to do this?

Peter Mortensen
Benubird
  • A wild guess... Did you try `git diff --name-status --textconv`? Or maybe `git diff --name-only`? – rodrigo May 13 '13 at 16:55
  • Yes, I am using --name-only, but it returns (like I said) every file, because every file has had its comments changed. --textconv does not work because, as I also said in the post, git ignores it when not producing a full diff. – Benubird May 13 '13 at 17:04
  • possible duplicate of [ignoring changes matching a string in git diff](http://stackoverflow.com/questions/15878622/ignoring-changes-matching-a-string-in-git-diff) – richvdh Jun 23 '15 at 14:08
  • @richvdh I think the questions are similar enough to be considered a duplicate, BUT they have different correct answers, and this question has additional answers making suggestions that the other Q does not have, so I believe there is value in keeping both of them. – Benubird Jun 23 '15 at 15:23
  • Related: Git 2.30 (Q1 2021) will propose [`git diff -I`](https://stackoverflow.com/a/64758633/6309). – VonC Nov 09 '20 at 20:16

8 Answers

22

Here is a solution that is working well for me. I've written up the solution and some additional missing documentation on the git (log|diff) -G<regex> option.

It is basically using the same solution as in previous answers, but specifically for comments that start with a * or a #, and sometimes a space before the *... But it still needs to allow #ifdef, #include, etc. changes.

Look-ahead and look-behind do not seem to be supported by the -G option. I also had problems with ? and *, although + seems to work well.

(Note, tested on Git v2.7.0)

Multi-Line Comment Version

git diff -w -G'(^[^\*# /])|(^#\w)|(^\s+[^\*#/])'
  • -w ignore whitespace
  • -G only show files whose diff contains an added or removed line that matches the following regex
  • (^[^\*# /]) any line that does not start with a star, a hash, a space, or a slash
  • (^#\w) any line that starts with # followed by a word character (this keeps #ifdef, #include, etc.)
  • (^\s+[^\*#/]) any line that starts with some whitespace followed by a character that is not a comment character
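The three alternatives above can be tried out on sample lines. This is only a rough sketch: it reuses the pattern in Python's `re` module, which is not the POSIX engine Git calls, and the helper name `counts_as_code` and the sample lines are made up for illustration.

```python
import re

# The -G pattern from the command above, reused verbatim as a Python regex.
# Assumption: Python's `re` behaves closely enough to Git's engine for this pattern.
PATTERN = re.compile(r"(^[^\*# /])|(^#\w)|(^\s+[^\*#/])")


def counts_as_code(line: str) -> bool:
    """Return True when the pattern matches, i.e. the line counts as a real change."""
    return PATTERN.search(line) is not None


samples = [
    "int x = 1;",                    # ordinary code
    "#include <stdio.h>",            # preprocessor directive, kept by (^#\w)
    " * Version 1.2.3",              # body of a /* */ block
    "/* Version 1.2.3",              # start of a /* */ block
    "# a shell comment",             # hash followed by a space
    "    * deeply indented comment", # four spaces before the star
]
for s in samples:
    print(f"{counts_as_code(s)!s:5}  {s!r}")
```

In this sketch the last sample matches: `\s+` can give back one space, which `[^\*#/]` then accepts. So, at least in Python, only comment lines with a single leading whitespace character are filtered by the third alternative.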

Basically, an SVN hook currently rewrites the multi-line comment blocks of every file on the way in and out. Now I can diff my changes against SVN without the FYI information that SVN drops into the comments.

Technically, this lets Python and Bash comments such as #TODO show up in the diff, and a division operator that starts a new line in C++ could be ignored:

a = b
    / c;

Also the documentation on -G in Git seemed pretty lacking, so the information here should help:

git diff -G<regex>

-G<regex>

Look for differences whose patch text contains added/removed lines that match <regex>.

To illustrate the difference between -S<regex> --pickaxe-regex and -G<regex>, consider a commit with the following diff in the same file:

+    return !regexec(regexp, two->ptr, 1, &regmatch, 0);
...
-    hit = !regexec(regexp, mf2.ptr, 1, &regmatch, 0);

While git log -G"regexec\(regexp" will show this commit, git log -S"regexec\(regexp" --pickaxe-regex will not (because the number of occurrences of that string did not change).
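The distinction can be modelled in a few lines. This is a simplified sketch, not Git's implementation: the function names are hypothetical, and the occurrence count is taken over the removed and added lines only, which gives the same difference as counting over the whole file when the context lines are unchanged.

```python
import re


def matches_G(removed, added, pattern):
    """-G style: does any added or removed line match the regex?"""
    rx = re.compile(pattern)
    return any(rx.search(line) for line in removed + added)


def matches_S(removed, added, pattern):
    """-S --pickaxe-regex style: does the number of matches change?"""
    rx = re.compile(pattern)
    before = sum(len(rx.findall(line)) for line in removed)
    after = sum(len(rx.findall(line)) for line in added)
    return before != after


# The two lines from the example diff above.
removed = ["    hit = !regexec(regexp, mf2.ptr, 1, &regmatch, 0);"]
added = ["    return !regexec(regexp, two->ptr, 1, &regmatch, 0);"]

print(matches_G(removed, added, r"regexec\(regexp"))
print(matches_S(removed, added, r"regexec\(regexp"))
```

Both lines contain the string once, so the count-based check sees no change while the line-based check does.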

See the pickaxe entry in gitdiffcore(7) for more information.

(Note, tested on Git v2.7.0)

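  • According to the source (see the update below), the -G pattern is compiled with REG_EXTENDED, i.e. as a POSIX extended regular expression, although my empirical testing did not behave quite like one.
  • In my tests, the ?, *, !, {, } regular expression syntax did not work as expected.
  • Grouping with () and OR-ing groups with | works.
  • Character-class shorthands such as \s, \W, etc. are supported.
  • Look-ahead and look-behind are not supported.
  • The beginning and ending line anchors ^ and $ work.
  • The feature has been available since Git 1.7.4.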

Excluded Files v Excluded Diffs

Note that the -G option only selects which files will be diffed.

Once a file is selected, its whole diff is shown, including the changed lines that the regex was meant to exclude.
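A small model of that file-level behaviour, as a hedged sketch: the function name, the file names, and the changed lines are invented, and the regex is run with Python's `re` rather than Git's engine.

```python
import re


def files_shown_by_G(changed_lines_by_file, pattern):
    """Keep a file, with ALL of its changed lines, if any changed line matches."""
    rx = re.compile(pattern)
    return {
        path: lines
        for path, lines in changed_lines_by_file.items()
        if any(rx.search(line) for line in lines)
    }


# Hypothetical changed lines per file.
changes = {
    "only_comments.sh": ["# version 2", "# version 3"],
    "mixed.sh": ["# version 2", "echo hello"],
}

shown = files_shown_by_G(changes, r"^[^#]")
for path, lines in shown.items():
    print(path, lines)
```

The file with only comment changes drops out, but `mixed.sh` keeps its comment change alongside the code change.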

Examples

Only show file differences with at least one line that mentions foo.

git diff -G'foo'

Show files that have at least one changed line that does not start with a #

git diff -G'^[^#]'

Show files that have differences mentioning FIXME or TODO

git diff -G'(FIXME)|(TODO)'

See also git log -G, git grep, git log -S, --pickaxe-regex, and --pickaxe-all

UPDATE: Which regular expression tool is in use by the -G option?

https://github.com/git/git/search?utf8=%E2%9C%93&q=regcomp&type=

https://github.com/git/git/blob/master/diffcore-pickaxe.c

if (opts & (DIFF_PICKAXE_REGEX | DIFF_PICKAXE_KIND_G)) {
    int cflags = REG_EXTENDED | REG_NEWLINE;
    if (DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE))
        cflags |= REG_ICASE;
    regcomp_or_die(&regex, needle, cflags);
    regexp = &regex;

// and in the regcomp_or_die function
regcomp(regex, needle, cflags);

http://man7.org/linux/man-pages/man3/regexec.3.html

   REG_EXTENDED
          Use POSIX Extended Regular Expression syntax when interpreting
          regex.  If not set, POSIX Basic Regular Expression syntax is
          used.

// ...

   REG_NEWLINE
          Match-any-character operators don't match a newline.

          A nonmatching list ([^...])  not containing a newline does not
          match a newline.

          Match-beginning-of-line operator (^) matches the empty string
          immediately after a newline, regardless of whether eflags, the
          execution flags of regexec(), contains REG_NOTBOL.

          Match-end-of-line operator ($) matches the empty string
          immediately before a newline, regardless of whether eflags
          contains REG_NOTEOL.
Peter Mortensen
phyatt
  • It looks like it is similar to "Simple Regular Expression". https://en.wikibooks.org/wiki/Regular_Expressions/Simple_Regular_Expressions – phyatt Apr 24 '17 at 12:52
  • That couldn't be completely right since it accepts some non-simple syntax such as `+` (I just tested). – Emadpres Apr 24 '17 at 13:18
  • See update near the end of my answer. I haven't successfully tested the "POSIX extended regular expressions". My empirical testing showed it not working quite the same. – phyatt Apr 24 '17 at 13:51
  • @phyatt - this does not seem to work: `git diff -G'^[^#]'`. It still shows lines starting with `#`. – Martin Vegter Aug 24 '19 at 05:54
  • @MartinVegter The syntax will still show up if the file has at least one other difference. If a file only has comment differences, the file will be excluded in the results. – phyatt Aug 24 '19 at 12:17
  • @MartinVegter, is there a way to prevent the comment changes from showing up if the file has other changes? i.e. if -G matches the change with a comment to exclude, but then there's another change in the same file later on, how can you still prevent the comment change from appearing in the `git diff` output? – bretonics Jul 13 '21 at 19:01
  • @bretonics I would try something like piping the operation into `grep -v` with the kind of line you want to ignore, but that is still a partial solution. This may be something to request to the git maintainers on their mailing list or make a pull request for it. – phyatt Jul 13 '21 at 23:28
  • Is it somehow possible to pass-through binary files (for the purposes of showing them as changed in the diffstat)? – creanion Mar 24 '22 at 10:33
15
git diff -G <regex>

And specify a regular expression that does not match your version number line.
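As a hedged sketch of what such a call could look like for the question's /* */ comments: the pattern below is my own assumption (a changed line whose first non-blank character is not a star), written with POSIX character classes, and the helper names are hypothetical. It also assumes the commit has a single parent reachable as `<commit>~1`.

```python
import subprocess

# Hypothetical pattern: a changed line whose first non-blank character is not "*".
NON_COMMENT = r"^[[:space:]]*[^[:space:]*]"


def build_command(commit):
    """Build the git argv; assumes the commit has a single parent (<commit>~1)."""
    return ["git", "diff", "--name-only", f"-G{NON_COMMENT}", f"{commit}~1", commit]


def changed_files(commit):
    """Run git in the current repository and return the listed paths."""
    result = subprocess.run(
        build_command(commit), capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


print(build_command("HEAD"))
```

Passing the arguments as a list avoids shell quoting, which may matter given how many regex characters are also shell metacharacters.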

Peter Mortensen
riezebosch
10

I found it easiest to use git difftool to launch an external diff tool:

git difftool -y -x "diff -I '<regex>'"
Peter Mortensen
richvdh
4

I found a solution. I can use this command:

git diff --numstat --minimal <commit> <commit> | sed '/^[1-]\s\+[1-]\s\+.*/d'

This shows only the files whose change is something other than exactly one line added and one line removed, which eliminates files whose only change was the version number in the comments (binary files, reported as -, are dropped as well).
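The same filter can be sketched in Python for anyone who prefers not to rely on sed. The function name and the sample rows are invented; the input is assumed to be the tab-separated text that `git diff --numstat` prints.

```python
def files_with_real_changes(numstat_output):
    """Drop rows where both counts are "1" or "-", mirroring the sed filter above."""
    kept = []
    for row in numstat_output.splitlines():
        if not row.strip():
            continue
        added, deleted, path = row.split("\t", 2)
        # "-" is what --numstat prints for binary files.
        if added in ("1", "-") and deleted in ("1", "-"):
            continue
        kept.append(path)
    return kept


# Hypothetical --numstat output: added<TAB>deleted<TAB>path
sample = (
    "1\t1\tsrc/version_only.c\n"
    "12\t3\tsrc/real_change.c\n"
    "-\t-\tlogo.png\n"
)
print(files_with_real_changes(sample))
```

Like the sed version, this drops any file with a genuine one-line change too, so it is a heuristic rather than a comment filter.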

Peter Mortensen
Benubird
2

Using 'grep' on the 'git diff' output,

git diff -w | grep -c -E "(^[+-]\s*(\/)?\*)|(^[+-]\s*\/\/)"

comment line changes alone can be calculated. (A)

Using 'git diff --stat' output,

git diff -w --stat

all line changes can be calculated. (B)

To get the non-comment source line (NCSL) change count, subtract (A) from (B).

Explanation:

In the 'git diff ' output (in which whitespace changes are ignored),

  • Look for a line that starts with either '+' or '-', which marks a modified line.
  • Optional whitespace characters may follow, matched by '\s*'.
  • Then look for a comment-line pattern: '/*', a bare '*', or '//'.
  • Since the '-c' option is given to grep, only the count is printed. Remove '-c' to see the comment lines themselves in the diff.
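The (B) minus (A) arithmetic can be sketched on a diff held in a string. This is an illustration under assumptions: the function name and the sample diff are made up, the grep pattern is reused as a Python regex, and file-header lines are recognised by a simple prefix test that a removed line beginning with "-- " would fool.

```python
import re

# Same pattern as the grep command above.
COMMENT_CHANGE = re.compile(r"(^[+-]\s*(/)?\*)|(^[+-]\s*//)")
# Simplification: treat "--- " and "+++ " lines as file headers.
FILE_HEADER = re.compile(r"^(\+\+\+|---) ")


def count_changes(diff_text):
    """Return (all changed lines, comment changes, non-comment changes)."""
    total = comment = 0
    for line in diff_text.splitlines():
        if FILE_HEADER.match(line) or not line.startswith(("+", "-")):
            continue
        total += 1
        if COMMENT_CHANGE.search(line):
            comment += 1
    return total, comment, total - comment


# Hypothetical diff of a C header.
sample = "\n".join([
    "--- a/abc.h",
    "+++ b/abc.h",
    "@@ -1,4 +1,4 @@",
    "-/* Version 1 */",
    "+/* Version 2 */",
    "  * unchanged",
    "-#define FOO 7",
    "+#define FOO 105",
])
print(count_changes(sample))
```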

NOTE: There can be minor errors in the comment line count due to the following assumptions, so the result should be taken as a ballpark figure.

  • 1.) Source files are based on the C language. Makefile and shell script files have a different convention, '#', to denote the comment lines and if they are part of diffset, their comment lines won't be counted.

  • 2.) The Git convention of line change: If a line is modified, Git sees it as that particular line is deleted and a new line is inserted there and it may look like two lines are changed whereas in reality one line is modified.

     In the below example, the new definition of 'FOO' looks like a two-line change.
    
     $  git diff --stat -w abc.h
     ...
     -#define FOO 7
     +#define FOO 105
     ...
     1 files changed, 1 insertions(+), 1 deletions(-)
     $
    
  • 3.) Valid comment lines not matching the pattern (or) Valid source code lines matching the pattern can cause errors in the calculation.

In the below example, the "+ blah blah" line which doesn't start with '*' won't be detected as a comment line.

           + /*
           +  blah blah
           + *
           + */

In the below example, the "+ *ptr" line will be counted as a comment line as it starts with *, though it is a valid source code line.

            + printf("\n %p",
            +         *ptr);
Peter Mortensen
1

For most languages, to do it correctly, you have to parse the original source file (for example into an AST) and exclude comments that way.

One reason is that the start of multi-line comments might not be covered by the diff. Another reason is that language-parsing isn't trivial, and there are often things that can trip up a naive parser.

I was going to do that for Python, but string hacking was good enough for my needs.
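For reference, a minimal sketch of the parse-based approach for Python, assuming both versions of the file are available as strings (the function name and sample sources are hypothetical). The standard `ast` module discards comments, so two sources that differ only in comments or formatting produce the same dump. Docstrings are part of the tree, so a docstring change still counts as a change here.

```python
import ast


def same_code(old_source, new_source):
    """True when both sources parse to the same AST (comments are not in the AST)."""
    return ast.dump(ast.parse(old_source)) == ast.dump(ast.parse(new_source))


old = "# version 1\nx = 1  # set x\n"
new = "# version 2\nx = 1\n"
changed = "# version 2\nx = 2\n"

print(same_code(old, new))
print(same_code(old, changed))
```

This needs the full file contents from both commits rather than a diff, which is the trade-off against the string-based filter.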

For Python, you can ignore comments and attempt to ignore docstrings using a custom filter, such as this:


#!/usr/bin/env python

import sys
import re
import configparser
from fnmatch import fnmatch
from unidiff import PatchSet

EXTS = ["py"]


class Opts:  # pylint: disable=too-few-public-methods
    debug = False
    exclude = []


def filtered_hunks(fil):
    path_re = ".*[.](%s)$" % "|".join(EXTS)
    for patch in PatchSet(fil):
        if not re.match(path_re, patch.path):
            continue
        excluded = False
        if Opts.exclude:
            if Opts.debug:
                print(">", patch.path, "=~", Opts.exclude)
            for ex in Opts.exclude:
                if fnmatch(patch.path, ex):
                    excluded = True
        if excluded:
            continue
        for hunk in patch:
            yield hunk


class Typ:  # pylint: disable=too-few-public-methods
    LINE = "."
    COMMENT = "#"
    DOCSTRING = "d"
    WHITE = "w"


def classify_lines(fil):
    for hunk in filtered_hunks(fil):
        yield from classify_hunk(hunk)


def classify_line(lval):
    """Classify a single python line, noting comments, best efforts at docstring start/stop and pure-whitespace."""
    lval = lval.rstrip("\n\r")
    remaining_lval = lval
    typ = Typ.LINE
    if re.match(r"^ *$", lval):
        return Typ.WHITE, None, ""

    if re.match(r"^ *#", lval):
        typ = Typ.COMMENT
        remaining_lval = ""
    else:
        slug = re.match(r"^ *(\"\"\"|''')(.*)", lval)
        if slug:
            remaining_lval = slug[2]
            slug = slug[1]
            return Typ.DOCSTRING, slug, remaining_lval
    return typ, None, remaining_lval


def classify_hunk(hunk):
    """Classify lines of a python diff-hunk, attempting to note comments and docstrings.

    Ignores context lines.
    Docstring detection is not guaranteed (changes in the middle of large docstrings won't have starts.)
    Using ast would fix, but seems like overkill, and cannot be done on a diff-only.
    """

    p = ""
    prev_typ = 0
    pslug = None
    for line in hunk:
        lval = line.value
        lval = lval.rstrip("\n\r")
        typ = Typ.LINE
        naive_typ, slug, remaining_lval = classify_line(lval)
        if p and p[-1] == "\\":
            typ = prev_typ
        else:
            if prev_typ != Typ.DOCSTRING and naive_typ == Typ.COMMENT:
                typ = naive_typ
            elif naive_typ == Typ.DOCSTRING:
                if prev_typ == Typ.DOCSTRING and pslug == slug:
                    # remainder of line could have stuff on it
                    typ, _, _ = classify_line(remaining_lval)
                else:
                    typ = Typ.DOCSTRING
                    pslug = slug
            elif prev_typ == Typ.DOCSTRING:
                # continue docstring found in this context/hunk
                typ = Typ.DOCSTRING

        p = lval
        prev_typ = typ

        if typ == Typ.DOCSTRING:
            if re.match(r"(%s) *$" % pslug, remaining_lval):
                prev_typ = Typ.LINE

        if line.is_context:
            continue

        yield typ, lval


def count_lines(fil):
    """Totals changed lines of python code, attempting to strip comments and docstrings.

    Deletes/adds are counted equally.
    Could miss some things, don't rely on exact counts.
    """

    count = 0

    for (typ, line) in classify_lines(fil):
        if Opts.debug:
            print(typ, line)
        if typ == Typ.LINE:
            count += 1

    return count


def main():
    Opts.debug = "--debug" in sys.argv
    Opts.exclude = []

    use_covrc = "--covrc" in sys.argv

    if use_covrc:
        config = configparser.ConfigParser()
        config.read(".coveragerc")
        cfg = {s: dict(config.items(s)) for s in config.sections()}
        # Default to an empty string so .split() below also works without a .coveragerc
        exclude = cfg.get("report", {}).get("omit", "")
        Opts.exclude = [f.strip() for f in exclude.split("\n") if f.strip()]

    for i in range(len(sys.argv) - 1):  # stop one early so sys.argv[i + 1] always exists
        if sys.argv[i] == "--exclude":
            Opts.exclude.append(sys.argv[i + 1])

    if Opts.debug and Opts.exclude:
        print("--exclude", Opts.exclude)

    print(count_lines(sys.stdin))


example = '''
diff --git a/cryptvfs.py b/cryptvfs.py
index c68429cf6..ee90ecea8 100755
--- a/cryptvfs.py
+++ b/cryptvfs.py
@@ -2,5 +2,17 @@

 from src.main import proc_entry

-if __name__ == "__main__":
-    proc_entry()
+
+
+class Foo:
+    """some docstring
+    """
+    # some comment
+    pass
+
+class Bar:
+    """some docstring
+    """
+    # some comment
+    def method():
+        line1 + 1
'''


def strio(s):
    import io

    return io.StringIO(s)


def test_basic():
    assert count_lines(strio(example)) == 10


def test_main(capsys):
    sys.argv = []
    sys.stdin = strio(example)
    main()
    cap = capsys.readouterr()
    print(cap.out)
    assert cap.out == "10\n"


def test_debug(capsys):
    sys.argv = ["--debug"]
    sys.stdin = strio(example)
    main()
    cap = capsys.readouterr()
    print(cap.out)
    assert Typ.DOCSTRING + '     """some docstring' in cap.out


def test_exclude(capsys):
    sys.argv = ["--exclude", "cryptvfs.py"]
    sys.stdin = strio(example)
    main()
    cap = capsys.readouterr()
    print(cap.out)
    assert cap.out == "0\n"


def test_covrc(capsys):
    sys.argv = ["--covrc"]
    sys.stdin = strio(example)
    main()
    cap = capsys.readouterr()
    print(cap.out)
    assert cap.out == "10\n"


if __name__ == "__main__":
    main()

That code can be trivially modified to produce filenames, rather than counts.

But it can, of course, mistakenly count parts of docstrings as "code" (which they are not, for things like coverage).

adamency
Erik Aronesty
0

Perhaps a Bash script like this:

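#!/bin/bash
# List files by the number of diff lines that remain after the textconv filter.
git diff --name-only "$@" | while IFS= read -r FPATH ; do
    # Revisions and options go before "--", the path goes after it.
    LINES_COUNT=$(git diff --textconv "$@" -- "$FPATH" | sed '/^[1-]\s\+[1-]\s\+.*/d' | wc -l)
    if [ "$LINES_COUNT" -gt 0 ] ; then
        echo -e "$LINES_COUNT\t$FPATH"
    fi
done | sort -n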
Peter Mortensen
saeedgnu
0

I use meld as the tool to ignore comments by setting its options, then use meld as difftool:

git difftool --tool=meld -y
buffy