Filtering a diff with a regular expression

Question

It seems that it would be extremely handy to be able to filter a diff so that trivial changes are not displayed. I would like to write a regular expression which would be run on the line and then pass it another string that uses the captured arguments to generate a canonical form. If the lines before and after produce the same output, then they would be removed from the diff.

For example, I am working on a PHP code base where a significant number of array accesses are written as my_array[my_key] when they should be my_array["my_key"] to prevent issues if a my_key constant is defined. It would be useful to generate a diff that omits lines where the only change was the added quotes.

I can't change them all at once, as we don't have the resources to test the entire code base, so I am fixing this whenever I make a change to a function. How can I achieve this? Is there anything similar that I can use to achieve a comparable result? For example, a simpler method might be to skip the canonical form and just check whether the input is transformed into the output. BTW, I am using Git.

"I can't change them all at once, as we don't have the resources to test the entire code base," - unless you actually have constants which might be used as array keys, you can safely replace `[key]` with `['key']` without much testing. Testing every file for parse errors is not that much work; you can easily automate it using `find` and the command-line `php` binary. — ThiefMaster, Nov 21 '11 at 23:41
@ThiefMaster: We may have constants that match keys - our codebase is huge! — Casebash, Nov 21 '11 at 23:45
Searching for `define(` could give you a list of all constants, then searching for `[nameofyourconstant` for all constants (again, easy to automate) would show you if/where they are used as array keys. — ThiefMaster, Nov 21 '11 at 23:46
@ThiefMaster: That sounds like a really good solution. Regardless, I am interested in whether there is a way to filter diffs — Casebash, Nov 21 '11 at 23:52
Related: https://stackoverflow.com/questions/12462538/how-to-grep-the-git-diff/76412336 — Marc Durdin, Jun 06 '23 at 07:45

score 10 · Answer 1 · answered Feb 16 '16 at 14:06

grepdiff can be used to filter the hunks in the diff file.

$ git diff -U1 | grepdiff 'console' --output-matching=hunk

It shows only the hunks that match the given pattern, here the string "console".
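To make the idea of hunk-level filtering concrete, here is a minimal Python sketch that keeps only the hunks of a git-style diff in which an added or removed line matches a regular expression. It is an approximation written for illustration, not how grepdiff is implemented: the function name `filter_hunks` is made up, it assumes every file section starts with a `diff --git` line, and matching only on changed lines is an assumption that may differ from grepdiff's exact rules.

```python
import re


def filter_hunks(diff_text, pattern):
    """Return the diff with only those hunks whose changed lines match pattern."""
    regex = re.compile(pattern)
    out = []                # lines that will be emitted
    header = []             # header lines of the file currently being read
    hunk = []               # lines of the hunk currently being read
    header_emitted = False  # whether the current file header is already in out

    def flush_hunk():
        nonlocal header_emitted
        # A hunk is kept if any added or removed line matches the pattern.
        keep = any(
            line[:1] in ("+", "-") and regex.search(line[1:])
            for line in hunk[1:]
        )
        if keep:
            if not header_emitted:
                out.extend(header)
                header_emitted = True
            out.extend(hunk)
        hunk.clear()

    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            flush_hunk()
            header = [line]
            header_emitted = False
        elif line.startswith("@@"):
            flush_hunk()
            hunk.append(line)
        elif hunk:
            hunk.append(line)
        else:
            header.append(line)
    flush_hunk()
    return "\n".join(out)
```

Feeding it the text of `git diff -U1` together with the pattern `console` should give a result in the same spirit as the command above, with file headers kept only for files that still have at least one hunk.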

Naga Kiran

score 10 · Answer 2 · edited May 23 '17 at 12:32

$ git diff --help

-G<regex>
    Look for differences whose added or removed line matches the given <regex>.

EDIT:

After some testing, I came up with something like:

git diff -b -w --word-diff-regex='.*\[[^"]*\]'

This produced output like:

diff --git a/test.php b/test.php
index 62a2de0..b76891f 100644
--- a/test.php
+++ b/test.php
@@ -1,3 +1,5 @@
<?php

{+$my_array[my_key]+} = "test";

?>
diff --git a/test1.php b/test1.php
index 62a2de0..6102fed 100644
--- a/test1.php
+++ b/test1.php
@@ -1,3 +1,5 @@
<?php

some_other_stuff();

?>

Maybe this will help you. I found it at http://www.rhinocerus.net/forum/lang-lisp/659593-git-word-diff-regex-lisp-source.html, and that thread has more information.

EDIT2:

git diff -G'\[[A-Za-z_]*\]' --pickaxe-regex

answered Nov 21 '11 at 23:20

Hauleth

That actually didn't display any of the options for me. Are they the same as the normal diff? – Casebash Nov 21 '11 at 23:28
@Casebash: see my question on that topic: http://stackoverflow.com/questions/5088907/how-do-i-use-git-diff-g – eckes Nov 24 '11 at 06:36
+1 for being interesting, but it isn't quite what I'm looking for – Casebash Nov 24 '11 at 12:10
I've found `--word-diff-regex` param so maybe it will be helpful. – Hauleth Nov 24 '11 at 13:17
Thanks, this is very interesting. It shows all changes to a word that either originally matched the pattern or was changed to match that pattern with all other changes treated as whitespace changes. So it is a better solution than -G when you want to see only changes matching a pattern, but still doesn't solve the filtering problem – Casebash Nov 26 '11 at 00:48

score 7 · Accepted Answer · answered Nov 28 '11 at 03:46

There do not seem to be any options to Git's diff command that support what you want to do. However, you could use the GIT_EXTERNAL_DIFF environment variable and a custom script (or any executable created using your preferred scripting or programming language) to manipulate a patch.

I'll assume you are on Linux; if not, you could tweak this concept to suit your environment. Let's say you have a Git repo where HEAD has a file file05 that contains:

line 26662: $my_array[my_key]

And a file file06 that contains:

line 19768: $my_array[my_key]
line 19769: $my_array[my_key]
line 19770: $my_array[my_key]
line 19771: $my_array[my_key]
line 19772: $my_array[my_key]
line 19773: $my_array[my_key]
line 19775: $my_array[my_key]
line 19776: $my_array[my_key]

You change file05 to:

line 26662: $my_array["my_key"]

And you change file06 to:

line 19768: $my_array[my_key]
line 19769: $my_array["my_key"]
line 19770: $my_array[my_key]
line 19771: $my_array[my_key]
line 19772: $my_array[my_key]
line 19773: $my_array[my_key]
line 19775: $my_array[my_key2]
line 19776: $my_array[my_key]

Save the following shell script as mydiff.sh, make it executable, and place it somewhere in your PATH:

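#!/bin/bash
# Git passes seven arguments to an external diff program:
# path old-file old-hex old-mode new-file new-hex new-mode
echo "$@"
# ${5} is new-file, which is the working tree path for unstaged changes.
git diff-files --patch --word-diff=porcelain "${5}" | awk '
# Remember a removed word and the record number it appeared on.
/^-./ {rec = FNR; prev = substr($0, 2);}
# An added word directly after a removed one: compare canonical forms.
FNR == rec + 1 && /^\+./ {
    ln = substr($0, 2);
    # Canonical form: drop the quotes just inside the brackets.
    gsub("\\[\"", "[", ln);
    gsub("\"\\]", "]", ln);
    if (prev == ln) {
        print " " ln;
    } else {
        print "-" prev;
        print "+" ln;
    }
}
# Pass every other line through. Because rec starts out unset,
# record 1 (the "diff --git" header) is skipped too.
FNR != rec && FNR != rec + 1 {print;}
'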

Executing the command:

GIT_EXTERNAL_DIFF=mydiff.sh git --no-pager diff

Will output:

file05 /tmp/r2aBca_file05 d86525edcf5ec0157366ea6c41bc6e4965b3be1e 100644 file05 0000000000000000000000000000000000000000 100644
index d86525e..c2180dc 100644
--- a/file05
+++ b/file05
@@ -1 +1 @@
 line 26662: 
 $my_array[my_key]
~
file06 /tmp/2lgz7J_file06 d84a44f9a9aac6fb82e6ffb94db0eec5c575787d 100644 file06 0000000000000000000000000000000000000000 100644
index d84a44f..bc27446 100644
--- a/file06
+++ b/file06
@@ -1,8 +1,8 @@
 line 19768: $my_array[my_key]
~
 line 19769: 
 $my_array[my_key]
~
 line 19770: $my_array[my_key]
~
 line 19771: $my_array[my_key]
~
 line 19772: $my_array[my_key]
~
 line 19773: $my_array[my_key]
~
 line 19775: 
-$my_array[my_key]
+$my_array[my_key2]
~
 line 19776: $my_array[my_key]
~

This output does not show changes for the added quotes in file05 and file06. The external diff script basically uses the Git diff-files command to create the patch and filters the output through a GNU awk script to manipulate it. This sample script does not handle all the different combinations of old and new files mentioned for GIT_EXTERNAL_DIFF nor does it output a valid patch, but it should be enough to get you started.

You could use Perl regular expressions, Python difflib or whatever you're comfortable with to implement an external diff tool that suits your needs.
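As one possible starting point for the Python difflib route, here is a minimal sketch of the canonical-form idea from the question: each line is rewritten with a regular expression and a replacement string, the two files are compared on those canonical forms, and the original lines are reported only where the canonical forms differ. The function name `canonical_diff` and the pattern shown afterwards are assumptions made for this example, and the output is a bare list of `-`/`+` lines rather than a valid patch.

```python
import difflib
import re


def canonical_diff(old_text, new_text, pattern, replacement):
    """List changed lines, ignoring lines that are equal once canonicalized."""
    regex = re.compile(pattern)
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # Compare the canonical forms, but report the original lines.
    old_canon = [regex.sub(replacement, line) for line in old_lines]
    new_canon = [regex.sub(replacement, line) for line in new_lines]
    matcher = difflib.SequenceMatcher(None, old_canon, new_canon, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical after canonicalization: treated as unchanged
        out.extend("-" + line for line in old_lines[i1:i2])
        out.extend("+" + line for line in new_lines[j1:j2])
    return out
```

For the PHP case, calling it with the pattern `\[["']?(\w+)["']?\]` and the replacement `["\1"]` treats a bare key and a quoted key as the same line, so for file06 above only the my_key2 change would be reported. To plug this into GIT_EXTERNAL_DIFF, a wrapper would still have to read the old-file and new-file arguments that Git passes in.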

score 5 · Answer 4 · edited Jan 10 '12 at 08:15

From my own git diff --help:

--word-diff-regex=<regex>

Use <regex> to decide what a word is, instead of considering runs of non-whitespace to be a word. Also implies --word-diff unless it was already enabled. Every non-overlapping match of the <regex> is considered a word. Anything between these matches is considered whitespace and ignored(!) for the purposes of finding differences. You may want to append |[^[:space:]] to your regular expression to make sure that it matches all non-whitespace characters. A match that contains a newline is silently truncated(!) at the newline. The regex can also be set via a diff driver or configuration option, see gitattributes(1) or git-config(1). Giving it explicitly overrides any diff driver or configuration setting. Diff drivers override configuration settings.
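As a rough illustration of the rule quoted above (every match is a word, and everything between matches is ignored), here is a small Python sketch. It only emulates the described behaviour and is not Git's implementation; the helper names `words` and `same_words` are made up for this example, and Python's `re` syntax differs from the POSIX regular expressions Git uses, so `\S` stands in for `[^[:space:]]` here.

```python
import re


def words(line, word_regex):
    """Return the non-empty, non-overlapping matches, i.e. the 'words' of a line."""
    return [m.group(0) for m in re.finditer(word_regex, line) if m.group(0)]


def same_words(old_line, new_line, word_regex):
    """True if both lines consist of the same sequence of words."""
    return words(old_line, word_regex) == words(new_line, word_regex)
```

With the word regex `\w+`, the brackets and quotes fall between matches, so `$my_array[my_key]` and `$my_array["my_key"]` yield the same word list and would count as unchanged. With `\w+|\S`, the analogue of appending `|[^[:space:]]`, the quotes become words of their own and the two lines differ.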

I don't know about you, but in my opinion, this should be the accepted answer. :) — jirislav, Apr 13 '18 at 07:08
After the earlier answers didn't quite work, I tried `git diff --word-diff-regex=^\s` and got exactly the result I wanted. — Bill Naylor, Sep 07 '20 at 08:54

score 1 · Answer 5 · answered Mar 12 '21 at 09:40

I provided an answer to a similar question here.

I made a git function to do this with the regex as the only input. Just enter this into your gitconfig (I used my global gitconfig) and use it with git regexadd <regex>.

[alias]
        regexadd = "!f() { git diff -U0 \
                | grepdiff -E \"$1\" --output-matching=hunk \
                | git apply --cached --unidiff-zero; }; f"

score 1 · Answer 6 · answered Nov 26 '11 at 11:48

Normalize the input files in a first step, then compare the normalized files. This gives you the most control over the process. E.g. you might want to only apply the regexp to non-HTML parts of the code, not inside of strings, not inside of comments (or ignore comments altogether). Computing a diff on the normalized code is the proper way to do such things; working with regexps on single lines is much more error-prone and at best a hack.
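A minimal sketch of this normalize-then-compare approach in Python, assuming the normalization can be expressed as an ordered list of regular expression rules. The names `RULES`, `normalize` and `normalized_diff` are hypothetical, and the two rules are only examples; a real normalizer that respects strings and comments would need an actual parser, as this answer points out.

```python
import difflib
import re

# Hypothetical normalization rules: (pattern, replacement) pairs applied in order.
RULES = [
    # Strip trailing whitespace.
    (re.compile(r"[ \t]+$"), ""),
    # Rewrite bare or quoted identifier keys to one double-quoted form.
    (re.compile(r"\[\s*(['\"]?)([A-Za-z_]\w*)\1\s*\]"), r'["\2"]'),
]


def normalize(text):
    """Apply every rule to every line and return the normalized lines."""
    lines = []
    for line in text.splitlines():
        for regex, replacement in RULES:
            line = regex.sub(replacement, line)
        lines.append(line)
    return lines


def normalized_diff(old_text, new_text, name="file"):
    """Unified diff of the normalized texts; an empty list means no real change."""
    return list(difflib.unified_diff(
        normalize(old_text),
        normalize(new_text),
        fromfile="a/" + name,
        tofile="b/" + name,
        lineterm="",
    ))
```

Note that the resulting diff shows the normalized lines rather than the original ones, so it is suited to reviewing changes, not to producing a patch that applies to the real files.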

Some diff utilities such as meld allow hiding "insignificant" differences, and come with a set of default patterns to e.g. hide whitespace-only changes. This is pretty much what you want, I guess.

score 0 · Answer 7 · answered Nov 24 '11 at 06:40

I use an approach that combines git diff with matching a regular expression against the result. In some testing code (Perl), I know that a test is successful when the OutputFingerprint stored in the files it produces has not changed.

First, I do a

my $matches = `git diff -- mytestfile`;

and then evaluate the result:

if($matches =~ /OutputFingerprint/){
  fail();
  return 1;
}else{
  ok();
  return 0;
}
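The same check can be sketched in Python. One deliberate difference from the Perl snippet above: this version looks only at added and removed lines, so an OutputFingerprint that merely appears as unchanged context around another change does not count as a match. The function names are made up for this example, and `git_diff` assumes that `git` is on the PATH and that it is run inside a repository.

```python
import re
import subprocess


def changed_lines_match(diff_text, pattern):
    """True if any added or removed line of a unified diff matches pattern."""
    regex = re.compile(pattern)
    for line in diff_text.splitlines():
        # Skip the ---/+++ file headers. Caveat: this also skips a changed
        # line whose own content starts with "--" or "++".
        if line.startswith(("+++", "---")):
            continue
        if line[:1] in ("+", "-") and regex.search(line[1:]):
            return True
    return False


def git_diff(path):
    """Return the output of `git diff -- path` for the current repository."""
    result = subprocess.run(
        ["git", "diff", "--", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A test would then fail when `changed_lines_match(git_diff("mytestfile"), "OutputFingerprint")` returns True.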

score -4 · Answer 8 · edited Nov 26 '11 at 21:52

If the goal is to minimize trivial differences, you might consider our SmartDifferencer tool.

These tools compare the language syntax, not the layout, so many trivial changes (layout, modified comments, even changed radix on numbers) are ignored and not reported. Each tool has a full language parser; there's a version for many languages, including PHP.

It won't handle the example $FOO[abc] as being "semantically identical" to $FOO["abc"], because they are not. If abc actually is defined as a constant, then $FOO["abc"] is not semantically equivalent.
