I use git's word diffing to find changes between texts on a per-character basis:

git diff --word-diff=porcelain --word-diff-regex='\[[^]]*\]?|.' --no-index original.txt changed.txt

(If you're wondering, the custom regex I use ensures that characters within brackets are never broken up – credit to jthill.)

The resulting diff does not indicate deletions or additions of newlines (neither with nor without my custom regex). And when I replace a newline with, say, a space, it only indicates the addition of the space, not the deletion of the newline.

Given the following original

foo

bar

baz

and the following changed text (I removed one line break in the top half and added one in the bottom half)

foo
bar


baz

I get this porcelain-style diff, where ~ represents newlines:

@@ -1,5 +1,5 @@
 foo
~
~
 bar
~
 
~
~
 baz
~

But I want the following diff:

@@ <whatever> @@
 foo
-\n
~
 bar
~
~
+\n
 baz
~

I have tried adding |\n to my regex, to no avail. (Btw git uses POSIX "extended" regular expressions.) The docs say that "[a] match that contains a newline is silently truncated(!) at the newline." I don't fully understand what this means but I suspect it could be the cause of the issue.

Is there any way to get git to produce the desired diff?

Dennis Hackethal
  • The way word diff works is a hack: it post-processes the line-oriented regular diff output. That line-oriented output already treats newlines specially, so word-diff has to discard anything that tries to make use of newlines in any other way. I don't think you can get what you want from Git's word-diff. – torek Nov 22 '22 at 10:28
  • I think if you really need this you're going to need to *pre*process the text, adding newlines after every whitespace stretch in a line and another after every newline. Then the ordinary line diff will be diffing words+any trailing white space and original-source newlines will show up as a completely empty line. Then do the ordinary diff, then reconstruct how the diff would have looked in the original source. This seems like an awful lot of effort and expense to check for a single character at a known place in the file. – jthill Nov 26 '22 at 15:13

1 Answer


As it stands, Git does not allow newlines to be words [1]. I'm hoping there's a more elegant solution that only involves tweaking the git settings, but in the meantime here's a preprocessor-based workaround:

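# Mark every newline with U+FFFD so that git sees it as a diffable character.
# (-z is a GNU sed extension; echo -ne '\ufffd' relies on bash's builtin echo.)
sed -ze "s/\n/$(echo -ne '\ufffd')\n/g" original.txt > temp1.txt
sed -ze "s/\n/$(echo -ne '\ufffd')\n/g" changed.txt > temp2.txt
# Diff the marked files, drop markers that do not directly follow a + or -, and print the rest as a literal \n.
git diff --word-diff=porcelain --word-diff-regex='\[[^]]*\]?|.' --no-index temp1.txt temp2.txt | sed -zE "s/([^\+\-])$(echo -ne '\ufffd')/\1/g" | sed -ze "s/$(echo -ne '\ufffd')\n~\?/\\\n/g"
# Remove the temporary files.
rm -f temp1.txt temp2.txt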

In short, the script:

  1. Replaces "\n" with "\ufffd\n" (it inserts a temporary Unicode character before every newline) and writes the results to temp1.txt and temp2.txt.
    • According to [2], there is not yet a known way to git diff two string inputs with the latest version of git (without requiring .git), which is why temporary files are used rather than a one-liner.
  2. Runs git diff on the two files, then post-processes the output:
    • Removes every "\ufffd" that does not directly follow a "+" or "-".
    • Replaces the remaining ones with a literal "\n".
  3. Cleans up the intermediate files.
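To avoid overwriting an existing temp1.txt or temp2.txt, the same steps could be wrapped in a function that works in a throwaway directory. This is only a sketch: the names mark_newlines and newline_diff are made up for illustration, and it assumes bash, GNU sed (for -z), mktemp, and git on the PATH. The sed and git invocations are the ones from the script above, with the marker written as its UTF-8 bytes via printf instead of echo -ne.

```shell
#!/usr/bin/env bash
# Sketch only: mark_newlines and newline_diff are hypothetical helper names.
# Assumes GNU sed (for -z), mktemp, and git.

# U+FFFD written as its three UTF-8 bytes, so no \u support is needed.
marker=$(printf '\357\277\275')

# Step 1: insert the marker before every newline of the given file.
mark_newlines() {
  sed -z "s/\n/${marker}\n/g" "$1"
}

# Steps 1-3 for two files: mark, diff, post-process, clean up.
newline_diff() {
  local tmp
  tmp=$(mktemp -d) || return 1
  mark_newlines "$1" > "$tmp/a.txt"
  mark_newlines "$2" > "$tmp/b.txt"
  # Step 2: same git diff and sed post-processing as in the script above.
  git diff --word-diff=porcelain --word-diff-regex='\[[^]]*\]?|.' \
      --no-index "$tmp/a.txt" "$tmp/b.txt" \
    | sed -zE "s/([^\+\-])${marker}/\1/g" \
    | sed -ze "s/${marker}\n~\?/\\\n/g"
  # Step 3: remove the temporary directory.
  rm -rf "$tmp"
}
```

With that in place, newline_diff original.txt changed.txt would run the whole pipeline. The diff header will show the temporary paths rather than the original file names.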

Output:

@@ -1,5 +1,5 @@
 foo
~
-\n
 bar
~
 
~
+\n
 baz
~

Assumption: the Unicode character U+FFFD must not already occur in the input files, which makes the solution less elegant.
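That assumption can be checked before running anything. A minimal sketch, where contains_marker is a hypothetical helper name and the marker is again U+FFFD written as its UTF-8 bytes:

```shell
#!/usr/bin/env bash
# Sketch only: contains_marker is a hypothetical helper name.

# U+FFFD as its three UTF-8 bytes.
marker=$(printf '\357\277\275')

# Succeeds (exit status 0) if any of the given files already contains the marker.
# LC_ALL=C makes grep compare raw bytes; -F treats the marker as a fixed string.
contains_marker() {
  LC_ALL=C grep -qF -- "$marker" "$@"
}

# Example use (file names as in the question):
#   contains_marker original.txt changed.txt && echo "marker already present" >&2
```

If the check succeeds, a different marker character would have to be used throughout the script.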

Git placed "-\n" at the second "~" rather than the first; that is git's own alignment of the change, not something the post-processing alters.

Appending | sed -ze "s/\(\n *\)\+/\n/g" to the end of the git diff pipeline (the third command) removes the doubled whitespace line in the middle, but this again deviates from git diff's natural output.


For additional research, word boundaries are computed in git's code at [3], which is called at [4] where the \n delimiter is hardcoded.

Simeon