
I'm trying to obtain the changes between commits for a large number of HTML documents, but I quickly noticed that most changes are unimportant: they usually come from logging, version strings changed to prevent caching, or external scripts. For example:

<a class="support-ga" target="_blank" href="#">0fb63cacd50e / 0fb63cacd50e @ 
-app-151</a>
+app-107</a>
<input type='hidden' name='csrfmiddlewaretoken' 
-value='82NB5DdySoICu1mqcl0RZVk5dMCOVEQd'
+value='a0zBgxBevaBugotGpNKI6kMPsIsBbH44'
/>

As the example shows, reviewing changes like these is neither interesting nor useful.

I would like to know if there is a git diff option to ignore that kind of change. An alternative would be a ranking of the differences based on similarity. So far I have been running `git diff --word-diff=porcelain --unified=0 HEAD~1 HEAD` and then processing the output to extract the changes, calculate the Levenshtein distance and remove duplicates. That helps, but it is not a great solution, considering that git already knows which lines are supposed to be compared and provides a configurable number of context lines.
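The post-processing described above could be sketched roughly as follows. This is only an illustration under assumptions: `rank_word_changes` is a hypothetical helper name, it pairs each removed word with the added word that directly follows it in the porcelain output, and it uses a plain dynamic-programming Levenshtein distance in awk.

```shell
# Hypothetical helper: reads `git diff --word-diff=porcelain` output on stdin and
# prints "distance<TAB>removed<TAB>added", deduplicated, smallest distance first.
rank_word_changes() {
  awk '
    function min3(a, b, c) { return a < b ? (a < c ? a : c) : (b < c ? b : c) }
    # Levenshtein distance between s and t (i, j, m, n, cost, d are locals).
    function lev(s, t,    i, j, m, n, cost, d) {
      m = length(s); n = length(t)
      for (i = 0; i <= m; i++) d[i, 0] = i
      for (j = 0; j <= n; j++) d[0, j] = j
      for (i = 1; i <= m; i++)
        for (j = 1; j <= n; j++) {
          cost = (substr(s, i, 1) == substr(t, j, 1)) ? 0 : 1
          d[i, j] = min3(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + cost)
        }
      return d[m, n]
    }
    # Removed text; the "---" file header is skipped.
    /^-/ && !/^---/ { old = substr($0, 2); have_old = 1; next }
    # Added text right after removed text forms one change pair.
    /^[+]/ && !/^[+][+][+]/ {
      if (have_old) printf "%d\t%s\t%s\n", lev(old, substr($0, 2)), old, substr($0, 2)
      have_old = 0; next
    }
    { have_old = 0 }
  ' | sort -u | sort -n
}
```

It could be used as `git diff --word-diff=porcelain --unified=0 HEAD~1 HEAD | rank_word_changes`. Small distances between long strings (a bumped version number, for instance) then show up first, while pairs that recur across documents collapse into a single row.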

r_31415
  • Not sure what your use case is - Logs and generated code like this are usually never part of the git repo. Anything generated during build/runtime is usually ignored to avoid this exact problem. – TheGeorgeous Mar 19 '17 at 01:30
  • Sure. Obviously, the large number of HTML documents are not part of a git repo. Most of them were downloaded via web scraping and the web sites use javascript heavily. By the way, I have seen that some git clients already offer a similarity measure, but I believe that is only for informative purposes. – r_31415 Mar 19 '17 at 01:51
  • The "similarity index" is built in to Git and is used for rename detection: see http://stackoverflow.com/a/21292993/1256452 (see the comments in particular: it's based on lines, minus white space, but broken into 60-character fragments for long lines or binary files). – torek Mar 19 '17 at 07:19
  • Interesting. Then that "similarity index" is not very useful (even if it could be used) here because each line will have a different hash. – r_31415 Mar 19 '17 at 18:38

1 Answer


You could try writing a filter driver (a clean filter) that ignores specific patterns.
See this discussion as an example.

echo '*.html filter=ignore_value' >> .gitattributes
git config filter.ignore_value.clean "sed -e '/^value=.*$/d'"

That is just a first draft, as the value attribute might not be at the start of the lines: you need to adjust the regex in order to detect and ignore any line with the change you wish to skip.
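As one possible adjustment, the pattern below also tolerates indentation before the attribute and trailing whitespace after it. It is a sketch only: `drop_value_lines` is a hypothetical name, and it assumes the volatile `value='...'` attribute sits on a line of its own, as in the question's example.

```shell
# Hypothetical helper: delete lines that contain nothing but an (optionally
# indented) single-quoted value attribute; every other line passes through.
drop_value_lines() {
  sed -E -e "/^[[:space:]]*value='[^']*'[[:space:]]*\$/d"
}

# Quick check on a fragment shaped like the one in the question.
printf '%s\n' "<input type='hidden' name='csrfmiddlewaretoken'" "  value='82NB5DdySoICu1mqcl0RZVk5dMCOVEQd'" "/>" | drop_value_lines
```

The same `sed` expression could replace the one in the `filter.ignore_value.clean` setting. Note that it would drop any stand-alone `value` line, not only CSRF tokens, so a narrower pattern may be preferable.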

In the comments, the OP (Robert Smith) points to a more complete command:

git diff --unified=0 HEAD~1 HEAD | grep -v -E -f PATTERNS.txt
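The contents of PATTERNS.txt are not shown in the thread. As an assumption, it could hold one extended regular expression per line, each matching a diff line to drop. The two patterns below are hypothetical and derived from the example in the question; `filter_noise` is an illustrative name.

```shell
# Hypothetical pattern file: one extended regex per line.
cat > PATTERNS.txt <<'EOF'
^[-+]value='[A-Za-z0-9]+'$
^[-+]app-[0-9]+</a>$
EOF

# Drop every diff line that matches any of the patterns.
filter_noise() {
  grep -v -E -f PATTERNS.txt
}
```

Usage would be `git diff --unified=0 HEAD~1 HEAD | filter_noise`. Hunk headers (`@@ ... @@`) of hunks whose lines were all filtered out remain in the output, so a further pass may be needed to hide them.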
VonC
  • I certainly read about drivers. Unfortunately, I think this approach is not likely to scale since I don't know in advance which patterns are supposed to be ignored. That's the reason I was trying to use a mathematically inclined way to filter irrelevant changes, but thank you for the suggestion. – r_31415 Mar 19 '17 at 01:56
  • @RobertSmith If you have a finite set of changes you want to ignore, that can scale. But if those changes (to be skipped) keep coming up, then Git won't help much. – VonC Mar 19 '17 at 01:57
  • The changes are, for the most part, repeated over and over again across documents and commits. However, as you can see from the examples above, it is not easy to guess which patterns need to be ignored, so the effort is considerable for hundreds of HTML documents. – r_31415 Mar 19 '17 at 02:03
  • @RobertSmith OK. When it comes to Git and `git diff`, I know only of those drivers. – VonC Mar 19 '17 at 02:04
  • I gave this solution a try, but I noticed that the filter only works for git diff (not git diff HEAD~1 HEAD) and only removes the + modified text (e.g. removes `+value='a0zBgxBevaBugotGpNKI6kMPsIsBbH44'` but `-value='82NB5DdySoICu1mqcl0RZVk5dMCOVEQd'` is still in the output). Is this correct? – r_31415 Mar 19 '17 at 21:08
  • @RobertSmith True, a filter driver (https://git-scm.com/docs/gitattributes#__code_filter_code) restores contents to the index, which is easy when the restoration is about *not* showing a modification (like an added `value` line), but less easy when it is about *adding back* a deletion. – VonC Mar 19 '17 at 21:21
  • @RobertSmith You would need to combine the clean filter driver with a smudge filter driver in order to save the initial values. The solution becomes less straightforward. – VonC Mar 19 '17 at 21:22
  • That's really unfortunate. So far the best solution has been piping the output of `git diff --unified=0 HEAD~1 HEAD` to `grep -v -E -f PATTERNS.txt`, which is a variation of what you suggested. – r_31415 Mar 20 '17 at 03:12
  • @RobertSmith Nice. I have included your comment in the answer for more visibility. – VonC Mar 20 '17 at 05:17
  • Great! Let me add the pipe, though. – r_31415 Mar 21 '17 at 03:25
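
Regarding the limitation raised in the comments (the clean filter only hides the added side, and not for `git diff HEAD~1 HEAD`), one alternative not discussed in the thread is a `textconv` diff driver, which Git applies to both blobs before comparing them. The sketch below builds a throwaway repository to show the idea. The driver name `htmlnoise`, the script `strip-noise.sh`, and the two substitutions are assumptions chosen to match the question's example.

```shell
# Throwaway repository so the sketch does not touch any real project.
repo=$(mktemp -d)
git init -q "$repo"

# Hypothetical normalizer: Git passes the path of a temporary copy of the blob as $1.
cat > "$repo/strip-noise.sh" <<'EOF'
sed -E -e "s/value='[A-Za-z0-9]{32}'/value='TOKEN'/g" -e 's/app-[0-9]+/app-N/g' "$1"
EOF

# Route *.html through a diff driver named "htmlnoise" (an arbitrary name).
printf '*.html diff=htmlnoise\n' > "$repo/.gitattributes"
git -C "$repo" config diff.htmlnoise.textconv "sh '$repo/strip-noise.sh'"

# Stage everything and commit with a demo identity.
commit_all() {
  git -C "$repo" add -A
  git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    -c commit.gpgsign=false commit -q -m "$1"
}

# Two snapshots: the token changes (noise) and one paragraph changes (real edit).
printf '%s\n' "value='82NB5DdySoICu1mqcl0RZVk5dMCOVEQd'" '<p>old text</p>' > "$repo/page.html"
commit_all "first snapshot"
printf '%s\n' "value='a0zBgxBevaBugotGpNKI6kMPsIsBbH44'" '<p>new text</p>' > "$repo/page.html"
commit_all "second snapshot"

# Both sides are normalized before diffing, so only the paragraph change appears.
diff_output=$(git -C "$repo" diff --unified=0 HEAD~1 HEAD)
printf '%s\n' "$diff_output"
```

Because the conversion is one-way, the resulting diff is meant for reading only and cannot be applied as a patch. The patterns still have to be known in advance, so this does not address the ranking-by-similarity idea from the question.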