
I'm writing a shell script that tracks changes to a website and emails me the changed content when one occurs. The idea is to use wget to grab a copy of the HTML and compare it with the version saved the last time the script ran. wget saves the HTML file fine, but I'm having trouble comparing the files: I'm only interested in changes to the page's plain text, not the markup, links, etc.

diff finds all the changes between the two files, but it ALWAYS reports changes, even when the plain text is identical. This is because each link on the site carries an authenticity token that differs every time the page is accessed. To diff only the lines that contain plain text, I'm trying to exclude any line that begins with "<", optionally preceded by any amount of spaces. I've looked at the diff man page but can't find an option that does what I need. I don't know much about regular expressions; would one work with diff -I for this?

Thanks!

James_M

1 Answer


You could use lynx -dump to render the pages and feed the results to diff. Since you are not interested in links, though, you would also need to strip the References section that lynx appends (e.g. with awk), which makes this a not-so-robust solution (but maybe good enough for your use case).
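As a rough sketch of the lynx route, the References section could be cut off with awk before diffing. This assumes the link list starts at a line reading exactly "References", which is worth checking against your own lynx output; strip_references is a hypothetical helper name and the file names are placeholders.

```shell
# Hypothetical helper: print stdin up to (not including) the first line that
# is exactly "References", assumed here to mark the start of lynx's link list.
strip_references() {
  awk '/^References$/ { exit } { print }'
}

# Intended use (requires lynx; file names are placeholders):
#   lynx -dump before.html | strip_references > before.txt
#   lynx -dump after.html  | strip_references > after.txt
#   diff before.txt after.txt
```

Note that this would also truncate a page whose own text contains a line consisting only of "References", which is part of why this route is less robust.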

If you don't mind using a third-party tool, go for html2text:

diff <(html2text before.html) <(html2text after.html)

PS: There are two different programs called html2text.
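The one-liner above relies on process substitution, which needs bash or a similar shell. A minimal sketch of the same comparison using temporary files, which could slot into a tracking script, might look like the following. The names diff_rendered and RENDER are hypothetical, and the mail address and file names in the usage comment are placeholders.

```shell
# RENDER is an assumed knob naming the HTML-to-text command (default: html2text).
RENDER=${RENDER:-html2text}

# Hypothetical helper: diff the text renderings of two HTML files.
# Returns 0 if the rendered text is identical, 1 if it differs,
# and 2 if rendering or temp-file creation failed.
diff_rendered() {
  old_txt=$(mktemp) || return 2
  new_txt=$(mktemp) || { rm -f "$old_txt"; return 2; }
  if $RENDER "$1" > "$old_txt" && $RENDER "$2" > "$new_txt"; then
    diff "$old_txt" "$new_txt"
    status=$?
  else
    status=2
  fi
  rm -f "$old_txt" "$new_txt"
  return "$status"
}

# Intended use (address and file names are placeholders):
#   changes=$(diff_rendered last.html current.html)
#   [ $? -eq 1 ] && printf '%s\n' "$changes" | mail -s "Page changed" you@example.com
```

Checking for an exit status of exactly 1 keeps a failed render from being reported as a change.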

Adrian Frühwirth
  • Thanks! html2text looks perfect, but it will not compile on OS X. Will try on my Raspi tomorrow and report back. – James_M May 26 '13 at 01:02
  • There is a `homebrew` formula [here](https://github.com/mxcl/homebrew/blob/master/Library/Formula/html2text.rb), so you can either install it via [homebrew](http://mxcl.github.io/homebrew/) or apply the patch referenced in the formula yourself and try compiling it again. Looks like it should work! – Adrian Frühwirth May 26 '13 at 09:21