A description of what I'm going to accomplish:
- Input 2 (N is not essential) HTML documents.
- Standardize the HTML format
- Diff the two documents -- external styles are not important but anything inline to the document will be included.
- Determine delta at the HTML Block Element level.
Expanding the last point:
Imagine two pages of the same site that both share a sidebar with what was probably a common ancestor that has been copy/pasted. Each page has some minor changes to the sidebar. The diff will reveal these changes, then I can "walk up" the DOM to find the first common block element shared by them, or just default to <body>
. In this case, I'd like to walk it up and find that, oh, they share a common <div id="sidebar">
.
I'm familiar with DaisyDiff and the application is similar -- in the CMS world.
I've also begun playing with the google diff-patch library.
I wanted to give ask this kind of non-specific question to hopefully solicit any advise or guidance that anybody thinks could be helpful. Currently if you put a gun to my head and said "CODE IT" I'd rewrite DaisyDiff in Python and add-in this block-level logic. But I thought maybe there's a better way and the answers to Anyone have a diff algorithm for rendered HTML? made me feel warm and fuzzy.