
I have two HTML pages whose source code I want to compare. I have converted each page's source to a string, and I need to know the best way to compare the two very large strings.

  • Should I use an ordinary comparison method, e.g. `page1.eql?(page2)`?
  • Does `str.eql?(str1)` also compare special characters such as `@`?

I would really appreciate knowing the best approach for this comparison.

sawa
amjad
  • You just want to know IF there is a difference? Do you also compare one of the files to many others? – moritz Nov 14 '12 at 14:32
  • You converted the page source to strings? What were they before? Why isn't a simple string comparison sufficient? What happened when you tried to compare `@` to `@`? Have you looked at `diff`? What have you tried? – the Tin Man Nov 14 '12 at 14:44
  • I want to know if the content is different and extract the difference for further analysis. – amjad Nov 14 '12 at 14:52
  • You can use [Meld](http://meldmerge.org/) – ichigolas Nov 14 '12 at 16:05

5 Answers


I'm not sure how detailed you want your comparison to be. If you want "diff-like" capabilities, check out this similar earlier question: diff a ruby string or array
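
To illustrate what "diff-like" output could look like, here is a minimal sketch of a line-based diff built on a longest-common-subsequence table, using only core Ruby. The `line_diff` helper and the sample pages are made up for this illustration; they do not come from the linked question. The table grows with the product of the two line counts, so for really large pages a dedicated library such as the diff-lcs gem is probably a better fit.

```ruby
# Minimal line-based diff using a longest-common-subsequence (LCS) table.
# Returns an array of [op, line] pairs, where op is "=", "-" or "+".
# NOTE: line_diff is a hypothetical helper written for this example.
def line_diff(old_text, new_text)
  a = old_text.lines.map(&:chomp)
  b = new_text.lines.map(&:chomp)

  # lcs[i][j] = length of the LCS of a[i..-1] and b[j..-1]
  lcs = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  (a.size - 1).downto(0) do |i|
    (b.size - 1).downto(0) do |j|
      lcs[i][j] = if a[i] == b[j]
                    lcs[i + 1][j + 1] + 1
                  else
                    [lcs[i + 1][j], lcs[i][j + 1]].max
                  end
    end
  end

  # Walk the table from the top, emitting unchanged, removed and added lines.
  result = []
  i = j = 0
  while i < a.size && j < b.size
    if a[i] == b[j]
      result << ["=", a[i]]
      i += 1
      j += 1
    elsif lcs[i + 1][j] >= lcs[i][j + 1]
      result << ["-", a[i]]
      i += 1
    else
      result << ["+", b[j]]
      j += 1
    end
  end
  a[i..-1].each { |line| result << ["-", line] }
  b[j..-1].each { |line| result << ["+", line] }
  result
end

page1 = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>\n"
page2 = "<html>\n<body>\n<p>Goodbye</p>\n</body>\n</html>\n"

# Keep only the lines that differ, for further analysis.
changes = line_diff(page1, page2).reject { |op, _| op == "=" }
changes.each { |op, line| puts "#{op} #{line}" }
```

The `changes` array holds only the removed and added lines, which is the part the asker wants to extract.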

Benjamin Tan Wei Hao

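This levenshtein method returns the edit distance between the two strings, i.e. the number of single-character insertions, deletions, and substitutions needed to turn one into the other. It gives you a number, not the differences themselves, so I'm not sure if that's what you're looking for. Otherwise I would recommend just using `page1.eql?(page2)`.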

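# Returns the Levenshtein edit distance between a and b as an Integer.
# Iterative two-row version: the naive recursion takes exponential time,
# which is unusable for page-sized strings.
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?
  prev = (0..b.length).to_a             # distances from "" to each prefix of b
  a.each_char.with_index(1) do |ca, i|
    curr = [i]                          # distance from a[0, i] to ""
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end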
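
On the `eql?` suggestion: as far as I know, `String#==` and `String#eql?` both compare the full content of the strings, so characters such as `@` are treated like any other character. The small sketch below shows this, plus a hypothetical `first_difference` helper (not a built-in method) that reports where two strings first diverge. The sample strings are invented for the example.

```ruby
page1 = "<p>Contact: admin@example.com</p>"
page2 = "<p>Contact: admin@example.com</p>"
page3 = "<p>Contact: admin#example.com</p>"

# Content comparison: every character counts, punctuation included.
puts page1 == page2      # true
puts page1.eql?(page2)   # true
puts page1.eql?(page3)   # false, because "@" differs from "#"

# Hypothetical helper: index of the first differing character,
# or nil when the strings are identical.
def first_difference(a, b)
  limit = [a.length, b.length].min
  index = (0...limit).find { |i| a[i] != b[i] }
  return index if index
  a.length == b.length ? nil : limit
end

puts first_difference(page1, page3).inspect   # 17
puts first_difference(page1, page2).inspect   # nil
```

Plain equality only answers whether the pages differ; knowing the first differing index gives a starting point for looking at what changed.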
ryandawkins

Check out the loofah gem (github link). It diffs HTML (and XML) subtrees semantically, meaning that meaningless whitespace is ignored, the order of attributes is ignored, etc.

Mike Dalessio

Try using http://prettydiff.com/?lang=html

Pretty Diff will strip out comments and meaningless white space for the most accurate comparison. It also provides advanced options for fine tuning different kinds of false positive conditions.
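
If you would rather stay in Ruby, the same idea of normalizing before comparing can be roughly sketched with regular expressions. This is not how Pretty Diff works internally; `normalize_html` is a hypothetical helper, and the regexes are deliberately naive. They do not understand `<pre>`, `<script>`, or attribute values, and removing whitespace between tags can change how inline elements render.

```ruby
# Rough normalization: drop HTML comments and collapse whitespace.
# NOTE: normalize_html is a hypothetical helper, not part of any library.
def normalize_html(html)
  html.gsub(/<!--.*?-->/m, "")   # remove comments, including multi-line ones
      .gsub(/>\s+</, "><")       # remove whitespace between adjacent tags
      .gsub(/\s+/, " ")          # collapse remaining whitespace runs
      .strip
end

page1 = "<div>\n  <!-- header -->\n  <p>Hello   world</p>\n</div>\n"
page2 = "<div><p>Hello world</p></div>"

puts page1 == page2                                  # false
puts normalize_html(page1) == normalize_html(page2)  # true
```

Comparing the normalized strings avoids false positives from indentation and comments, while a real change in the text or markup still shows up.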

austincheney

That's something that the nokogiri-diff gem does. Since it's based on a genuine HTML parser, it will be more robust to gratuitous differences (e.g., in the layout).

akim