-1

My task is to compare two html pages' content like how much they are different from each other. By difference I mean that how much both are different/identical w.r.t. divs, imgs, content, and other tags (all differences a user can visually interpret. Say if you are comparing two html pages for buying a product, so the buying process has 3 steps. If I compare step2 (Credit-Card info) and step3(Checkout/Confirmation page) then almost all contents outside of the buying panel of both pages are same but inside all contents are different. Hence user can visually interpret that both pages are different).

For this I used Levenshtein distance, the code is below

     /**
     * The method levenshteinDistance() is use to calculate the distance between
     * two strings
     * 
     * @param lhs
     *            first string
     * @param rhs
     *            secont sreing
     * @return distance
     */
    public static int levenshteinDistance(CharSequence lhs, CharSequence rhs) {
        int len0 = lhs.length() + 1;
        int len1 = rhs.length() + 1;

        // the array of distances
        int[] cost = new int[len0];
        int[] newcost = new int[len0];

        // initial cost of skipping prefix in String s0
        for (int i = 0; i < len0; i++)
            cost[i] = i;

        // dynamically computing the array of distances

        // transformation cost for each letter in s1
        for (int j = 1; j < len1; j++) {
            // initial cost of skipping prefix in String s1
            newcost[0] = j;

            // transformation cost for each letter in s0
            for (int i = 1; i < len0; i++) {
                // matching current letters in both strings
                int match = (lhs.charAt(i - 1) == rhs.charAt(j - 1)) ? 0 : 1;

                // computing cost for each transformation
                int cost_replace = cost[i - 1] + match;
                int cost_insert = cost[i] + 1;
                int cost_delete = newcost[i - 1] + 1;

                // keep minimum cost
                newcost[i] = Math.min(Math.min(cost_insert, cost_delete), cost_replace);
            }

            // swap cost/newcost arrays
            int[] swap = cost;
            cost = newcost;
            newcost = swap;
        }

        // the distance is the cost for transforming all letters in both strings
        return cost[len0 - 1];
    }

Questions

1) Is Levenshtein distance a correct way to compare two big html pages ?

1.1) If yes then sometimes the strings length are greater 120000 characters. At this point Levenshtein distance consume too much resources, sometimes it halts other processes/tomcat-server for some minutes. Again Is Levenshtein distance a correct way to compare two big html pages ?

1.2) If no then suggest me a good/efficient algorithm to compare such big html pages.

P.S: I am using Java 8 and tomcat 8 as a server.

Junaid
  • 2,572
  • 6
  • 41
  • 77
  • 2
    Perhaps running through a profiler would be helpful? Any low-hanging fruit there? – Joe C Jul 12 '17 at 06:06
  • May be you can try this solution mentioned here: [Alternative to Levenshtein and Trigram](https://stackoverflow.com/questions/20162894/alternative-to-levenshtein-and-trigram) – UVM Jul 12 '17 at 06:31
  • 1
    It is difficult to give an advise, because you don't mention why you want to compare the differences. For example, do you want to check that everything else is identical? Or do you want to concentrate on the differences during a review? – iBiber Jul 12 '17 at 08:37
  • @iBiber I want to check how much html pages are identical. – Junaid Jul 12 '17 at 09:02
  • For the downvoter let me know the reason for downvote. – Junaid Oct 08 '18 at 03:36

1 Answers1

0

If you want to know the differences in terms of "all differences a user can visually interpret", i think it is not logical to compare the codes (html) to extract differences. Because visually same parts may generated from different codes, so it will be false positive for you.

I might do it by screenshot comparison of pages to see visually differences.

yılmaz
  • 1,818
  • 13
  • 15
  • I agree that some parts are manipulated by javascript. But still most of the contents are same. My limitation is not to compare screenshot but compare them via html code. – Junaid Jul 12 '17 at 06:27