My task is to compare two html pages' content like how much they are different from each other. By difference I mean that how much both are different/identical w.r.t. div
s, img
s, content, and other tags (all differences a user can visually interpret. Say if you are comparing two html pages for buying a product, so the buying process has 3 steps. If I compare step2 (Credit-Card info) and step3(Checkout/Confirmation page) then almost all contents outside of the buying panel of both pages are same but inside all contents are different. Hence user can visually interpret that both pages are different).
For this I used Levenshtein distance, the code is below
/**
* The method levenshteinDistance() is use to calculate the distance between
* two strings
*
* @param lhs
* first string
* @param rhs
* secont sreing
* @return distance
*/
public static int levenshteinDistance(CharSequence lhs, CharSequence rhs) {
int len0 = lhs.length() + 1;
int len1 = rhs.length() + 1;
// the array of distances
int[] cost = new int[len0];
int[] newcost = new int[len0];
// initial cost of skipping prefix in String s0
for (int i = 0; i < len0; i++)
cost[i] = i;
// dynamically computing the array of distances
// transformation cost for each letter in s1
for (int j = 1; j < len1; j++) {
// initial cost of skipping prefix in String s1
newcost[0] = j;
// transformation cost for each letter in s0
for (int i = 1; i < len0; i++) {
// matching current letters in both strings
int match = (lhs.charAt(i - 1) == rhs.charAt(j - 1)) ? 0 : 1;
// computing cost for each transformation
int cost_replace = cost[i - 1] + match;
int cost_insert = cost[i] + 1;
int cost_delete = newcost[i - 1] + 1;
// keep minimum cost
newcost[i] = Math.min(Math.min(cost_insert, cost_delete), cost_replace);
}
// swap cost/newcost arrays
int[] swap = cost;
cost = newcost;
newcost = swap;
}
// the distance is the cost for transforming all letters in both strings
return cost[len0 - 1];
}
Questions
1) Is Levenshtein distance a correct way to compare two big html pages ?
1.1) If yes then sometimes the strings length are greater 120000 characters. At this point Levenshtein distance consume too much resources, sometimes it halts other processes/tomcat-server for some minutes. Again Is Levenshtein distance a correct way to compare two big html pages ?
1.2) If no then suggest me a good/efficient algorithm to compare such big html pages.
P.S: I am using Java 8 and tomcat 8 as a server.