I am exploring a way to compare two files in Java and show the differences as HTML.

Below is the code I am using:

import java.io.File;
import java.io.IOException;
 
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;
import org.apache.commons.text.diff.CommandVisitor;
import org.apache.commons.text.diff.StringsComparator;
 
public class FileDiff {
 
    public static void main(String[] args) throws IOException {
        // Read both files with line iterator.
        LineIterator file1 = FileUtils.lineIterator(new File("file-1.txt"), "utf-8");
        LineIterator file2 = FileUtils.lineIterator(new File("file-2.txt"), "utf-8");
 
        // Initialize visitor.
        FileCommandsVisitor fileCommandsVisitor = new FileCommandsVisitor();
 
        // Read file line by line so that comparison can be done line by line.
        while (file1.hasNext() || file2.hasNext()) {
            /*
             * In case both files have different number of lines, fill in with empty
             * strings. Also append newline char at end so next line comparison moves to
             * next line.
             */
            String left = (file1.hasNext() ? file1.nextLine() : "") + "\n";
            String right = (file2.hasNext() ? file2.nextLine() : "") + "\n";
 
            // Prepare diff comparator with lines from both files.
            StringsComparator comparator = new StringsComparator(left, right);
 
            if (comparator.getScript().getLCSLength() > (Integer.max(left.length(), right.length()) * 0.4)) {
                /*
                 * If both lines have at least 40% commonality then only compare with each other
                 * so that they are aligned with each other in final diff HTML.
                 */
                comparator.getScript().visit(fileCommandsVisitor);
            } else {
                /*
                 * If both lines do not have 40% commonality then compare each with empty line so
                 * that they are not aligned to each other in final diff instead they show up on
                 * separate lines.
                 */
                StringsComparator leftComparator = new StringsComparator(left, "\n");
                leftComparator.getScript().visit(fileCommandsVisitor);
                StringsComparator rightComparator = new StringsComparator("\n", right);
                rightComparator.getScript().visit(fileCommandsVisitor);
            }
        }
 
        fileCommandsVisitor.generateHTML();
    }
}
 
/*
 * Custom visitor for file comparison which stores comparison & also generates
 * HTML in the end.
 */
class FileCommandsVisitor implements CommandVisitor<Character> {
 
    // Spans with red & green highlights to put highlighted characters in HTML
    private static final String DELETION = "<span style=\"background-color: #FB504B\">${text}</span>";
    private static final String INSERTION = "<span style=\"background-color: #45EA85\">${text}</span>";
 
    private String left = "";
    private String right = "";
 
    @Override
    public void visitKeepCommand(Character c) {
        // For new line use <br/> so that in HTML also it shows on next line.
        String toAppend = "\n".equals("" + c) ? "<br/>" : "" + c;
        // KeepCommand means c is present in both left & right, so add it to both
        // without any highlight.
        left = left + toAppend;
        right = right + toAppend;
    }
 
    @Override
    public void visitInsertCommand(Character c) {
        // For new line use <br/> so that in HTML also it shows on next line.
        String toAppend = "\n".equals("" + c) ? "<br/>" : "" + c;
        // InsertCommand means character is present in right file but not in left. Show
        // with green highlight on right.
        right = right + INSERTION.replace("${text}", "" + toAppend);
    }
 
    @Override
    public void visitDeleteCommand(Character c) {
        // For new line use <br/> so that in HTML also it shows on next line.
        String toAppend = "\n".equals("" + c) ? "<br/>" : "" + c;
        // DeleteCommand means character is present in left file but not in right. Show
        // with red highlight on left.
        left = left + DELETION.replace("${text}", "" + toAppend);
    }
 
    public void generateHTML() throws IOException {
 
        // Get template & replace placeholders with left & right variables with actual
        // comparison
        String template = FileUtils.readFileToString(new File("difftemplate.html"), "utf-8");
        String out1 = template.replace("${left}", left);
        String output = out1.replace("${right}", right);
        // Write file to disk.
        FileUtils.write(new File("finalDiff.html"), output, "utf-8");
        System.out.println("HTML diff generated.");
    }
}

For smaller files this works well and gives good results on my laptop. But with larger files (200 MB, about half a million rows), IntelliJ seems to hang. My laptop has 16 GB of RAM.

How can I improve this to handle large files for comparison?

Thanks

Abhi

1 Answer

The way you wrote FileCommandsVisitor may prevent it from being optimized. You are concatenating strings for every character visited, for instance:

left = left + toAppend;
right = right + toAppend;

That might create a new String instance for every concatenation: a new String that by the end is nearly 200 MB long, created once for every character you visit, with the old ones left to be garbage collected. If your class held StringBuilders instead and you used the append() method, it might speed up drastically. For more details, read "String concatenation: concat() vs "+" operator".
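To make the cost concrete, here is a rough sketch (not from the original code; the class and method names are made up for illustration) that counts how many characters get copied when a string grows one character at a time. It assumes the worst case, where every `+` copies the whole existing string because the runtime does not optimize the concatenation away.

```java
// Hypothetical demo, not part of the question's code.
class ConcatCost {

    // Characters copied when every append rebuilds the whole string:
    // the i-th append produces a new string of length i, so the total
    // is 1 + 2 + ... + n = n * (n + 1) / 2.
    static long copiedWithPlus(long n) {
        return n * (n + 1) / 2;
    }

    // Grows the text with '+': a new String is allocated on every iteration.
    static String buildWithPlus(int n) {
        String s = "";
        for (int i = 0; i < n; i++) {
            s = s + 'x';
        }
        return s;
    }

    // Grows the text with a StringBuilder: characters go into one growing buffer.
    static String buildWithBuilder(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('x');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Both approaches produce the same text; only the amount of copying differs.
        System.out.println(buildWithPlus(1_000).equals(buildWithBuilder(1_000)));
        // Worst-case copies for roughly 5 million characters (a 5 MB file).
        System.out.println(copiedWithPlus(5_000_000L));
    }
}
```

Under that assumption, a 5 MB input already implies on the order of 10^13 character copies per side, which would be consistent with a run that does not finish in hours.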

For clarity (since according to comments you missed the point twice now):

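class FileCommandsVisitor implements CommandVisitor<Character> {

    // Spans with red & green highlights, unchanged from the question
    private static final String DELETION = "<span style=\"background-color: #FB504B\">${text}</span>";
    private static final String INSERTION = "<span style=\"background-color: #45EA85\">${text}</span>";

    // StringBuilders as fields instead of Strings
    private final StringBuilder left = new StringBuilder();
    private final StringBuilder right = new StringBuilder();

    @Override
    public void visitKeepCommand(Character c) {
        String toAppend = "\n".equals("" + c) ? "<br/>" : "" + c;
        // append to the StringBuilders where you would concat strings
        left.append(toAppend);
        right.append(toAppend);
    }

    // same as above for the other methods
    @Override
    public void visitInsertCommand(Character c) {
        String toAppend = "\n".equals("" + c) ? "<br/>" : "" + c;
        right.append(INSERTION.replace("${text}", toAppend));
    }

    @Override
    public void visitDeleteCommand(Character c) {
        String toAppend = "\n".equals("" + c) ? "<br/>" : "" + c;
        left.append(DELETION.replace("${text}", toAppend));
    }

    public void generateHTML() throws IOException {
        String template = FileUtils.readFileToString(new File("difftemplate.html"), "utf-8");
        // turn StringBuilders into Strings only when you actually need a String
        String out1 = template.replace("${left}", left.toString());
        String output = out1.replace("${right}", right.toString());
        FileUtils.write(new File("finalDiff.html"), output, "utf-8");
        System.out.println("HTML diff generated.");
    }
}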

If that doesn't help, however, and the concatenation was already being optimized at runtime, I don't see anything else fundamentally wrong with your approach. Comparing huge files is not a cheap operation; it can't be faster than reading the two files line by line from your hard drive. You are still taking a shortcut (one that increases speed, not decreases it) by having your FileCommandsVisitor hold both diffs in memory instead of writing them out as it goes, which means that at best your code can diff a file of a size equal to half your available RAM. Note, however, that you never mentioned how long it actually takes, so it's hard to say whether the time you're seeing is expected or an anomaly.
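If memory ever becomes the limit, one option is to write each side of the diff out as it is produced. The following is only a sketch of that idea: `StreamingDiffWriter` and its method names are hypothetical, it writes to plain `java.io.Writer`s rather than filling the template from the question, and wiring it into `CommandVisitor<Character>` (plus stitching the two outputs into one HTML page) is left out. Like the original code, it does not escape HTML special characters.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical sketch: emits diff fragments straight to two Writers
// instead of accumulating both sides in memory.
class StreamingDiffWriter {
    private static final String DELETION_OPEN = "<span style=\"background-color: #FB504B\">";
    private static final String INSERTION_OPEN = "<span style=\"background-color: #45EA85\">";
    private static final String SPAN_CLOSE = "</span>";

    private final Writer left;
    private final Writer right;

    StreamingDiffWriter(Writer left, Writer right) {
        this.left = left;
        this.right = right;
    }

    // Character present on both sides: write it to both outputs, unhighlighted.
    void keep(char c) {
        write(left, "", c, "");
        write(right, "", c, "");
    }

    // Character only in the right file: green highlight on the right.
    void insert(char c) {
        write(right, INSERTION_OPEN, c, SPAN_CLOSE);
    }

    // Character only in the left file: red highlight on the left.
    void delete(char c) {
        write(left, DELETION_OPEN, c, SPAN_CLOSE);
    }

    private static void write(Writer out, String open, char c, String close) {
        try {
            out.write(open);
            if (c == '\n') {
                out.write("<br/>"); // keep line breaks visible in HTML
            } else {
                out.write(c);
            }
            out.write(close);
        } catch (IOException e) {
            // Visitor-style callbacks cannot throw checked exceptions, so wrap it.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // StringWriters keep the demo self-contained.
        StringWriter l = new StringWriter();
        StringWriter r = new StringWriter();
        StreamingDiffWriter diff = new StreamingDiffWriter(l, r);
        diff.keep('a');
        diff.delete('b');
        diff.insert('c');
        System.out.println(l);
        System.out.println(r);
    }
}
```

For real files, the two `Writer`s would presumably be buffered file writers (for example from `Files.newBufferedWriter`), closed once the comparison loop finishes.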

Deltharis
  • After your last comment I gave it a run for comparison of 5MB files with 8k lines each. But even after 2 hours comparison is not completed. – Abhi Jan 13 '22 at 15:14
  • And yes, that was after changing + to concat(). So that change doesn't change much in performance – Abhi Jan 13 '22 at 15:23
  • @Abhi uh... no, not .concat(), that will also return a String instance every time you do it. You need to delay creating a String as long as you can by using `StringBuilders` as your properties and using the `append()` method to add to them, like I wrote in the answer – Deltharis Jan 13 '22 at 15:58
  • ohh sorry. I gave a run after changing + to left=new StringBuilder().append(left).append(toAppend).toString(); right=new StringBuilder().append(right).append(toAppend).toString(); but its still 2 hours and run is not completed. – Abhi Jan 13 '22 at 18:00
  • @Abhi you missed my point again it seems, still creating a String every time in a method. I edited my answer to clearer explain what I mean. – Deltharis Jan 14 '22 at 07:48
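The exchange in the comments comes down to where the `String` gets created. As a small illustration (the class below is hypothetical, not the asker's code), `+`, `concat()`, and a throwaway `StringBuilder` followed by `toString()` all hand back a fresh `String` on every call, whereas a `StringBuilder` kept as a field only produces a `String` once, at the end.

```java
// Hypothetical demo of the variants discussed in the comments.
class AppendVariants {

    // Variant 1: '+' returns a new String each call.
    static String plus(String s, String t) {
        return s + t;
    }

    // Variant 2: concat() also returns a new String (for non-empty t).
    static String concat(String s, String t) {
        return s.concat(t);
    }

    // Variant 3: a StringBuilder created and discarded per call still
    // copies the whole existing text and builds a new String every time.
    static String throwawayBuilder(String s, String t) {
        return new StringBuilder().append(s).append(t).toString();
    }

    // Variant 4: a long-lived StringBuilder; appends go into one buffer.
    private final StringBuilder buffer = new StringBuilder();

    void append(String t) {
        buffer.append(t);
    }

    String result() {
        return buffer.toString(); // the only place a String is created
    }

    public static void main(String[] args) {
        String left = "abc";
        // Each static variant yields a different object than its input.
        System.out.println(plus(left, "d") != left);
        System.out.println(concat(left, "d") != left);
        System.out.println(throwawayBuilder(left, "d") != left);

        AppendVariants v = new AppendVariants();
        v.append("abc");
        v.append("d");
        System.out.println(v.result());
    }
}
```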