
This is the code I have written to validate one file against another by comparing their records. Is there a way to write it in a more performant way? Both files can contain millions of records, and I believe this version will be slow in those cases.

I am thinking of using a HashMap: every time a line occurs in a file, I add +1 to its key's value (a new key starts at 1). If a record from file 2 already exists in the map built from file 1, I remove it from that map; if it doesn't, I add it to a second map. This continues, alternating between the files, until the end.

I don't do a line by line comparison because order of the lines may be different in both files.
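The counting idea described above can be sketched with a plain `HashMap` used as a multiset. This is a minimal sketch, not the original code; the class name, variable names, and the sample data are illustrative:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultisetDiff {
    // Builds a count map (a multiset) of the given lines:
    // each occurrence of a line adds 1 to its key's value.
    static Map<String, Integer> countLines(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            counts.merge(line, 1, Integer::sum); // +1 per occurrence
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> expected = List.of("a", "b", "b", "c");
        List<String> actual = List.of("a", "b", "d");

        // Subtract the second file's lines from the first file's counts.
        Map<String, Integer> missing = countLines(expected);
        Map<String, Integer> extra = new HashMap<>();
        for (String line : actual) {
            Integer left = missing.get(line);
            if (left == null) {
                extra.merge(line, 1, Integer::sum); // never expected
            } else if (left == 1) {
                missing.remove(line);               // last expected copy matched
            } else {
                missing.put(line, left - 1);        // one expected copy matched
            }
        }
        System.out.println(missing); // one "b" and the "c" were never seen
        System.out.println(extra);   // "d" was seen but never expected
    }
}
```

Because every line carries a count, duplicate records are handled naturally: each copy in one file must be matched by a copy in the other.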

public static void main(String[] args) throws Exception {
    BufferedReader br1 = null;
    BufferedReader br2 = null;
    BufferedWriter br3 = null;
    String sCurrentLine;
    int linelength;
    List<String> list1 = new ArrayList<String>();
    List<String> list2 = new ArrayList<String>();
    List<String> unexpectedrecords = new ArrayList<String>();

    br1 = new BufferedReader(new FileReader("expected.txt"));

    br2 = new BufferedReader(new FileReader("actual.txt"));

    while ((sCurrentLine = br1.readLine()) != null) {
        list1.add(sCurrentLine);
    }
    while ((sCurrentLine = br2.readLine()) != null) {
        list2.add(sCurrentLine);
    }
    List<String> expectedrecords = new ArrayList<String>(list1);

    List<String> actualrecords = new ArrayList<String>(list2);

    // Iterate over the expected records only; using the larger of the two
    // sizes as the bound would run get(i) past the end of the shorter list.
    linelength = expectedrecords.size();

    for (int i = 0; i < linelength; i++) {
        if (actualrecords.contains(expectedrecords.get(i))) {
            actualrecords.remove(expectedrecords.get(i));
        } else {
            unexpectedrecords.add(expectedrecords.get(i));
        }
    }

    br3 = new BufferedWriter(new FileWriter(new File("c.txt")));
    br3.write("Records which are not present in actual");
    br3.newLine();
    for (int x = 0; x < unexpectedrecords.size(); x++) {
        br3.write(unexpectedrecords.get(x));
        br3.newLine();
    }
    br3.write("Records which are in actual but not present in expected");
    br3.newLine();
    for (int i = 0; i < actualrecords.size(); i++) {
        br3.write(actualrecords.get(i));
        br3.newLine();
    }
    br1.close();
    br2.close();
    br3.flush();
    br3.close();
}
Jithu Paul
    If it works and is complete, [Code Review](https://codereview.stackexchange.com/) can be a nice place to get high quality answers on ways to improve your code. – Serge Ballesta Apr 26 '18 at 15:06
  • have you try the retainAll method ? https://stackoverflow.com/a/2762137/1811730 – Dams Apr 26 '18 at 15:10
  • Possible duplicate of [Java Compare Two Lists](https://stackoverflow.com/questions/2762093/java-compare-two-lists) – Dams Apr 26 '18 at 15:11
  • See this answer: https://stackoverflow.com/a/32565564/4632333 – mies Apr 26 '18 at 15:12
  • @dams I had a look at this method. retainAll doesn't work for me since I may have the same record twice in the file and I want both to be taken into consideration. The same applies to removeAll. – Jithu Paul Apr 26 '18 at 15:27

3 Answers


HashMap solution

I thought about it, and the HashMap solution is effectively instant. I went ahead and coded up an example of it here.

It ran in 0 ms, while the ArrayList version took 16 ms on the same dataset:

public static void main(String[] args) throws Exception {
    BufferedReader br1 = null;
    BufferedReader br2 = null;
    BufferedWriter bw3 = null;
    String sCurrentLine;
    int linelength;

    HashMap<String, Integer> expectedrecords = new HashMap<String, Integer>();
    HashMap<String, Integer> actualrecords = new HashMap<String, Integer>();

    br1 = new BufferedReader(new FileReader("expected.txt"));
    br2 = new BufferedReader(new FileReader("actual.txt"));

    while ((sCurrentLine = br1.readLine()) != null) {
        if (expectedrecords.containsKey(sCurrentLine)) {
            expectedrecords.put(sCurrentLine, expectedrecords.get(sCurrentLine) + 1);
        } else {
            expectedrecords.put(sCurrentLine, 1);
        }
    }
    while ((sCurrentLine = br2.readLine()) != null) {
        if (expectedrecords.containsKey(sCurrentLine)) {
            int expectedCount = expectedrecords.get(sCurrentLine) - 1;
            if (expectedCount == 0) {
                expectedrecords.remove(sCurrentLine);
            } else {
                expectedrecords.put(sCurrentLine, expectedCount);
            }
        } else {
            if (actualrecords.containsKey(sCurrentLine)) {
                actualrecords.put(sCurrentLine, actualrecords.get(sCurrentLine) + 1);
            } else {
                actualrecords.put(sCurrentLine, 1);
            }
        }
    }
    br1.close();
    br2.close();

    // expected is left with all records not present in actual
    // actual is left with all records not present in expected
    bw3 = new BufferedWriter(new FileWriter(new File("c.txt")));
    bw3.write("Records which are not present in actual\n");
    for (String key : expectedrecords.keySet()) {
        for (int i = 0; i < expectedrecords.get(key); i++) {
            bw3.write(key);
            bw3.newLine();
        }
    }
    bw3.write("Records which are in actual but not present in expected\n");
    for (String key : actualrecords.keySet()) {
        for (int i = 0; i < actualrecords.get(key); i++) {
            bw3.write(key);
            bw3.newLine();
        }
    }
    bw3.flush();
    bw3.close();
}

Example:

expected.txt

one
two
four
five
seven
eight

actual.txt

one
two
three
five
six

c.txt

Records which are not present in actual
four
seven
eight
Records which are in actual but not present in expected
three
six

Example 2:

expected.txt

one
two
four
five
seven
eight
duplicate
duplicate
duplicate

actual.txt

one
duplicate
two
three
five
six

c.txt

Records which are not present in actual
four
seven
eight
duplicate
duplicate
Records which are in actual but not present in expected
three
six
Caleb
  • I would like to have something which takes care of multiple occurrences. – Jithu Paul Apr 26 '18 at 16:24
  • @jithuPaul - I updated my answer. It will now handle duplicate records by only removing the first match. – Caleb Apr 26 '18 at 17:00
  • It is not preferable for an algorithm to need to read the complete contents of the files being compared into memory. There are line-sequential algorithms which do not run out of memory when processing, for example, two 10 GB files. – Flying Dutchman Apr 26 '18 at 17:16
  • @jithuPaul I made a hashmap solution which is far quicker than arrayLists – Caleb Apr 26 '18 at 17:21
  • @Flyingdutchman The problem here is that the input is not guaranteed to be in order and it is also possible to have duplicates so it is unlikely it can be done sequentially. – Caleb Apr 26 '18 at 17:21
  • Ok - I see the bold line which has been added now above, in the original question. I do strongly advocate a merge sort of the two files after they have been concatenated, and subsequently comparing each pair of consecutive lines in order to identify all duplicates. The merge-sort algorithm lends itself very well to using the file system as a cache. [Merge-sort explained](https://en.wikipedia.org/wiki/Merge_sort) – Flying Dutchman Apr 26 '18 at 18:16
  • @Caleb, the hash map thing looks just perfect for my scenario. – Jithu Paul Apr 26 '18 at 21:40
  • @Caleb Can this be extended to more than 2 files? Ex: multiple files in 2 different folders and the result is also multiple files which need to be created in a third folder.? Could you assist with this please where we mention folders instead of having to mention the exact file names which have to be compared and written to. – Mueez Siraj Feb 18 '21 at 09:12
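The line-sequential approach advocated in the comments above can be sketched as a two-pointer walk over the two inputs once both have been sorted (for files far larger than memory, by an external merge sort). This is a minimal sketch with illustrative names, operating on in-memory lists for brevity; the same walk applies to two sorted file readers:

```java
import java.util.ArrayList;
import java.util.List;

public class SortedDiff {
    // Walks two pre-sorted line lists in lock-step and collects the lines
    // that appear only in the first and only in the second. Each line is
    // visited once, and duplicates fall out naturally: every copy on one
    // side must be matched by a copy on the other.
    static void diffSorted(List<String> left, List<String> right,
                           List<String> onlyLeft, List<String> onlyRight) {
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i).compareTo(right.get(j));
            if (cmp == 0) {                       // matched pair, consume both
                i++;
                j++;
            } else if (cmp < 0) {                 // missing on the right
                onlyLeft.add(left.get(i++));
            } else {                              // missing on the left
                onlyRight.add(right.get(j++));
            }
        }
        while (i < left.size()) onlyLeft.add(left.get(i++));   // leftover tail
        while (j < right.size()) onlyRight.add(right.get(j++));
    }

    public static void main(String[] args) {
        List<String> onlyLeft = new ArrayList<>();
        List<String> onlyRight = new ArrayList<>();
        diffSorted(List.of("dup", "dup", "one", "two"),
                   List.of("dup", "one", "three"),
                   onlyLeft, onlyRight);
        System.out.println(onlyLeft);  // [dup, two]
        System.out.println(onlyRight); // [three]
    }
}
```

The walk itself needs only one line from each input at a time, which is why this style does not run out of memory on very large files once the sorting step is done externally.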

In Java 8 you can use Collection.removeIf(Predicate<T>)

// Keep a copy of list1's original contents: if list1 is pruned first,
// the second removeIf would no longer see the common elements and
// would leave them behind in list2.
List<String> copyOfList1 = new ArrayList<>(list1);
list1.removeIf(line -> list2.contains(line));
list2.removeIf(line -> copyOfList1.contains(line));

list1 will then contain everything that is NOT in list2, and list2 will contain everything that is NOT in list1.

Christian Riese
  • But this doesn't take care of multiple occurrences of a line. E.g. I have 3 occurrences of the same line in file A, but only one in file B; I want to see the difference, that 2 lines went missing. – Jithu Paul Apr 26 '18 at 15:41

On Unix/Linux computers, you can just call diff, which has been optimized for speed and memory usage.

The call looks like

String listFileDiffs = executeDiff(filenameWithPath1, filenameWithPath2);

The method is implemented by:

private String executeDiff(String filenameWithPath1, String filenameWithPath2) {
    StringBuilder output = new StringBuilder();
    Process p0;
    Process p1;
    Process p2;
    try {
        // Runtime.exec does not go through a shell, so ">" redirection is not
        // interpreted; sort's -o flag writes the output file directly instead.
        p0 = Runtime.getRuntime().exec(new String[] { "sort", "-o", "/tmp/sort1file", filenameWithPath1 });
        p0.waitFor();
        p1 = Runtime.getRuntime().exec(new String[] { "sort", "-o", "/tmp/sort2file", filenameWithPath2 });
        p1.waitFor();
        p2 = Runtime.getRuntime().exec(new String[] { "diff", "/tmp/sort1file", "/tmp/sort2file" });
        p2.waitFor();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(p2.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            output.append(line).append("\n");
        }
        reader.close();
    } catch (Exception e) {
        LOG.error("Error: executeCommand ", e);
    }
    return output.toString();
}

You can add flags to diff in order to give more information regarding all file differences found.

The solution takes the random order of the lines in each file into account: the Unix sort is called on each of the two files, and diff is subsequently run on the sorted results.

The Unix commands have matured over decades and are highly efficient.

Flying Dutchman