
In Java, I have a method that reads two files, with each line being a GUID. The lines are unordered. The output is two new files with the lines that appear only in the corresponding input file.

Example files:

| Input_1 | Input_2 |         | Output_1 | Output_2 |
| ------- | ------- |         | -------- | -------- |
| abcdef  | uvwxyz  |    >    | mnopqr   | uvwxyz   |
| ghijkl  | ghijkl  |
| mnopqr  | abcdef  |

I managed to do it fine with one Collection<String> for each file and some addAll() + removeAll() shenanigans; however, the files are growing in size and the whole thing is taking quite a while now. Each file has about 600k lines.

Is there a way to speed this up just by using another type of collection, or do I need to refactor my approach?

Code in question:

    //Read two files
    Collection<String> guidFile1 = readFileGuid(pathFile1);
    Collection<String> guidFile2 = readFileGuid(pathFile2);

    //Add file1 and remove file2
    Collection<String> leftFromFile1 = new ArrayList<String>();
    leftFromFile1.addAll(guidFile1);
    leftFromFile1.removeAll(guidFile2);

    //Add file2 and remove file1
    Collection<String> leftFromFile2 = new ArrayList<String>();
    leftFromFile2.addAll(guidFile2);
    leftFromFile2.removeAll(guidFile1);

    //Outputs
    System.out.println("Leftover from file1: " + leftFromFile1.size());
    System.out.println("Leftover from file2: " + leftFromFile2.size());
Mark Rotteveel
  • Which implementation did you use for the Collection interface? Try HashSet instead of a List. – Kent Jan 04 '22 at 14:12
  • This community is more about "not working" code. You rather want a review regarding performance aspects, so https://codereview.stackexchange.com/ might be a better place. Having said that, there are many aspects here. For example: does order matter? If you use Set objects instead of Lists, subtraction/intersection would be much more efficient. But then you might lose the order you had in your file. Long story short: the optimal solution depends on your exact requirements. – GhostCat Jan 04 '22 at 14:13
  • Following up on order: if order doesn't matter, you could SORT the lists after reading them in from file. Doing such "delta" computations on SORTED content boils down to iterating the 2 lists in lockstep, just once, and avoids the overhead of calling `removeAll()` twice, for example. Also note: sorting is something you could do BEFORE even reading in the files, e.g. with some command-line tooling. So, as said: there are a lot of options, but they are all different depending on your exact scenario. – GhostCat Jan 04 '22 at 14:16
  • If the files are unordered I'd say you need to create a set of GUIDs in file 1 and then check which are present in file 2 (you could use a stream here). A simple approach would be to build the entire set for file 1 and then process file 2. If the files have similar ordering you could also try a more sophisticated approach that reads the files in parallel (e.g. smaller batches), but this can get very complex. – Thomas Jan 04 '22 at 14:17
  • @GhostCat you are completely right about the website and now I feel a little ashamed. I don't need order, just the lines. The files are unordered at first. – Ronaldo Sabin Jan 04 '22 at 14:19
  • As stated above, returning a `Set` from `readFileGuid` is better. Then you can iterate through the first set, and if the element is found in the second set, remove it from both. – augur Jan 04 '22 at 14:21
  • No need to feel ashamed. You got as many facts into your question as possible; you are doing way better than the average first questions that show up here each day. Also note: the real answer here would be: start experimenting. You can solve this problem in many different ways, so, to get the most out of it, **try** them one by one. You can do this with sets. You can do this by using lists and ordering them. Beyond that, this is most likely a "solved" problem with a defined "optimal solution". – GhostCat Jan 04 '22 at 14:34
  • Like: https://stackoverflow.com/questions/3462143/get-difference-between-two-lists for Python ... – GhostCat Jan 04 '22 at 14:36
  • Have a look at my implementation here https://github.com/walaniam/java-sandbox/blob/main/src/main/java/walaniam/stack/TwoExclusiveSets.java with the test covering the performance https://github.com/walaniam/java-sandbox/blob/main/src/test/java/walaniam/stack/TwoExclusiveSetsTest.java For values I've randomly generated two sets, 600k elements each. On average I was getting ~95ms to search unique values. – Mariusz W. Jan 11 '22 at 10:06
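GhostCat's sorted-lockstep idea from the comments can be sketched as follows. This is a minimal illustration of the technique, not tuned for 600k-line files; the `SortedDiff` class and `diff` method names are my own placeholders:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedDiff {

    // Walk two SORTED lists in lockstep, collecting the elements
    // unique to each one in a single linear pass.
    static List<List<String>> diff(List<String> a, List<String> b) {
        List<String> onlyA = new ArrayList<>();
        List<String> onlyB = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {            // present in both files: skip
                i++;
                j++;
            } else if (cmp < 0) {      // only in a
                onlyA.add(a.get(i++));
            } else {                   // only in b
                onlyB.add(b.get(j++));
            }
        }
        // Whatever is left in either list has no counterpart.
        onlyA.addAll(a.subList(i, a.size()));
        onlyB.addAll(b.subList(j, b.size()));
        return List.of(onlyA, onlyB);
    }

    public static void main(String[] args) {
        List<String> f1 = new ArrayList<>(List.of("abcdef", "ghijkl", "mnopqr"));
        List<String> f2 = new ArrayList<>(List.of("abcdef", "ghijkl", "uvwxyz"));
        Collections.sort(f1);
        Collections.sort(f2);
        List<List<String>> d = diff(f1, f2);
        System.out.println("only in file1: " + d.get(0)); // [mnopqr]
        System.out.println("only in file2: " + d.get(1)); // [uvwxyz]
    }
}
```

After the O(n log n) sorts, the diff itself is O(n), at the cost of losing the original line order (which the asker says does not matter).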

1 Answer


In your code, removeAll is the most expensive operation. On an ArrayList it compares every element of one list against every element of the other, so even if both files were only 100 lines long, each removeAll call would perform 100 × 100 = 10,000 operations, i.e. 20,000 for both calls (20,200 including the addAll calls). What you want is on the order of 200 operations: one constant-time lookup per line. The way to get that is to use a Set.

  import java.io.File;
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Scanner;

  //Read each line of the file into a HashSet, giving O(1) lookups
  static HashSet<String> readFileGuid(String path) throws IOException{
      HashSet<String> guidFile = new HashSet<>();
      Scanner s = new Scanner(new File(path));
      while(s.hasNextLine())
        guidFile.add(s.nextLine());
      s.close();
      return guidFile;
  }

  //Collect the elements of s1 that are not present in s2
  static List<String> subtract(HashSet<String> s1, HashSet<String> s2){
      List<String> result = new ArrayList<>();
      for (String item : s1) {
          if (!s2.contains(item))
            result.add(item);
      }
      return result;
  }

  public static void main (String[] args) throws IOException{
      HashSet<String> guidFile1 = readFileGuid("input1.txt");
      HashSet<String> guidFile2 = readFileGuid("input2.txt");

      List<String> leftFromFile1 = subtract(guidFile1, guidFile2);
      List<String> leftFromFile2 = subtract(guidFile2, guidFile1);

      System.out.println("file1:" + leftFromFile1);
      System.out.println("file2:" + leftFromFile2);
  }
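Since the question asks for two output files rather than printed lists, here is a rough sketch of the same Set-based approach that also writes the results, using java.nio.file.Files. The `GuidDiffFiles`/`writeDiffs` names and the file names are placeholders of my own, and `.toList()` assumes Java 16+:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GuidDiffFiles {

    // Read every line of a file into a HashSet for constant-time lookups.
    static Set<String> readGuids(Path path) throws IOException {
        return new HashSet<>(Files.readAllLines(path));
    }

    // Write the lines unique to each input file into the matching output file.
    static void writeDiffs(Path in1, Path in2, Path out1, Path out2) throws IOException {
        Set<String> guids1 = readGuids(in1);
        Set<String> guids2 = readGuids(in2);

        // Keep only the lines missing from the other file.
        List<String> only1 = guids1.stream().filter(g -> !guids2.contains(g)).toList();
        List<String> only2 = guids2.stream().filter(g -> !guids1.contains(g)).toList();

        Files.write(out1, only1);
        Files.write(out2, only2);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder paths; substitute your actual file locations.
        writeDiffs(Path.of("input1.txt"), Path.of("input2.txt"),
                   Path.of("output1.txt"), Path.of("output2.txt"));
    }
}
```

Note that because the lines pass through a HashSet, the output order is arbitrary, which matches the asker's comment that order does not matter.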
Cheng Thao