I work with text files containing short strings (10 digits each). Each file is approximately 1.5 GB, so the number of rows reaches 100 million.
Every day I get another file and need to extract new elements (tens of thousands a day).
What's the best approach to solve my problem?
I tried loading the data into an ArrayList - that takes around 20 seconds per file, but the subtraction of the two lists takes forever.
I use this code:
dataNew.removeAll(dataOld);
I tried loading the data into HashSets - creating the HashSets takes forever. The same goes for LinkedHashSet.
I tried loading into ArrayLists and sorting only one of them:
Collections.sort(dataNew);
but it didn't speed up the process of
dataNew.removeAll(dataOld);
Memory consumption is also rather high - sort() only finishes with a 15 GB heap (13 GB is not enough).
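(As far as I can tell, removeAll() simply calls dataOld.contains() for every element and never uses the ordering. The sorted variant I actually had in mind would sort dataOld instead and do binary searches - a rough sketch, not benchmarked yet:

Collections.sort(dataOld);                      // sort the old data, not the new
List<String> delta = new ArrayList<>();
for (String line : dataNew) {
    if (Collections.binarySearch(dataOld, line) < 0) {  // not present in the old file
        delta.add(line);
    }
}

)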
I've also tried the good old Linux diff utility, and it finished the task in 76 minutes (while eating 8 GB of RAM).
So my goal is to solve the problem in Java within 1 hour of processing time (or less, of course) and with memory consumption of at most 15 GB (ideally 8-10 GB).
Any suggestions, please? Maybe I need something other than alphabetical sorting of the ArrayList?
UPDATE: This is a country-wide list of invalid passports. It is published as a complete list each time, so I have to extract the delta myself.
The data is unsorted and each row is unique, so I must compare 100M elements against 100M elements. A data line looks like "2404,107263", for example. Converting it to an integer is not possible.
Interestingly, when I increased the maximum heap size to 16 GB
java -Xms5G -Xmx16G -jar utils.jar
loading into a HashSet became fast (50 seconds for the first file), but the program gets killed by the system's Out-Of-Memory killer, because it eats enormous amounts of RAM while loading the second file into a second HashSet or ArrayList.
My code is very simple:
List<String> setL = Files.readAllLines(Paths.get("filename"));
HashSet<String> dataNew = new HashSet<>(setL);
On the second file the program gets
Killed
[1408341.392872] Out of memory: Kill process 20538 (java) score 489 or sacrifice child
[1408341.392874] Killed process 20531 (java) total-vm:20177160kB, anon-rss:16074268kB, file-rss:0kB
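(I suspect part of the problem is that readAllLines() keeps an extra list of 100 million references alive on top of the HashSet. If I'm not mistaken, a streaming load would look roughly like this:

HashSet<String> dataNew = new HashSet<>(150_000_000);   // rough pre-size for ~100M entries, avoids rehashing while loading
try (BufferedReader reader = Files.newBufferedReader(Paths.get("filename"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        dataNew.add(line);
    }
}

)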
UPDATE2:
Thanks for all your ideas!
The final solution is: converting the lines to long + using the fastutil library (LongOpenHashSet).
RAM consumption became 3.6 GB and processing time only 40 seconds!
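In case it is useful to someone, the gist of it is roughly the following (just a sketch: I'm assuming here that the comma in "2404,107263" is only a separator so the whole line packs into one long, that only the old file has to be kept in memory while the new file is streamed, and the file names and pre-size are placeholders):

import it.unimi.dsi.fastutil.longs.LongOpenHashSet;

LongOpenHashSet dataOld = new LongOpenHashSet(150_000_000);  // rough pre-size for ~100M keys
try (BufferedReader reader = Files.newBufferedReader(Paths.get("oldFile"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // "2404,107263" -> 2404107263L (assuming the comma is just a separator)
        dataOld.add(Long.parseLong(line.replace(",", "")));
    }
}
try (BufferedReader reader = Files.newBufferedReader(Paths.get("newFile"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        if (!dataOld.contains(Long.parseLong(line.replace(",", "")))) {
            System.out.println(line);   // element present only in the new file
        }
    }
}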
An interesting observation: while starting Java with default settings made loading 100 million Strings into the JDK's native HashSet endless (I interrupted it after 1 hour), starting with -Xmx16G sped the loading up to about 1 minute. Memory consumption was ridiculous (around 20 GB), but the overall processing speed was rather fine - 2 minutes.
So if you are not limited by RAM, the native JDK HashSet is not so bad in terms of speed.
P.S. Maybe the task was not clearly explained, but I do not see any way to avoid loading at least one file entirely, so I doubt memory consumption can be lowered much further.