I work with text files containing short strings (10 digits each). Each file is approximately 1.5 GB, so the number of rows reaches 100 million.
Every day I get another file and need to extract new elements (tens of thousands a day).
What's the best approach to solve my problem?
I tried loading the data into an ArrayList - that takes around 20 seconds per file, but the subtraction of the two lists takes forever.
I use this code:
dataNew.removeAll(dataOld);
I tried loading the data into HashSets - creating the HashSets takes forever. The same goes for LinkedHashSet.
I tried loading into ArrayLists and sorting only one of them:
Collections.sort(dataNew);
but it didn't speed up the process of
dataNew.removeAll(dataOld);
Memory consumption is also rather high - sort() only finishes with a 15 GB heap (13 GB is not enough).
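(As far as I can tell, removeAll() simply calls dataOld.contains() for every element and never uses the ordering. The sorted variant I actually had in mind would sort dataOld instead and do binary searches - a rough sketch, not benchmarked yet:

Collections.sort(dataOld);                      // sort the old data, not the new
List<String> delta = new ArrayList<>();
for (String line : dataNew) {
    if (Collections.binarySearch(dataOld, line) < 0) {  // not present in the old file
        delta.add(line);
    }
}

)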
I've also tried the good old Linux diff utility, and it finished the task in 76 minutes (while eating 8 GB of RAM).
So my goal is to solve the problem in Java within 1 hour of processing time (or less, of course) and with memory consumption of at most 15 GB (ideally 8-10 GB).
Any suggestions, please? Maybe I need something other than alphabetical sorting of the ArrayList?
UPDATE: This is a country-wide list of invalid passports. It is published as a complete list each time, so I have to extract the delta myself.
The data is unsorted and each row is unique, so I must compare 100M elements against 100M elements. A data line looks like "2404,107263", for example. Converting it to an integer is not possible.
Interestingly, when I increased the maximum heap size to 16 GB
java -Xms5G -Xmx16G -jar utils.jar
loading into a HashSet became fast (50 seconds for the first file), but the program gets killed by the system's Out-Of-Memory killer, because it eats enormous amounts of RAM while loading the second file into a second HashSet or ArrayList.
My code is very simple:
List<String> setL = Files.readAllLines(Paths.get("filename"));
HashSet<String> dataNew = new HashSet<>(setL);
On the second file the program gets
Killed
[1408341.392872] Out of memory: Kill process 20538 (java) score 489 or sacrifice child
[1408341.392874] Killed process 20531 (java) total-vm:20177160kB, anon-rss:16074268kB, file-rss:0kB
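(I suspect part of the problem is that readAllLines() keeps an extra list of 100 million references alive on top of the HashSet. If I'm not mistaken, a streaming load would look roughly like this:

HashSet<String> dataNew = new HashSet<>(150_000_000);   // rough pre-size for ~100M entries, avoids rehashing while loading
try (BufferedReader reader = Files.newBufferedReader(Paths.get("filename"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        dataNew.add(line);
    }
}

)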
UPDATE2:
Thanks for all your ideas!
The final solution is: converting the lines to long + using the fastutil library (LongOpenHashSet).
RAM consumption became 3.6 GB and processing time only 40 seconds!
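In case it is useful to someone, the gist of it is roughly the following (just a sketch: I'm assuming here that the comma in "2404,107263" is only a separator so the whole line packs into one long, that only the old file has to be kept in memory while the new file is streamed, and the file names and pre-size are placeholders):

import it.unimi.dsi.fastutil.longs.LongOpenHashSet;

LongOpenHashSet dataOld = new LongOpenHashSet(150_000_000);  // rough pre-size for ~100M keys
try (BufferedReader reader = Files.newBufferedReader(Paths.get("oldFile"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // "2404,107263" -> 2404107263L (assuming the comma is just a separator)
        dataOld.add(Long.parseLong(line.replace(",", "")));
    }
}
try (BufferedReader reader = Files.newBufferedReader(Paths.get("newFile"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        if (!dataOld.contains(Long.parseLong(line.replace(",", "")))) {
            System.out.println(line);   // element present only in the new file
        }
    }
}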
An interesting observation: while starting Java with default settings made loading 100 million Strings into the JDK's native HashSet endless (I interrupted it after 1 hour), starting with -Xmx16G sped the loading up to about 1 minute. Memory consumption was ridiculous (around 20 GB), but the overall processing speed was rather fine - 2 minutes.
So if you are not limited by RAM, the native JDK HashSet is not so bad in terms of speed.
P.S. Maybe the task was not clearly explained, but I do not see any way to avoid loading at least one file entirely, so I doubt memory consumption can be lowered much further.