
We're trying to index the contents of a 3 GB CSV file (not located on the box running the application). We're able to read the file with a BufferedReader, but we run into issues when we try to read it efficiently. Someone suggested that we build a hash map from an id field to the contents of each line.

This seems like a good idea, but I cannot figure out how we can "buffer write" our hash map to a file. It seems like an object writer (ObjectOutputStream) only takes one massive "dump" object...
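For example (illustrative names only, not our actual code), the only pattern we can find is to serialize the entire map in one go:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

public class DumpWholeMap {
    public static void main(String[] args) throws IOException {
        Map<String, String> index = new HashMap<String, String>();
        index.put("id-1", "first,line,of,the,csv");
        index.put("id-2", "second,line,of,the,csv");

        // writeObject() serializes the whole map as a single object; there is
        // no obvious way to append further entries to the same file later.
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("index.bin"));
        try {
            out.writeObject(index);
        } finally {
            out.close();
        }
    }
}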

Does anyone know of a way that we can continuously put entries into the same external hash map, and then read those entries back?

Thanks!

Ajayc

3 Answers


Consider using a database; then you will not need to keep the index in memory (assuming that you are not using an in-memory database).

Uses for a local database (in your situation)

  1. Let the database maintain the index.
  2. You can cache changes to the external hash map and update less frequently than "always". This assumes that you don't need to keep the external hash map constantly up-to-date.

Without any details about your situation, it seems like a terrible idea to store stuff in a giant hash map when you can use a database and not have to roll your own solution.
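A minimal sketch of that idea, assuming an embedded JDBC database such as HSQLDB is on the classpath (the URL, credentials, table and column names below are made up for illustration). The batched insert corresponds to point 2 above: changes are collected and flushed less often than once per row.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class CsvIndexer {
    public static void main(String[] args) throws SQLException {
        // File-based HSQLDB; "SA" with an empty password is the HSQLDB default user.
        // Older driver versions may need Class.forName("org.hsqldb.jdbc.JDBCDriver") first.
        Connection con = DriverManager.getConnection("jdbc:hsqldb:file:csvindex", "SA", "");
        try {
            // The database maintains the index via the primary key on id
            // (assumes the table does not exist yet).
            Statement st = con.createStatement();
            st.execute("CREATE TABLE csv_line (id VARCHAR(64) PRIMARY KEY, line LONGVARCHAR)");
            st.close();

            PreparedStatement ps = con.prepareStatement("INSERT INTO csv_line (id, line) VALUES (?, ?)");
            try {
                ps.setString(1, "id-1");
                ps.setString(2, "first,line,of,the,csv");
                ps.addBatch();
                ps.setString(1, "id-2");
                ps.setString(2, "second,line,of,the,csv");
                ps.addBatch();
                // One round trip for the whole batch instead of one per row.
                ps.executeBatch();
            } finally {
                ps.close();
            }
        } finally {
            con.close();
        }
    }
}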

DwB

A POC of what I think you want is this:

import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

Map<Integer, String> cache;

// Reads every line into an in-memory map keyed by line number (1-based).
void readCache(BufferedReader br) throws IOException {
    cache = new HashMap<Integer, String>();
    int line = 1;
    String l;
    while ((l = br.readLine()) != null) {
        cache.put(line, l);
        line++;
    }
}

// Returns the cached line, or null if that line number was never read.
String getLine(int line) { return cache.get(line); }

Note that this will occupy a little more than 3 GB of JVM heap, so -Xmx5G is recommended :)

If possible, it might be more effective to import the CSV into a database and use SQL to read a specific line; this improves performance without the need to cache on your box and without needing more than 3 GB of RAM for this single process.
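For the "read a specific line" part, the lookup could look roughly like this (the table and column names csv_line, line_no and line are assumptions, as is the HSQLDB URL):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

String getLine(int lineNo) throws SQLException {
    Connection con = DriverManager.getConnection("jdbc:hsqldb:file:csvindex", "SA", "");
    try {
        PreparedStatement ps = con.prepareStatement("SELECT line FROM csv_line WHERE line_no = ?");
        ps.setInt(1, lineNo);
        ResultSet rs = ps.executeQuery();
        // Returns null if the requested line number is not in the table.
        return rs.next() ? rs.getString(1) : null;
    } finally {
        con.close();
    }
}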

AlexR
  • Awesome. I think we may go with a DB solution. I was thinking more along the lines of using a serialized hash instead of an in-memory solution. I probably didn't word the question correctly! – Ajayc Jul 24 '14 at 17:14
  • @Ajayc I definitely suggest that. A lightweight Java DB would be HSQL, for example. You could even parse the CSV on the owner machine using Java code to make the transition easier. – AlexR Jul 24 '14 at 17:15

A solution would be to use a (lightweight) database. Check out this SO question for a list of lightweight databases and disk-based hash maps: MapDB, jdbm2, JavaDB, BerkeleyDB are among the recommendations. This would take care of most of the issues for you, and you can easily index or query the data afterwards.

That said: If you really want to use just a hashmap, you could also try partitioning. You can either create multiple hashmaps and partition by id (horizontal partitioning) or create multiple hashmaps per id (vertical partitioning). This should allow you to get around the memory issues, although you might need to read the CSV file multiple times.
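A minimal sketch of the horizontal-partitioning idea, assuming the id field fits in a long and using an arbitrary partition count; each partition is an ordinary HashMap that can be serialized to its own file, so no single object ever holds the whole 3 GB:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionedIndex {
    // Arbitrary partition count, chosen for illustration only.
    private static final int PARTITIONS = 16;
    private final List<Map<Long, String>> partitions = new ArrayList<Map<Long, String>>();

    public PartitionedIndex() {
        for (int i = 0; i < PARTITIONS; i++) {
            partitions.add(new HashMap<Long, String>());
        }
    }

    // Horizontal partitioning: each id is routed to one of the smaller maps.
    private int bucket(long id) {
        return (int) Math.abs(id % PARTITIONS);
    }

    public void put(long id, String line) {
        partitions.get(bucket(id)).put(id, line);
    }

    public String get(long id) {
        return partitions.get(bucket(id)).get(id);
    }

    // Each partition can be serialized to (and later read back from) its own
    // file, instead of one massive dump of the entire map.
    public void writePartition(int i, String file) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file));
        try {
            out.writeObject(partitions.get(i));
        } finally {
            out.close();
        }
    }
}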

jmiserez