Use HashMap to store file positions and access these randomly using RandomAccessFile

Question

Initial problem:

I have the following issue: I am joining 2 CSVs using Java. While I can "stream" one of the CSVs (read in, process, write out line-by-line), the smaller one resides in memory (in a HashMap, to be precise), as I need to look up the key of each row of the big CSV while going through it. The problem: if the "small CSV" is too large to keep in memory, I run into `OutOfMemoryError`s.

While I know that I could avoid these issues by just reading both CSVs into a DB and performing the join there, it is infeasible in my application to do so. Is there a Java wrapper (or some other sort of object) which would allow me to keep only the HashMap's keys in memory, and put all of its values into a temp file on disk (in a self-managed fashion)?

Update:

After the comments of ThomasKläger and JacobG, I solved the problem in the following way:

Use a HashMap to store a row’s keys and that row’s start and end position using RandomAccessFile’s .getFilePointer().

While going through the large CSV, I am now using the HashMap to look up the matching rows’ positions, .seek(pos), and read them.

This is a working solution, thanks a lot.
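
For readers who want a starting point, here is a minimal sketch of the approach described above. It is not the original code. The class and method names (`CsvOffsetIndex`, `buildIndex`, `readRow`, `join`) are made up for illustration, and it assumes that the key is the first column, that fields are separated by commas, that the files are UTF-8, and that there are no quoted fields containing commas or line breaks.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: keeps only keys and byte offsets of the small CSV in memory.
final class CsvOffsetIndex {

    /** Scans the small CSV once and maps each key to {start, end} byte offsets of its row. */
    static Map<String, long[]> buildIndex(RandomAccessFile small) throws IOException {
        Map<String, long[]> index = new HashMap<>();
        small.seek(0);
        long start = small.getFilePointer();
        String raw;
        while ((raw = small.readLine()) != null) {
            long end = small.getFilePointer(); // position just after the line terminator
            // readLine() maps each byte to one char, so re-decode the bytes as UTF-8.
            String line = new String(raw.getBytes(StandardCharsets.ISO_8859_1),
                                     StandardCharsets.UTF_8);
            String key = line.split(",", 2)[0]; // assumption: key is the first column
            index.put(key, new long[] {start, end});
            start = end;
        }
        return index;
    }

    /** Jumps to a stored position and reads that single row back from disk. */
    static String readRow(RandomAccessFile small, long[] span) throws IOException {
        byte[] buf = new byte[(int) (span[1] - span[0])];
        small.seek(span[0]);
        small.readFully(buf);
        // trim() drops the line terminator (and any surrounding whitespace).
        return new String(buf, StandardCharsets.UTF_8).trim();
    }

    /** Streams the big CSV and appends "bigRow,smallRow" for every key found in the index. */
    static void join(Path smallCsv, Path bigCsv, Appendable out) throws IOException {
        try (RandomAccessFile small = new RandomAccessFile(smallCsv.toFile(), "r");
             BufferedReader big = Files.newBufferedReader(bigCsv, StandardCharsets.UTF_8)) {
            Map<String, long[]> index = buildIndex(small);
            String line;
            while ((line = big.readLine()) != null) {
                long[] span = index.get(line.split(",", 2)[0]);
                if (span != null) {
                    // The key column appears twice in the output; strip it if unwanted.
                    out.append(line).append(',').append(readRow(small, span)).append('\n');
                }
            }
        }
    }
}
```

Two caveats with this sketch: `RandomAccessFile.readLine()` is unbuffered, so building the index over a multi-gigabyte file can be slow, and the map still holds one key plus two `long` values per row, which for tens of millions of rows is not negligible.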

@Makoto: I am actively considering H2DB, but would like to avoid it if possible. But maybe it will be the only possible route to go. — dotwin, Jun 23 '17 at 15:59
You could use `ehcache`. Their `Cache` offers many methods that a `Map` also offers and can be configured to offload entries onto disk storage. — Thomas Kläger, Jun 23 '17 at 16:10
How big is your smaller file? On a modern 64-bit machine with 16 GBytes of RAM, I would expect that you can handle a "smaller" file of up to 2 GBytes (and a bigger file without limit) without a problem. — Thomas Kläger, Jun 23 '17 at 16:18
@ThomasKläger: I am currently looking into ehcache and it looks promising. The small CSVs are 9GB and this is where the problem is stemming from... — dotwin, Jun 23 '17 at 16:21
Use a `File` or a `FileReader` as the value of the `Map` and store your information in there. — Jacob G., Jun 23 '17 at 16:41
@JacobG.: I already thought about that. The problem here would be that it would create > 10 million files which my file system will not enjoy. — dotwin, Jun 23 '17 at 16:47
So just store the names of the file paths as a `String` for the value — Jacob G., Jun 23 '17 at 16:48
@ThomasKläger: I have been reading up on Ehcache and like this idea more and more. Do you by any chance have a working code snippet with an in-mem-key on-disk-value mapping using Ehcache that I could use as a starting point? — dotwin, Jun 23 '17 at 16:50
@JacobG.: What do you mean exactly please? Do you mean that I could somehow store the exact position inside the small CSV that the row is at? Is there a way to store (and then directly jump to) a position inside a file? That would work perfectly, I guess. — dotwin, Jun 23 '17 at 16:52
Yeah, look into file seeking. However, there are libraries that can help you easily navigate CSV files. Google! — Jacob G., Jun 23 '17 at 16:53

score 0 · Answer 1 · answered Jun 23 '17 at 17:50

From what you describe, you need something like off-heap collections, for example the MapDB library (http://www.mapdb.org/). From its description:

MapDB provides Java Maps, Sets, Lists, Queues and other collections backed by off-heap or on-disk storage. It is a hybrid between java collection framework and embedded database engine.
