Initial problem:

I am joining two CSV files in Java. I can stream the larger one (read it in, process it, and write it out line by line), but the smaller one is held in memory (in a HashMap, to be precise), because I need to look up the key of each row of the large CSV while going through it. The problem: if the "small" CSV is too large to keep in memory, I run into OutOfMemoryError.

While I know that I could avoid this by reading both CSVs into a database and performing the join there, that is not feasible in my application. Is there a Java wrapper (or some other kind of object) that would let me keep only the HashMap's keys in memory and put all of its values into a temp file on disk (in a self-managed fashion)?


Update:

Following the comments from ThomasKläger and JacobG, I solved the problem in the following way:

While reading the small CSV once through a RandomAccessFile, store each row's key in a HashMap together with that row's start and end position, obtained from RandomAccessFile's .getFilePointer().

While going through the large CSV, I now use the HashMap to look up the matching rows' positions, jump to them with .seek(pos), and read the rows from disk.
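
A rough sketch of the approach described above, using only the JDK. The class name `CsvOffsetIndex` and its `lookup` method are made up for illustration. It assumes the key is the first column, the file is UTF-8, keys are unique, and no field contains a quoted delimiter or line break. It stores only each row's start offset, since `readLine()` stops at the line end by itself; storing the end position as well would allow reading a row with a single bulk read instead.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: keeps key -> byte offset in memory, row contents stay on disk. */
class CsvOffsetIndex implements AutoCloseable {
    private final RandomAccessFile file;
    private final Map<String, Long> offsets = new HashMap<>();

    /** Scans the small CSV once and records where each row starts. */
    CsvOffsetIndex(String path, char delimiter) throws IOException {
        file = new RandomAccessFile(path, "r");
        long start = file.getFilePointer();
        String raw;
        while ((raw = file.readLine()) != null) {
            if (!raw.isEmpty()) {
                String line = decode(raw);
                int cut = line.indexOf(delimiter);
                // Assumption: the join key is the first column.
                String key = cut < 0 ? line : line.substring(0, cut);
                offsets.put(key, start); // a duplicate key overwrites the earlier row
            }
            start = file.getFilePointer(); // start of the next row
        }
    }

    /** Returns the full row for the key, or null if the key is unknown. */
    String lookup(String key) throws IOException {
        Long pos = offsets.get(key);
        if (pos == null) {
            return null;
        }
        file.seek(pos);
        return decode(file.readLine());
    }

    // RandomAccessFile.readLine() maps each byte to one char (like ISO-8859-1),
    // so the original bytes are recovered and re-decoded as UTF-8 (assumed encoding).
    private static String decode(String raw) {
        return new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

While streaming the large CSV, each row's key would be passed to `lookup(key)` and the returned row joined to the current one. Note that `RandomAccessFile.readLine()` is unbuffered, so building the index over a multi-gigabyte file can be slow; a buffered reader that tracks byte positions itself may be faster.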

This is a working solution, thanks a lot.

dotwin
  • Have you considered SQLite? – Makoto Jun 23 '17 at 15:54
  • @Makoto: I am actively considering H2DB, but would like to avoid it if possible. But maybe it will be the only possible route to go. – dotwin Jun 23 '17 at 15:59
  • You could use `ehcache`. Their `Cache` offers many of the methods that a `Map` offers and can be configured to offload entries to disk storage. – Thomas Kläger Jun 23 '17 at 16:10
  • How big is your smaller file? On a modern 64-bit machine with 16 GBytes of RAM I would expect that you can handle a "smaller" file of up to 2 GBytes (and a bigger file without limit) without a problem – Thomas Kläger Jun 23 '17 at 16:18
  • @ThomasKläger: I am currently looking into ehcache and it looks promising. The small CSVs are 9GB and this is where the problem is stemming from... – dotwin Jun 23 '17 at 16:21
  • Use a `File` or a `FileReader` as the value of the `Map` and store your information in there. – Jacob G. Jun 23 '17 at 16:41
  • @JacobG.: I already thought about that. The problem here would be that it would create > 10 million files which my file system will not enjoy. – dotwin Jun 23 '17 at 16:47
  • So just store the names of the file paths as a `String` for the value – Jacob G. Jun 23 '17 at 16:48
  • @ThomasKläger: I have been reading up on Ehcache and like this idea more and more. Do you by any chance have a working code snippet with an in-memory-key, on-disk-value mapping using Ehcache that I could use as a starting point? – dotwin Jun 23 '17 at 16:50
  • @JacobG.: What do you mean exactly please? Do you mean that I could somehow store the exact position inside the small CSV that the row is at? Is there a way to store (and then directly jump to) a position inside a file? That would work perfectly, I guess. – dotwin Jun 23 '17 at 16:52
  • Yeah, look into file seeking. However, there are libraries that can help you easily navigate CSV files. Google! – Jacob G. Jun 23 '17 at 16:53

1 Answer

From what you describe, you need something like off-heap collections, for example the MapDB library: http://www.mapdb.org/ From its description:

MapDB provides Java Maps, Sets, Lists, Queues and other collections backed by off-heap or on-disk storage. It is a hybrid between java collection framework and embedded database engine.

fxrbfg