
I need to sort a huge CSV file (10+ million records) with several algorithms in Java, but I'm running into memory problems.

Basically I have a huge CSV file where every record has 4 fields of different types (String, int, double). I need to load this CSV into some structure and then sort it by each field.

My idea was to write a Record class (with its own fields), read the CSV file line by line, create a new Record object for every line, and put them all into an ArrayList. Then I call my sorting algorithms for each field.
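Roughly, a minimal sketch of what I mean (the field names and the exact types of the four columns are just placeholders here):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the approach described above; field names/types are placeholders.
    class Record {
        String name;
        int id;
        double value;
        double score;

        Record(String name, int id, double value, double score) {
            this.name = name;
            this.id = id;
            this.value = value;
            this.score = score;
        }
    }

    public class CsvLoader {
        public static void main(String[] args) throws IOException {
            List<Record> records = new ArrayList<>();
            try (BufferedReader reader = new BufferedReader(new FileReader("data.csv"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split(",");
                    records.add(new Record(parts[0],
                                           Integer.parseInt(parts[1]),
                                           Double.parseDouble(parts[2]),
                                           Double.parseDouble(parts[3])));
                }
            }
            // ...then run my sorting algorithms over 'records', once per field.
        }
    }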

It doesn't work: I get an OutOfMemoryError when I try to load all the Record objects into my ArrayList.

This way I create tons of objects, and I think that's not a good idea. What should I do with this huge amount of data? Which method/data structure would be less expensive in terms of memory usage?

My goal is just to run the sorting algorithms and see how they behave on a big data set; saving the sorted result to a file is not important.

I know there are some libraries for CSV, but I have to implement this without external libraries.

Thank you very much! :D

  • did you try to assign your JVM more memory? – whatTheFox Apr 15 '16 at 09:41
  • http://stackoverflow.com/questions/6452765/how-to-increase-heap-size-of-jvm – Simon Apr 15 '16 at 09:41
  • Not yet, but is it a good idea? I mean, are there any other possibilities to reduce memory usage? A more efficient data structure, a better way to store the fields of a CSV file... I thought about it and didn't find anything, but I'm a newbie in Java and maybe I'm missing something :). Thanks! –  Apr 15 '16 at 09:48
  • Giving your JVM more memory isn't a bad idea unless you have limited resources. As to the efficiency of data structures: it depends heavily on the data you are working with. We can't possibly give you any answers or hints without seeing some data. You might want to provide some examples ;) – whatTheFox Apr 15 '16 at 11:04
  • It's a standard CSV file, each field separated by a comma and every line/record ending with \n: ,,,\n. There are about 15 million records. The point is to load this file into some structure and then sort it by each field, with different sorting algorithms. It's just for academic purposes; in a real context you would probably never need to sort such big data. I have to use only Java with no other external libs or tools (like another user suggests below [thanks anyway!]) and, of course, without using any built-in Java sort algorithms (if any exist). –  Apr 22 '16 at 14:08
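For reference, the extra heap suggested in the first comments is given to the JVM with the -Xmx option at launch; the 4g value and the CsvSorter class name below are only placeholders:

    # start the JVM with a 4 GB heap (value and main class are placeholders)
    java -Xmx4g CsvSorter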

2 Answers


Cut your file into pieces (depending on the size of the file) and look into merge sort. That way you can sort even big files without using a lot of memory, and it's what databases use when they have to do huge sorts.
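A rough sketch of that idea (not part of the original answer): read the input in chunks that fit in memory, sort each chunk, write it to a temporary file, then do a k-way merge of the temporary files. The file names, the 100,000-line chunk size, and sorting on the second (int) field are arbitrary assumptions, and List.sort stands in for whichever of your own algorithms you want to test:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class ExternalSort {

        // Phase 1: read the big CSV in chunks, sort each chunk in memory, write it out.
        static List<Path> sortChunks(Path input, int chunkSize, Comparator<String> cmp) throws IOException {
            List<Path> chunkFiles = new ArrayList<>();
            try (BufferedReader in = Files.newBufferedReader(input)) {
                List<String> buffer = new ArrayList<>(chunkSize);
                String line;
                while ((line = in.readLine()) != null) {
                    buffer.add(line);
                    if (buffer.size() == chunkSize) {
                        chunkFiles.add(writeSorted(buffer, cmp));
                        buffer.clear();
                    }
                }
                if (!buffer.isEmpty()) {
                    chunkFiles.add(writeSorted(buffer, cmp));
                }
            }
            return chunkFiles;
        }

        static Path writeSorted(List<String> lines, Comparator<String> cmp) throws IOException {
            lines.sort(cmp);                 // swap in your own sorting algorithm here
            Path tmp = Files.createTempFile("chunk", ".csv");
            Files.write(tmp, lines);
            return tmp;
        }

        // Phase 2: k-way merge of the sorted chunk files into one sorted output file.
        static void merge(List<Path> chunkFiles, Path output, Comparator<String> cmp) throws IOException {
            List<BufferedReader> readers = new ArrayList<>();
            // heap entries are [line, reader index]; the smallest line is always on top
            PriorityQueue<String[]> heap = new PriorityQueue<>((a, b) -> cmp.compare(a[0], b[0]));
            try {
                for (int i = 0; i < chunkFiles.size(); i++) {
                    BufferedReader r = Files.newBufferedReader(chunkFiles.get(i));
                    readers.add(r);
                    String first = r.readLine();
                    if (first != null) heap.add(new String[] { first, Integer.toString(i) });
                }
                try (BufferedWriter out = Files.newBufferedWriter(output)) {
                    while (!heap.isEmpty()) {
                        String[] smallest = heap.poll();
                        out.write(smallest[0]);
                        out.newLine();
                        int idx = Integer.parseInt(smallest[1]);
                        String next = readers.get(idx).readLine();
                        if (next != null) heap.add(new String[] { next, smallest[1] });
                    }
                }
            } finally {
                for (BufferedReader r : readers) r.close();
            }
        }

        public static void main(String[] args) throws IOException {
            // example: sort on the second (int) field of each CSV line
            Comparator<String> byIntField =
                    Comparator.comparingInt(line -> Integer.parseInt(line.split(",")[1]));
            List<Path> chunks = sortChunks(Paths.get("data.csv"), 100_000, byIntField);
            merge(chunks, Paths.get("sorted.csv"), byIntField);
        }
    }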

Kayaman

I would use an in-memory database such as H2 in in-memory mode (jdbc:h2:mem:) so everything stays in RAM and isn't flushed to disk (provided you have enough RAM; if not, you might want to use the file-based URL). Create your table in there and write every row from the CSV. Provided you set up the indexes properly, sorting and grouping will be a breeze with standard SQL.
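A minimal sketch of what that could look like with plain JDBC (not from the original answer; the table and column names are invented, and the H2 driver is assumed to be on the classpath):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Sketch only: load the CSV into an in-memory H2 table, then sort with ORDER BY.
    public class H2SortExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:csvsort")) {
                try (Statement st = conn.createStatement()) {
                    // column names/types are assumptions about the four CSV fields
                    st.execute("CREATE TABLE records(name VARCHAR, id INT, v1 DOUBLE, v2 DOUBLE)");
                }
                try (BufferedReader in = new BufferedReader(new FileReader("data.csv"));
                     PreparedStatement ins = conn.prepareStatement(
                             "INSERT INTO records VALUES (?, ?, ?, ?)")) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] f = line.split(",");
                        ins.setString(1, f[0]);
                        ins.setInt(2, Integer.parseInt(f[1]));
                        ins.setDouble(3, Double.parseDouble(f[2]));
                        ins.setDouble(4, Double.parseDouble(f[3]));
                        ins.executeUpdate();
                    }
                }
                // sorting by any field is now a plain ORDER BY
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery("SELECT * FROM records ORDER BY id")) {
                    while (rs.next()) {
                        // consume the sorted rows...
                    }
                }
            }
        }
    }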

Riz
  • Thanks, nice idea, but I have to use the standard Java libraries and no external tools like a DB etc. There are a lot of possibilities I think, but this is just for academic purposes; the point is to learn and see how sorting works, which algorithm is better, in which case, etc. :) Thanks anyway –  Apr 15 '16 at 09:59