
I have a program that, at the start, generates a large amount of data (several GB, possibly more than 10 GB) and then repeatedly processes all of it: process all the data, do something, process all the data again, do something... That much data doesn't fit into my RAM, and when it starts paging it is really painful. What is the optimal way to store my data and, in general, how should I solve this problem?

Should I use a DB even though I don't need to keep the data after my program ends? Should I split my data somehow, save it into files, and load them when I need them? Or should I just keep using RAM and live with the paging?

With a DB or files there is a problem: I have to process the data in pieces. So I load a chunk of data (let's say 500 MB), calculate, load the next chunk, and after I have loaded and calculated everything, I can do something and repeat the cycle. That means I would read the same chunks of data from the HDD that I read in the previous cycle.

user3396293
  • Use MongoDB, split your data into logical units, store them in documents and use [its aggregation framework](https://docs.mongodb.org/manual/aggregation/) to process them. Scales up to terabytes and is easy to use and set up. – Markus W Mahlberg Nov 21 '15 at 23:33
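
For illustration, a minimal sketch of what that comment suggests, using the MongoDB Java driver; the database, collection, and field names ("mydb", "chunks", "key", "value") are hypothetical stand-ins for your own data layout:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;
import java.util.Arrays;

public class AggregateExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> data =
                client.getDatabase("mydb").getCollection("chunks");
            // Group and sum on the server, so only the aggregated results
            // ever have to fit into the JVM's memory.
            for (Document result : data.aggregate(Arrays.asList(
                    Aggregates.group("$key", Accumulators.sum("total", "$value"))))) {
                System.out.println(result.toJson());
            }
        }
    }
}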

4 Answers

0
  • Try to reduce the amount of data.
  • Try to modify the algorithm so that it extracts the relevant data at an early stage.
  • Try to divide and/or parallelize the problem and execute it over several clients in a cluster of computing nodes.
0

A file-based approach will be enough for your task; a couple of examples:

  1. Use the BufferedReader skip() method
  2. Use RandomAccessFile and its seek() method

Read up on these two, and the problem with re-reading duplicate chunks should go away.
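
A minimal sketch of option 2, assuming the generated data sits in a single binary file ("data.bin") and that processChunk() stands in for whatever calculation is done per chunk:

import java.io.IOException;
import java.io.RandomAccessFile;

public class ChunkedPass {
    // Size of one chunk; pick whatever fits comfortably in RAM.
    private static final int CHUNK_SIZE = 64 * 1024 * 1024;

    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[CHUNK_SIZE];
        try (RandomAccessFile file = new RandomAccessFile("data.bin", "r")) {
            long position = 0;
            while (position < file.length()) {
                file.seek(position);             // jump straight to the chunk
                int read = file.read(buffer);    // read up to CHUNK_SIZE bytes
                processChunk(buffer, read);      // hypothetical per-chunk work
                position += read;
            }
        }
    }

    private static void processChunk(byte[] chunk, int length) {
        // ... calculate on chunk[0..length) ...
    }
}

seek() lets every cycle jump straight back to the chunks it already knows about, and the OS file cache will usually keep the most recently read chunks in memory, so repeated passes tend to be cheaper than the first one.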

EnjoyLife
  • I guess I described my problem poorly. I generate some data. Then I calculate something from the data, make some changes (not to the data, the data stays the same) and calculate again... This means that in every cycle I will read the whole file into memory part after part, because the whole file doesn't fit into memory. – user3396293 Nov 21 '15 at 22:49
  • Or my description was terrible. 1. Save all the data into a file (that is now your DB). 2. Read from 0 to 250 MB into a temporary buffer, process the data, and flush/reset/close the buffer. 3. Read the next part into the buffer; after the last batch, one cycle is complete. – EnjoyLife Nov 21 '15 at 23:13
0

You should definitely try to reduce the amount of data and use multiple threads to handle it.

FutureTask could help you:

import java.math.BigDecimal;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;

ExecutorService exec = Executors.newFixedThreadPool(5);

FutureTask<BigDecimal> task1 = new FutureTask<>(new Callable<BigDecimal>() {

    @Override
    public BigDecimal call() throws Exception {
        return doBigProcessing(); // your long-running computation
    }

});

// start the future task asynchronously
exec.execute(task1);

// do other stuff

// blocks until the processing is over
// (get() declares InterruptedException and ExecutionException)
BigDecimal result = task1.get();

exec.shutdown();

In the same way, you could consider caching the results of such tasks to speed up your application where possible.

If that is not enough, you could use the Apache Spark framework to process large datasets.
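
As a rough sketch of that idea using Spark's Java API (the file name, master setting, and filter condition are placeholders): persisting with MEMORY_AND_DISK keeps as much of the data in RAM as fits and spills the rest to local disk, so the repeated passes reuse the same dataset instead of regenerating it.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class SparkPasses {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("big-data-passes").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Load once, keep it cached (RAM first, disk overflow) across passes.
            JavaRDD<String> records = sc.textFile("data.txt")
                                        .persist(StorageLevel.MEMORY_AND_DISK());

            // Each pass reuses the persisted RDD.
            long matching = records.filter(line -> line.contains("ERROR")).count();
            long total = records.count();

            System.out.println(matching + " of " + total + " records matched");
        }
    }
}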

Medhi Redjem
  • My program is already multithreaded, but that doesn't change the problem with the data. I will look into Spark and other NoSQL databases. – user3396293 Nov 21 '15 at 22:47
0

Before you think about performance, you must consider the points below:

  • find a good data structure for the data.
  • find good algorithms to process the data.

If you do not have enough memory space,

  • use a memory-mapped file to work on the data (a minimal sketch below)
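
A minimal sketch of that idea with java.nio, assuming the data lives in a single file ("data.bin" here is a placeholder); the OS pages the mapped regions in and out on demand, so the whole file never has to fit on the heap:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedPass {
    public static void main(String[] args) throws IOException {
        // A single mapping is limited to ~2 GB, so map the file window by window.
        long window = 256L * 1024 * 1024;
        try (RandomAccessFile file = new RandomAccessFile("data.bin", "r");
             FileChannel channel = file.getChannel()) {
            for (long pos = 0; pos < channel.size(); pos += window) {
                long size = Math.min(window, channel.size() - pos);
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, pos, size);
                while (buf.hasRemaining()) {
                    byte b = buf.get();   // read the data straight from the mapping
                    // ... calculate on b ...
                }
            }
        }
    }
}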

If you have a chance to process the data without loading all of it,

  • divide and conquer

And please give us more details.

RockOnGom