
I am processing a huge CSV (1GB) using Java code.

My application is running on a 2-core machine with 8GB of memory.

I am using the command below to start my application.

java -Xms4g -Xmx6g  -cp $CLASSPATH JobSchedulerService

The application starts a thread to download the CSV from S3 and process it. The application works fine for some time, but throws an OutOfMemoryError halfway through processing the file.

I am looking for a way to keep processing the CSV file while keeping my memory usage low.

In the CSV processing I am performing the following steps:

 //Step 1: Download FROM S3
String bucketName = env.getProperty(AWS_S3_BUCKET_NAME);
AmazonS3 s3Client = new AmazonS3Client(credentialsProvider);
S3Object s3object = s3Client.getObject(new GetObjectRequest(bucketName, key));
InputStream inputStream =  s3object.getObjectContent();   //This stream contains about 1GB of data

//Step 2: Parse CSV to Java
ObjectReader oReader = CSV_MAPPER.readerFor(InboundProcessing.class).with(CSV_SCHEMA);
try (FileOutputStream fos = new FileOutputStream(outputCSV, Boolean.FALSE)) {
    SequenceWriter sequenceWriter = CsvUtils.getCsvObjectWriter(InboundProcessingDto.class).writeValues(fos);
    MappingIterator<InboundProcessing> mi = oReader.readValues(inputStream);

    while (mi.hasNextValue()) {
        InboundProcessing inboundProcessing = mi.nextValue();
        inboundProcessingRepository.save(inboundProcessing);   // Spring Data JPA entity save operation (almost 3M records, so 3M calls)
        sequenceWriter.write(inboundProcessingDto);  // writes to a CSV file on the local file system, which is uploaded to S3 in the next step
    }
} catch (Exception e) {
    throw new FBMException(e);
}
Pramod
  • It looks like you are reading the whole thing into memory at once. Is that necessary? – pvg Sep 14 '17 at 12:41
  • If your start command really contains `java -Xms4g -Xms6g ...` you should correct it to `java -Xms4g -Xmx6g ...`. – blafasel Sep 14 '17 at 12:59
  • Thanks, it was a typo. – Pramod Sep 14 '17 at 16:48
  • @pvg I am trying to read it line by line. I am not sure if memory is being flushed after I move to the next line. – Pramod Sep 14 '17 at 16:51
  • @PramodBindal it's kind of hard to tell since you've shown a tiny bit of your reading code, haven't specified the libraries you're using, etc. You should probably edit your question with those details – pvg Sep 14 '17 at 16:56
  • @pvg Code updated. I am using the Jackson CSV API for CSV parsing and Spring Data with Hibernate to save data to the DB. This is my complete code, nothing more than that except some loggers and calculations. – Pramod Sep 14 '17 at 17:45

3 Answers


1) Split the big file into smaller files.

2) Process the files one by one, either sequentially or in parallel (a rough sketch follows below).

See this link for splitting a file into smaller pieces: https://stackoverflow.com/a/2356156/8607192

Or

Use the Unix command `split` to split based on size.
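
For illustration only, here is a minimal sketch of that idea in plain Java, assuming a fixed number of lines per chunk. The chunk size, file names and the processChunk placeholder are assumptions, not part of the original answer.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class CsvSplitter {

    private static final int LINES_PER_CHUNK = 100_000;   // assumed chunk size

    public static void main(String[] args) throws IOException {
        List<Path> chunks = split(Paths.get("big-input.csv"));   // assumed input file name

        // Process the small files one by one so only one chunk is handled at a time.
        for (Path chunk : chunks) {
            processChunk(chunk);     // placeholder for the real CSV processing
            Files.delete(chunk);     // clean up once the chunk is done
        }
    }

    private static List<Path> split(Path source) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(source)) {
            String line;
            int lineCount = 0;
            BufferedWriter writer = null;
            while ((line = reader.readLine()) != null) {
                if (lineCount % LINES_PER_CHUNK == 0) {
                    if (writer != null) {
                        writer.close();
                    }
                    Path chunk = Paths.get("chunk-" + chunks.size() + ".csv");
                    chunks.add(chunk);
                    writer = Files.newBufferedWriter(chunk);
                }
                writer.write(line);
                writer.newLine();
                lineCount++;
            }
            if (writer != null) {
                writer.close();
            }
        }
        return chunks;
    }

    private static void processChunk(Path chunk) {
        // The real work (parse the CSV, save to the DB, write the output file) would go here.
        System.out.println("Processing " + chunk);
    }
}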

Anil K

I found the reason for the OOM. I am reading the file the right way: line by line, discarding each line as soon as I am done processing it, so that is not creating the problem.

The problem is when I write the same data to the database.

My code runs in a transactional block, because of which entities are not released until the transaction is complete. In short, all 3M entities are kept in memory until the transaction is committed.

I was able to reach this conclusion once I added a finalize method to my suspected objects. All I could see was that the DTOs (temporary POJOs) were being discarded very quickly, but not a single entity was discarded; then, all of a sudden, all the entities were discarded at once.
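
For reference, one common way to stop the persistence context from accumulating millions of managed entities is to flush and clear it every few thousand records. The sketch below only illustrates that approach; it is not the fix the author actually applied, and the batch size, class name and injected EntityManager are assumptions.

import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class InboundProcessingBatchWriter {

    private static final int BATCH_SIZE = 1_000;   // assumed batch size

    @PersistenceContext
    private EntityManager entityManager;

    // Persists entities in batches, flushing and clearing the persistence context
    // periodically so managed entities do not pile up for the whole transaction.
    @Transactional
    public void saveAll(Iterable<InboundProcessing> records) {
        int count = 0;
        for (InboundProcessing record : records) {
            entityManager.persist(record);
            count++;
            if (count % BATCH_SIZE == 0) {
                entityManager.flush();   // push the pending inserts to the database
                entityManager.clear();   // detach the entities so they can be garbage collected
            }
        }
        entityManager.flush();
        entityManager.clear();
    }
}

If Hibernate is also configured with hibernate.jdbc.batch_size, the inserts themselves can be batched as well.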

Pramod

You have not closed the InputStream inputStream.

About s3object.getObjectContent(), the Javadoc says: "Gets the input stream containing the contents of this object."

Note: The method is a simple getter and does not actually create a stream. If you retrieve an S3Object, you should close this input stream as soon as possible, because the object contents aren't buffered in memory and stream directly from Amazon S3. Further, failure to close this stream can cause the request pool to become blocked.
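
A minimal sketch of what that looks like with try-with-resources; the class and method names are placeholders, and only the S3 calls mirror the question's code.

import java.io.IOException;
import java.io.InputStream;

import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

public class S3CsvDownloader {

    // Downloads the object and guarantees the stream is closed, even if processing fails.
    public void downloadAndProcess(AWSCredentialsProvider credentialsProvider,
                                   String bucketName, String key) throws IOException {
        AmazonS3 s3Client = new AmazonS3Client(credentialsProvider);
        S3Object s3object = s3Client.getObject(new GetObjectRequest(bucketName, key));
        try (InputStream inputStream = s3object.getObjectContent()) {
            // Step 2 of the question (CSV parsing) would read from inputStream here.
            // try-with-resources closes the stream as soon as this block exits.
        }
    }
}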

amoljdv06