
Thank you for your attention.

I want to export a lot of data, really a lot of data (6 million rows), to a .csv file using Java. The app is a Swing application with JPA, using TopLink (ojdbc14).

I have tried to use:

BufferedWriter, RandomAccessFile, FileChannel,

etc., but the memory consumption remains very high, causing a Java heap OutOfMemoryError, even though I set the maximum heap size to 800 MB (-Xmx800m).

My latest version of the source code:

...(more lines of code)

    FileChannel channel = getRandomAccessFile(tempFile).getChannel();
    Object[][] data = pag.getRawData(); // database data in a multidimensional array

    for (int j = 0; j < data.length; j++) {
        write(data[j], channel); // write data[j] (an array) into the channel
        freeStringLine(data[j]); // data[j] is an array; this method sets all its positions to null
        data[j] = null; // release the reference
    }

    channel.force(false); // force writing to the file system (HD)
    channel.close(); // close the channel
    pag = null;

...(more lines of code)

    private void write(Object[] row, FileChannel channel) throws DatabaseException {
        if (byteBuff == null) {
            byteBuff = ByteBuffer.allocateDirect(1024 * 1024); // reused 1 MB direct buffer
        }
        for (int j = 0; j < row.length; j++) {
            if (row[j] != null) {
                byteBuff.put(row[j].toString().getBytes());
            }
            if (j < row.length - 1) {
                byteBuff.put(SPLITER_BYTES); // field separator
            }
        }
        byteBuff.put("\n".getBytes()); // end of row
        byteBuff.flip();
        try {
            channel.write(byteBuff);
        } catch (IOException ex) {
            throw new DatabaseException("Unable to write to the temporary export file: " + ex.getMessage(), ex.getCause());
        }
        byteBuff.clear();
    }

Since there are 6 million rows, I don't want to keep all that data in memory while the file is created. I write many temp files (with 5000 rows each) and, at the end of the process, append all those temp files into a single one using two FileChannels, roughly as sketched below. However, the out-of-memory exception is thrown before the joining step.
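The joining step looks roughly like this (a simplified sketch, not my exact code; TempFileJoiner, tempFiles and outFile are just placeholder names):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;
    import java.util.List;

    public class TempFileJoiner {

        // Appends every temp file to outFile using two FileChannels.
        public static void joinTempFiles(List<File> tempFiles, File outFile) throws IOException {
            FileChannel out = new RandomAccessFile(outFile, "rw").getChannel();
            try {
                for (File temp : tempFiles) {
                    FileChannel in = new FileInputStream(temp).getChannel();
                    try {
                        long pos = 0;
                        long size = in.size();
                        // transferTo may copy fewer bytes than requested, so loop until the file is done
                        while (pos < size) {
                            pos += in.transferTo(pos, size - pos, out);
                        }
                    } finally {
                        in.close();
                    }
                }
            } finally {
                out.close();
            }
        }
    }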

Do you know another strategy for exporting a lot of data?

Thanks a lot for any answer. Sorry for my English, I'm improving xD

  • You'd be better off loading the data from the database as pages and writing each page to the file, then discarding the page before you load the next one (a rough sketch of this follows the comments below). – Jeremy Aug 09 '11 at 21:12
  • Have you profiled the application to view the memory usage across objects/classes? I suspect that it is due to the size of the persistence context. Also, joining files is not a worthwhile exercise if you can write to the end of the CSV file. – Vineet Reynolds Aug 09 '11 at 21:13
  • Thanks to Jeremy and Vineet for your replies. Jeremy, I created many files because I had used BufferedWriter in an earlier version of the source code, but I'll try to use a single file with FileChannel :). Vineet, I use the NetBeans 6.9 profiler to analyze the memory consumption. Unfortunately, the "live results" tab shows that the biggest memory consumer is the byte array class (byte[]); maybe I don't know the profiler completely. About the persistence context: could disabling the caching reduce the memory consumption? – Javier Escobar Aug 11 '11 at 20:34
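(For reference, the paging that Jeremy suggests could look roughly like this with JPA; PagedExporter, em, writeRows, the page size and the query string are placeholders, not code from the question:)

    import java.util.List;
    import javax.persistence.EntityManager;
    import javax.persistence.Query;

    public class PagedExporter {

        private static final int PAGE_SIZE = 5000; // rows fetched per round trip

        // "em" is the application's EntityManager; writeRows() stands in for the existing CSV-writing code.
        public void exportInPages(EntityManager em) {
            Query query = em.createNativeQuery("SELECT * FROM export_view"); // placeholder query
            int first = 0;
            List<?> page;
            do {
                page = query.setFirstResult(first).setMaxResults(PAGE_SIZE).getResultList();
                writeRows(page);  // append this page to the CSV file
                em.clear();       // clear the persistence context between pages (matters when entities are loaded)
                first += PAGE_SIZE;
            } while (page.size() == PAGE_SIZE);
        }

        private void writeRows(List<?> rows) {
            // existing writing logic goes here (e.g. the write(Object[], FileChannel) method from the question)
        }
    }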

1 Answer


The answer is to use a "stream" approach, i.e. read one row, write one row as you scroll through the dataset. You'll need to get the query result as a cursor and iterate through it, not fetch the whole result set.

With Hibernate's Session API, for example, the code looks something like this:

ScrollableResults cursor = session.createQuery("from SomeEntity x").scroll();

while (cursor.next()) {
    writeToFile(cursor);
}

This means you only have one row in memory at a time, which is totally scalable to any number of rows and uses minimal memory (it's faster anyway).

Getting all rows at once in a result set is a convenience approach which works for small result sets (which is most of the time), but as usual, convenience comes at a cost, and it doesn't work in all situations.
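If you can't use Hibernate's ScrollableResults (the question mentions TopLink and a native query), plain JDBC gives the same one-row-at-a-time behaviour: set a fetch size and write each row out as you iterate the ResultSet. A rough sketch, where the class name, connection details, query and file name are placeholders:

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;

    public class StreamingCsvExport {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details and query - adapt to your environment
            Connection con = DriverManager.getConnection("jdbc:oracle:thin:@//host:1521/service", "user", "pass");
            Statement st = con.createStatement();
            st.setFetchSize(1000); // fetch rows from the database in batches instead of all at once
            ResultSet rs = st.executeQuery("SELECT * FROM export_view"); // placeholder query
            BufferedWriter out = new BufferedWriter(new FileWriter("export.csv"));
            try {
                ResultSetMetaData meta = rs.getMetaData();
                int cols = meta.getColumnCount();
                while (rs.next()) {                      // one row in memory at a time
                    StringBuilder line = new StringBuilder();
                    for (int i = 1; i <= cols; i++) {
                        if (i > 1) {
                            line.append(',');            // CSV field separator
                        }
                        Object value = rs.getObject(i);
                        if (value != null) {
                            line.append(value);
                        }
                    }
                    out.write(line.toString());
                    out.newLine();
                }
            } finally {
                out.close();
                rs.close();
                st.close();
                con.close();
            }
        }
    }

Either way, the key point is the same: never hold more than a small batch of rows in memory at once.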

Bohemian
  • Oh, thank you very much for your reply. Nevertheless, I'm not using the createQuery() method (I suppose that session is an instance of the Query class), but the createNativeQuery() method, which executes a complex query (joining 3 tables). Is your approach correct for my problem? (I don't use an entity name in the query :S ) – Javier Escobar Aug 10 '11 at 04:05
  • @Bohemian need your help... I am using Hibernate and facing the same issue. I have 2 million records in a table and want to write all 2 million records from the table to a csv file. Going through the above solution, it will fire the query approximately 2 million times. I just want to know if this is the right and only way to do so, or whether we can create multiple threads and let them read from the table to improve the speed. But then the question arises: how will we keep track of which records have already been read by a thread? Please help, I haven't had an answer to this question for the last 3 weeks. – sparsh610 Jul 01 '19 at 17:39
  • @sparsh610 I believe an operational need should have an operational solution. Java is great for applications that handle single requests; it doesn't work well for bulk operations. My strong recommendation is to not use Java for this. Use a shell command to invoke a database utility function. Every database has a simple way to dump data to csv - use that. It will be simple and easy to code, reliable, and way faster than anything you can do in Java. – Bohemian Jul 01 '19 at 17:58
  • @Bohemian thanks a lot... I was looking eagerly for your reply. That said, in the worst-case scenario, if we have to do it via Java, what would be the best and fastest possible way to implement it? Please help me :) – sparsh610 Jul 01 '19 at 18:09
  • I also tried googling but didn't find much on this. – sparsh610 Jul 01 '19 at 18:12
  • @sparsh610 I would push back strongly on whoever says you "have to" use Java. No one "has to" use Java for this kind of thing. Your requirement is clearly an operational requirement. Using crontab, schedule a shell script to run to dump your data to csv. – Bohemian Jul 01 '19 at 18:12
  • @Bohemian fair enough, but that's the question one of the interviewers asked me today, and he was keen to use the executor framework for it. First he wanted to save that big data in some collection, then move it to the CSV. So I just want to be in a position where, if some interviewer asks me the same question again, I will be 100% confident. So is there any efficient way to save it in a collection (list, set)? – sparsh610 Jul 02 '19 at 05:17
  • @Bohemian https://stackoverflow.com/questions/56851156/multi-threading-multiple-threads-interacting-with-same-table – sparsh610 Jul 02 '19 at 11:14