
I have a CSV file, 'Master List', with 800K records; each record has 13 values. The combination of cell[0] and cell[1] uniquely identifies a record, and I need to update the value of cell[12], the status, for every record.

I have another CSV file, say 'Updated subset list', which is essentially a subset of 'Master list'. For every record in this second CSV, which is much smaller (about 10,000 records), I need to update the value of cell[12], i.e. the status column, of the matching record in the master list.

I tried a plain BufferedReader, CSVParser from commons-csv, and CsvParser from univocity-parsers. But reading the whole file and building a List of 800K objects gives an out-of-memory exception.

The same code will be deployed on different servers, so I want an efficient way of reading the huge CSV file and updating that same file.

Partially reading the huge file and writing back into the same file might corrupt the data.

Any suggestions on how I can do this?

File inputF = new File(inputFilePath);

if (inputF.exists()) {
    InputStream inputFS = new FileInputStream(inputF);
    BufferedReader br = new BufferedReader(new InputStreamReader(inputFS));
    // skip the header of the file
    String line = br.readLine();
    mandatesList = new ArrayList<DdMandates>();

    while ((line = br.readLine()) != null) {
        mandatesList.add(mapToItem(line));
    }

    br.close();
}

The memory issue is resolved by processing the file in chunks. Reading a single line and writing a single line at a time might take more time; I didn't try it, since my issue was resolved by using batches of 100K records at a time and clearing the list after writing each batch.

Now the issue is that updating the status takes too much looping...

I have two CSVs. The master sheet ('Master list') has 800K records, and I also have a subset CSV with about 10K records. This subset CSV is updated by some other system and carries an updated status, say 'OK' or 'NOT OK'. I need to update this status in the master sheet. What is the best possible way to do that? The dumbest way, which I am using now, is the following:

    // Master list is processed in batches, but overall it contains 800K records and 12 columns
    List<DdMandates> mandatesList = new ArrayList<DdMandates>();
    // Subset list holds the updated status
    List<DdMandates> updatedMandatesList = new ArrayList<DdMandates>();
    // Read the subset csv file, map each line to a DdMandates item and add it to updatedMandatesList

    File inputF = new File(Property.inputFilePath);
    if (inputF.exists()) {
        InputStream inputFS = new FileInputStream(inputF);
        BufferedReader br = new BufferedReader(new InputStreamReader(inputFS, "UTF-8"));

        checkFilterAndmapToItem(br);

        br.close();
    }

In the method checkFilterAndmapToItem(BufferedReader br):

    private static void checkFilterAndmapToItem(BufferedReader br) {
        FileWriter fileWriter = null;
        try {
            // skip the header of the csv
            String line = br.readLine();
            int batchSize = 0, currentBatchNo = 0;
            fileWriter = new FileWriter(Property.outputFilePath);
            // Write the CSV file header
            fileWriter.append(FILE_HEADER.toString());
            // Add a new line separator after the header
            fileWriter.append(NEW_LINE_SEPARATOR);
            if (!Property.batchSize.isEmpty()) {
                batchSize = Integer.parseInt(Property.batchSize.trim());
            }
            while ((line = br.readLine()) != null) {

                DdMandates item = new DdMandates();
                String[] p = line.concat(" ").split(SEPERATOR);
                // Parse each p[x] and map it onto the DdMandates item.
                // Then iterate over the updated mandate list to check whether this
                // item is present there; if it is, copy its status onto item.
                // So here there is an inner for loop over ~10K elements.
                mandatesList.add(item);

                if (batchSize != 0 && mandatesList.size() == batchSize) {
                    currentBatchNo++;
                    logger.info("Batch no. : " + currentBatchNo + " is executing...");
                    processOutputFile(fileWriter);
                    mandatesList.clear();
                }
            }
            // process the output file here for the last (partial) batch ...
        } catch (IOException e) {
            logger.error("Error while processing the master file", e);
        } finally {
            try {
                if (fileWriter != null) {
                    fileWriter.close();
                }
            } catch (IOException e) {
                logger.error("Error while closing the output file", e);
            }
        }
    }

So there is a while loop (800K iterations) with an inner loop of 10K iterations for each element,

which is at least 800K * 10K iterations in total.

Please help me find the best possible approach and reduce the number of iterations.

Thanks in advance

SRana
  • Reading a large csv file in one go is never a good idea; you should process it chunk by chunk. – raviraja Jan 04 '19 at 07:23
  • @raviraja, Thanks, I thought of the same, but say I am reading a batch of 50K at a time and searching the 10K records in it, writing the updates to another new csv file, then searching the same 10K records in the next 50K records, and so on... This is too time-consuming; performance is impacted adversely in this case. – SRana Jan 04 '19 at 07:34
  • Use a Database. That's what they're for. – DevilsHnd - 退職した Jan 04 '19 at 08:41
  • Use a database as suggested. But you may want to try approaches which connect to csv as database instead. For example https://github.com/jprante/jdbc-driver-csv Then you'll be able to run normal SQL queries. This is only for reading – Akceptor Jan 04 '19 at 08:48
  • Write output to another file then move the new over the old when done. – ThomasRS Jan 04 '19 at 09:09
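A minimal sketch of that last suggestion (write everything to a new file first, then move it over the old one), assuming java.nio.file is available; the file names here are placeholders:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class SwapInUpdatedFile {
        public static void main(String[] args) throws IOException {
            Path updated = Paths.get("master_updated.csv");  // freshly written output
            Path master = Paths.get("master.csv");           // original file to replace

            // Only replace the original after the new file has been fully written and closed,
            // so a crash mid-run can never leave the master file half written.
            Files.move(updated, master, StandardCopyOption.REPLACE_EXISTING);
            // On file systems that support it, StandardCopyOption.ATOMIC_MOVE can be
            // added to make the swap atomic as well.
        }
    }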

2 Answers

2

Suppose you are reading 'Main Data File' in batches of 50K:

  • Store this data in a Java HashMap, using cell[0] and cell[1] as the key and the rest of the columns as the value.

  • The complexity of get and put is O(1) most of the time; see here.

  • So the complexity for searching 10K records in that particular batch will be O(10K).

    HashMap<String, DdMandates> hmap = new HashMap<String, DdMandates>();
    
  • Use key=DdMandates.get(0)+DdMandates.get(1)

Note: If 50K records exceed the memory limit of the HashMap, create smaller batches.

  • For further performance enhancement you can use multi-threading by creating small batches and processing them on different threads.
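A minimal sketch of that lookup, assuming a DdMandates bean as in the question; the accessor names (getCell0, getCell1, getStatus, setStatus) and the class and method names here are hypothetical:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class BatchUpdater {

        // Index the current master batch by its composite key (cell[0] + cell[1]),
        // then apply the 10K subset updates with an O(1) lookup per record.
        static void applyUpdates(List<DdMandates> masterBatch, List<DdMandates> updatedSubset) {
            Map<String, DdMandates> hmap = new HashMap<String, DdMandates>();
            for (DdMandates master : masterBatch) {
                hmap.put(master.getCell0() + "|" + master.getCell1(), master);
            }
            for (DdMandates updated : updatedSubset) {
                DdMandates match = hmap.get(updated.getCell0() + "|" + updated.getCell1());
                if (match != null) {
                    match.setStatus(updated.getStatus());   // copy the new status onto the master record
                }
            }
        }
    }

With that in place, each batch costs one pass to build the map plus one pass over the subset, instead of a 50K * 10K nested scan.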
  • Thanks @abhishek. As mentioned, I have two CSVs. The master sheet (main data file) has 800K records, and I also have a subset CSV with about 10K records. This subset CSV is updated by some other system and carries an updated status, say 'OK' or 'NOT OK'. I need to update this status in the master sheet. What is the best possible way to do that? The dumbest way I am using is the one shown in the question. – SRana Jan 04 '19 at 13:35
  • What you can do: store the 10K subset records in a HashMap, then iterate over the 800K master records one by one and look each one up in the HashMap. The total complexity will then be O(800K) with very high probability; even in the very worst case it will be O(800K * log(10K)). Hope this helps – abhishek chaurasiya Jan 04 '19 at 15:48
  • And you don't need to store the master sheet data in any list; just update the 'item' object that you are creating from each master sheet record and write it to the new csv file with the updated values. – abhishek chaurasiya Jan 04 '19 at 15:56
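Putting those two comments together, a minimal end-to-end sketch could look like the following. The file names, the separator, and the assumption that the status is simply the last column of each row are all placeholders, and quoted fields containing the separator are not handled here:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class MasterStatusUpdater {

        private static final String SEPARATOR = ",";   // assumption: plain comma-separated values

        public static void main(String[] args) throws Exception {
            // 1) Load the 10K subset into a map: composite key -> new status (last column).
            Map<String, String> newStatusByKey = new HashMap<>();
            try (BufferedReader subset = Files.newBufferedReader(
                    Paths.get("updated_subset.csv"), StandardCharsets.UTF_8)) {
                subset.readLine();                                       // skip header
                String line;
                while ((line = subset.readLine()) != null) {
                    String[] c = line.split(SEPARATOR, -1);
                    newStatusByKey.put(c[0] + "|" + c[1], c[c.length - 1]);
                }
            }

            // 2) Stream the 800K master rows one by one, patch the status column,
            //    and write everything to a new file (never back into the file being read).
            try (BufferedReader master = Files.newBufferedReader(
                    Paths.get("master.csv"), StandardCharsets.UTF_8);
                 BufferedWriter out = Files.newBufferedWriter(
                    Paths.get("master_updated.csv"), StandardCharsets.UTF_8)) {
                String line = master.readLine();                         // header
                if (line != null) {
                    out.write(line);
                    out.newLine();
                }
                while ((line = master.readLine()) != null) {
                    String[] c = line.split(SEPARATOR, -1);
                    String newStatus = newStatusByKey.get(c[0] + "|" + c[1]);
                    if (newStatus != null) {
                        c[c.length - 1] = newStatus;                     // status is the last column
                    }
                    out.write(String.join(SEPARATOR, c));
                    out.newLine();
                }
            }
        }
    }

This is one pass over the subset plus one pass over the master, i.e. roughly O(800K + 10K) map operations instead of 800K * 10K comparisons, and the finished master_updated.csv can then be moved over the original file as suggested in the question comments.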
1

The first suggestion: when you create an ArrayList with the default constructor, its initial capacity is only 10, so it will be resized repeatedly as it grows. If you work with a large amount of data, set the initial capacity up front, like:

private static final int LIST_CAPACITY = 800000;
mandatesList = new ArrayList<DdMandates>(LIST_CAPACITY);

The second suggestion: don't store all the data in memory. Read the data line by line, apply your business logic, and then let the memory be freed, like:

FileInputStream inputStream = null;
Scanner sc = null;
try {
    inputStream = new FileInputStream(path);
    sc = new Scanner(inputStream, "UTF-8");
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        /* your business rule here */
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }
} finally {
    if (inputStream != null) {
        inputStream.close();
    }
    if (sc != null) {
        sc.close();
    }
}
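For what it's worth, on Java 7 and later the same line-by-line read can be written a little more compactly with try-with-resources, which closes both resources even if an exception is thrown; the path argument and the processing step are placeholders:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Scanner;

    public class LineByLineReader {
        static void readLargeFile(String path) throws IOException {
            // try-with-resources closes the Scanner and the stream automatically.
            try (FileInputStream inputStream = new FileInputStream(path);
                 Scanner sc = new Scanner(inputStream, "UTF-8")) {
                while (sc.hasNextLine()) {
                    String line = sc.nextLine();
                    /* your business rule here */
                }
                // Scanner swallows IOExceptions; surface them explicitly.
                if (sc.ioException() != null) {
                    throw sc.ioException();
                }
            }
        }
    }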
BSeitkazin
  • Thanks. The memory issue is resolved by doing it in chunks. Reading a single line and writing a single line at a time might take more time; I didn't try it, as my issue was resolved by using batches of 100K records at a time and clearing the list after writing each batch. – SRana Jan 04 '19 at 13:31