I have a CSV file 'Master list' with 800K records; each record has 13 values. The combination of cell[0] and cell[1] gives a unique record, and I need to update the value of cell[12], the status column, for every record.
I have another CSV file, say 'Updated subset list'. This is effectively a subset of 'Master list'. For all the records in my 2nd CSV, which is much smaller (say 10,000 records), I need to update the status column (cell[12]) of each matching record.
I tried a plain BufferedReader, CSVParser from commons-csv, and CsvParser from univocity-parsers, but reading the whole file and building a List of 800K objects throws an OutOfMemoryError.
The same code will be deployed on different servers, so I want an efficient way to read a huge CSV file and update that same file.
Partially reading the huge file and writing into the same file might corrupt the data.
Any suggestions on how I can do this?
File inputF = new File(inputFilePath);
if (inputF.exists()) {
    // try-with-resources ensures the reader is closed even if an exception is thrown
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream(inputF)))) {
        // skip the header of the file
        String line = br.readLine();
        mandatesList = new ArrayList<DdMandates>();
        while ((line = br.readLine()) != null) {
            mandatesList.add(mapToItem(line));
        }
    }
}
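To avoid the OutOfMemoryError entirely, the file can be streamed one line at a time into a temporary file, which then replaces the original only after the whole rewrite succeeds; this also addresses the corruption concern, since the master file is never half-written. Below is a minimal sketch of that approach; updateLine is a hypothetical placeholder for the real per-record update logic, and the class name is my own.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StreamingCsvRewrite {

    // Hypothetical per-line update; replace with the real status logic.
    static String updateLine(String line) {
        return line;
    }

    // Stream the file line by line into a temp file, then swap it over the
    // original, so a crash mid-write never corrupts the master file.
    static void rewriteInPlace(Path input) throws IOException {
        Path temp = input.resolveSibling(input.getFileName() + ".tmp");
        try (BufferedReader br = Files.newBufferedReader(input);
             BufferedWriter bw = Files.newBufferedWriter(temp)) {
            String header = br.readLine();          // copy the header unchanged
            if (header != null) {
                bw.write(header);
                bw.newLine();
            }
            String line;
            while ((line = br.readLine()) != null) { // one record in memory at a time
                bw.write(updateLine(line));
                bw.newLine();
            }
        }
        // Replace the original only after the rewrite has fully succeeded.
        Files.move(temp, input, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

With this pattern the heap holds only one line at a time, so the file size no longer matters.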
The memory issue is resolved by processing in chunks. Reading a single line and writing a single line might take more time; I didn't try it, since my issue was resolved by using batches of 100K records at a time and clearing the list after writing each batch.
Now the issue is that updating the status takes too much looping...
I have two CSVs. The master sheet ('Master list') has 800K records, and I also have a subset CSV with about 10K records. This subset CSV is updated from another system and carries an updated status, say 'OK' and 'NOT OK'. I need to write this status back into the master sheet. How can I do that in the best possible way? The dumbest way, which I am using, is the following:
// Master list is processed in batches, but it contains 800K records and 13 columns
List<DdMandates> mandatesList = new ArrayList<DdMandates>();
// Subset list holds the updated statuses
List<DdMandates> updatedMandatesList = new ArrayList<DdMandates>();

// Read the subset CSV, map each line to a DdMandates item, and add it to the updated mandates list
File inputF = new File(Property.inputFilePath);
if (inputF.exists()) {
    InputStream inputFS = new FileInputStream(inputF);
    BufferedReader br = new BufferedReader(new InputStreamReader(inputFS, "UTF-8"));
    checkFilterAndmapToItem(br);
    br.close();
}
In the method checkFilterAndmapToItem(BufferedReader br):
private static void checkFilterAndmapToItem(BufferedReader br) {
    FileWriter fileWriter = null;
    try {
        // skip the header of the CSV
        String line = br.readLine();
        int batchSize = 0, currentBatchNo = 0;
        fileWriter = new FileWriter(Property.outputFilePath);
        // write the CSV file header
        fileWriter.append(FILE_HEADER.toString());
        // add a new line separator after the header
        fileWriter.append(NEW_LINE_SEPARATOR);
        if (!Property.batchSize.isEmpty()) {
            batchSize = Integer.parseInt(Property.batchSize.trim());
        }
        while ((line = br.readLine()) != null) {
            DdMandates item = new DdMandates();
            String[] p = line.concat(" ").split(SEPERATOR);
            // parse each p[x] and map it to the item of type DdMandates
            // iterate over the updated mandates list to check whether this item is present there;
            // if so, get that item and copy its status onto item -- this is the inner for loop
            // over roughly 10K elements
            mandatesList.add(item);
            if (batchSize != 0 && mandatesList.size() == batchSize) {
                currentBatchNo++;
                logger.info("Batch no. : " + currentBatchNo + " is executing...");
                processOutputFile(fileWriter);
                mandatesList.clear();
            }
        }
        // process the output file here for the last batch ...
    } catch (IOException e) {
        logger.error("Failed to process CSV", e);
    } finally {
        if (fileWriter != null) {
            try { fileWriter.close(); } catch (IOException ignored) { }
        }
    }
}
So the while loop runs 800K iterations, with an inner loop of 10K iterations for each element: at least 800K * 10K comparisons in total.
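The nested loop can be replaced by a one-time HashMap lookup keyed on the unique cell[0] + cell[1] combination: build the map once from the 10K subset, then do a single pass over the 800K master records with O(1) lookups, bringing the work down to roughly 800K + 10K operations. A minimal sketch, assuming comma-separated lines, the status in cell[12], and no quoted fields containing the separator (the class and method names are my own):

```java
import java.util.HashMap;
import java.util.Map;

public class StatusMerge {

    static final String SEPARATOR = ",";

    // Build a lookup from the subset file, keyed on the unique
    // combination of cell[0] and cell[1], mapping to the updated status.
    static Map<String, String> buildStatusIndex(Iterable<String> subsetLines) {
        Map<String, String> index = new HashMap<>();
        for (String line : subsetLines) {
            String[] cells = line.split(SEPARATOR, -1);
            // \u0001 keeps "a,bc" and "ab,c" from colliding as keys
            index.put(cells[0] + "\u0001" + cells[1], cells[12]);
        }
        return index;
    }

    // One O(1) map lookup per master record instead of scanning the 10K subset list.
    static String applyStatus(String masterLine, Map<String, String> index) {
        String[] cells = masterLine.split(SEPARATOR, -1);
        String updated = index.get(cells[0] + "\u0001" + cells[1]);
        if (updated != null) {
            cells[12] = updated;
        }
        return String.join(SEPARATOR, cells);
    }
}
```

buildStatusIndex is called once before the main read loop; inside the loop each master line goes through applyStatus and is written straight out, so no 800K list is needed at all.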
Please help me find the best possible approach and reduce the iterations.
Thanks in advance.