0

Goal: Compare two excel files, each excel is having only one column but having half million rows. I just want to see the un-common values between two excels.

Column: only 1 Rows: 500,000

Language: JAVA

What I tried so far:

  1. Apache POI used (please see below program-1)
  2. ArrayList (please see below program-2)
  3. I tried some websites where we can upload the multiples Excel to see the differences (https://www.textcompare.org/excel/)
  4. I increased heap size to 256 to 2048m, but not working for me.

(My programs are working good for small set of data or few records)

above efforts did not work for me.

program-1::

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public static ExcelReader excel3 = null;
public static ExcelReader excel2 = null;

excel3 = new ExcelReader("C:\\Users\\DataExcelCompare\\Book3.xlsx");
excel2 = new ExcelReader("C:\\Users\\DataExcelCompare\\Book2.xlsx");

    File f3 = new File("C:\\Users\\DataExcelCompare\\Book3.xlsx");      
        FileInputStream fi3 = new FileInputStream(f3);
        Workbook workbook3 = WorkbookFactory.create(fi3);
        Sheet sheet3 = workbook3.getSheet("book3");     

        File f2 = new File("C:\\Users\\DataExcelCompare\\Book2.xlsx");      
        FileInputStream fi2 = new FileInputStream(f2);
        Workbook workbook2 = WorkbookFactory.create(fi2);
        Sheet sheet2 = workbook2.getSheet("book2"); 

         int firstRow2 = sheet2.getFirstRowNum();
         int lastRow2 = sheet2.getLastRowNum();

         int firstRow3 = sheet3.getFirstRowNum();
         int lastRow3 = sheet3.getLastRowNum();

            for(int i=firstRow2; i <= lastRow2; i++) { 
             for(int j=firstRow3; j <= lastRow3; j++) {

                String ele2 = sheet2.getRow(i).getCell(0).toString().trim();
                String ele3 = sheet3.getRow(j).getCell(0).toString().trim() ;

                 if(ele2.toString() !=  ele3.toString())
                 {
                    // some operation 
                 }
                 else
                 {
                    // some operation   
                 }

             }

program-2::

ArrayList<String> listOne = new ArrayList<String>();
ArrayList<String> listTwo =new ArrayList<String>();

         for(int i=firstRow2; i <= lastRow2; i++) {
             listOne.add(sheet2.getRow(i).getCell(0).toString().trim());
             System.out.println("added: " + sheet2.getRow(i).getCell(0).toString().trim());
         }

         for(int j=firstRow3; j <= lastRow3; j++) {
             listTwo.add(sheet3.getRow(j).getCell(0).toString().trim());
             System.out.println("added: " + sheet3.getRow(j).getCell(0).toString().trim());
         }

            listTwo.removeAll(listOne);

            System.out.println("list two : " + listTwo);
Mike ASP
  • 2,013
  • 2
  • 17
  • 24

1 Answers1

0

Reading Excel files using Apache POI is extremely memory intensive.

Memory errors when working with Excel files with less than 50,000 rows is not uncommon (let alone 500,000 x 2).

It's never going to work.

Find a 3rd party program to convert those Excel files to .txt's before they reach your Java program.

Surronie
  • 86
  • 3