2

In an xlsx workbook there are some cells with some unbounded SUMIF formulas like this: SUMIF(MySheetname!$B:$B,$E4,MySheetname!$I:$I). The evaluation of one SUMIF function with Apache POI 5.0.0 lasts 100ms and the evaluation of the given workbook lasts some minutes.

One approach to improve the execution duration is to bound the formulas to something like this: SUMIF(MySheetname!$B1:$B100,$E4,MySheetname!$I1:$I100). In my case this is not a solution as I'm not the author of the xlsx files and the system gets unknown xlsx files from unknown persons (so I can't just tell them to limit the SUMIF ranges).

The current implementation of org.apache.poi.ss.formula.functions.Sumif iterates all cells in the given (unbounded) ranges, so every evaluation iterates 1048576 cells.

This is part of the implementation of method sumMatchingCells(AreaEval, I_MatchPredicate, AreaEval) :

for (int r=0; r<height; r++) {
    for (int c=0; c<width; c++) {
        result += accumulate(aeRange, mp, aeSum, r, c);
    }
}

I would like to improve the performance of this method by checking whether the row or the column actually exists in the sum range. Maybe something like this (with a non existing method sheetContainsRowIndex):

for (int r = 0; r < height; r++) {
    if (aeSum.sheetContainsRowIndex(aeSum.getFirstRow() + r)) {
        for (int c = 0; c < width; c++) {
            if (aeSum.sheetContainsColumnIndex(aeSum.getFirstColumn() + c)) {
               [...]

The LazyAreaEval contains a SheetRangeEvaluator and this contains SheetRefEvaluator s and these contain an EvaluationSheet and this knows at least the getLastRowNum(). Unfortunately this attribute chain is private.

Any idea how to achieve this? Or any other idea how to improve performance for SUMIF execution?

  • There was an answer from Axel in the noon today that iterated every formula and limited every AreaPtgBase.lastRow to the last row in the sheet. That seemed to work fine in my cases (as long as the areas start at row index 0). A little side effect: `MATCH(G$3,mysheetname!$2:$2,0)` became `MATCH(G$3,mysheetname!$A$2:$XFD$2,0)`. LibreOffice 7.0.5.2 showed #NAME messages for that because the last column is AMJ. Because of this an extension likewise for the column is necessary. Evaluating one sheet with only a few rows took 63 ms instead of 35000 ms. Great. Thank you, Axel! –  May 01 '21 at 20:58
  • I wonder whether there is some case where this does not work because of asymmetric area changes. Maybe `SUMIF(A1:A9990;"abc";C3:C9993)` would be changed to `SUMIF(A1:A5;"abc";C3:C9993)` –  May 01 '21 at 20:58
  • 1
    See my current answer. It changes only full column references in formulas. So full row references and other column references will not be affected even if they start at row index 0. And it differentiates now between 2D-reference in same sheet and 3D-referenced to another sheet. Because then last row of the other sheet must be taken. – Axel Richter May 02 '21 at 07:32

1 Answers1

1

Patching the apache poi formula evaluation needs a deep immersion into the sources and rummaging through the evaluation process. That is nothing I will do.

But a workaround could be to replace all full column references in formulas by area references from row 1 to last row in sheet before evaluation.

If you only read the workbook then this only affects the random access memory and not the stored file. Of course, if you need to save the changed workbook, then it will affect the stored file. Then the workaround might be not usable.

This has noticeable effect on process duration when there are multiple formulas having full column references in the worksheet, at least using *.xlsx (XSSF) and although the additional replacement process for each formula needs be done.

Complete code example:

import java.io.FileInputStream;

import org.apache.poi.ss.formula.*;
import org.apache.poi.ss.formula.ptg.*;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.*;
import org.apache.poi.hssf.usermodel.*;
import org.apache.poi.ss.SpreadsheetVersion;

public class ExcelEvaluateFullColumnFormulas {

 private static String replaceFullColumnReferences(XSSFSheet sheet, String formula) {
  //System.out.println(formula);
     
  XSSFWorkbook workbook = sheet.getWorkbook();
  XSSFEvaluationWorkbook evaluationWorkbook = XSSFEvaluationWorkbook.create(workbook);
  
  Ptg[] ptgs = FormulaParser.parse(formula, (FormulaParsingWorkbook)evaluationWorkbook, 
   FormulaType.CELL, sheet.getWorkbook().getSheetIndex(sheet));
   
  for (int i = 0; i < ptgs.length; i++) {
   if (ptgs[i] instanceof AreaPtgBase) { // the operand Ptg is an area reference
    AreaPtgBase ref = (AreaPtgBase) ptgs[i];
    if (ref.getFirstRow() == 0 && ref.getLastRow() == SpreadsheetVersion.EXCEL2007.getLastRowIndex()) { // only for full column area references
     int lastRowInSheet = SpreadsheetVersion.EXCEL2007.getLastRowIndex();
     if (ref instanceof Area2DPtgBase) { // the area reference is a 2D area reference in same sheet
      lastRowInSheet = sheet.getLastRowNum(); // get last row of this sheet
     } else if (ref instanceof Area3DPxg) { // the area reference is a 3D area reference in another sheet
      Area3DPxg ref3D = (Area3DPxg)ref; 
      String sheetName = ref3D.getSheetName();
      lastRowInSheet = workbook.getSheet(sheetName).getLastRowNum(); // get last row of referenced sheet
     }      
     ref.setLastRow(lastRowInSheet);
     formula = FormulaRenderer.toFormulaString((FormulaRenderingWorkbook)evaluationWorkbook, ptgs);
    }
   }
  }
  //System.out.println(formula);
  return formula;
  
 }

 public static void main(String[] args) throws Exception {

  DataFormatter formatter = new DataFormatter();
  Workbook workbook = WorkbookFactory.create(new FileInputStream("test.xlsx"));  
  FormulaEvaluator evaluator = workbook.getCreationHelper().createFormulaEvaluator();

  Sheet sheet = workbook.getSheetAt(0);
  java.time.LocalDateTime startTime = java.time.LocalDateTime.now();
  for (Row row : sheet) {
   for (Cell cell : row) {
    ///*
    if (cell.getCellType() == CellType.FORMULA) {
     if (sheet instanceof XSSFSheet){ // do it for XSSF only, not necessary for HSSF.
      String formula = cell.getCellFormula();
      formula = replaceFullColumnReferences((XSSFSheet)sheet, formula);
      cell.setCellFormula(formula);
     }
    }
    //*/
    String value = formatter.formatCellValue(cell, evaluator);
    System.out.print(value + "\t");
   }
   System.out.println();
  }
  
 java.time.LocalDateTime endTime = java.time.LocalDateTime.now();
 java.time.Duration duration = java.time.Duration.between(startTime, endTime);
 System.out.println("process duration: " + duration);
 
 workbook.close();
 }
}

Comment out the part

...
    /*
    if (cell.getCellType() == CellType.FORMULA) {
     if (sheet instanceof XSSFSheet){ // do it for XSSF only, not necessary for HSSF.
      String formula = cell.getCellFormula();
      formula = replaceFullColumnReferences((XSSFSheet)sheet, formula);
      cell.setCellFormula(formula);
     }
    }
    */
...

to see the difference.

Axel Richter
  • 56,077
  • 6
  • 60
  • 87
  • Works great! Thanks :-) Extension for columns: 1) Collect last column index by sheet name: Map lastColumnIndexBySheetName = new HashMap<>(); for (int i = 0; i < workbook.getNumberOfSheets(); i++) { Sheet sheet = workbook.getSheetAt(i); int lastColumn = 0; for (Row row : sheet) {lastColumn = Math.max(lastColumn, row.getLastCellNum());} lastColumnIndexBySheetName.put(sheet.getSheetName(), lastColumn); } –  May 03 '21 at 20:42
  • 2) Use it likewise Axel's row if (ref.getFirstColumn() == 0 && ref.getLastColumn() == SpreadsheetVersion.EXCEL2007.getLastColumnIndex()) { int lastColumnInSheet = SpreadsheetVersion.EXCEL2007.getLastColumnIndex(); if (ref instanceof Area2DPtgBase) { lastColumnInSheet = lastColumnIndexBySheetName.get(sheet.getSheetName()); [...] –  May 03 '21 at 20:43
  • [...] } else if (ref instanceof Area3DPxg) { Area3DPxg ref3D = (Area3DPxg) ref; String sheetName = ref3D.getSheetName(); lastColumnInSheet = lastColumnIndexBySheetName.get(sheetName); } ref.setLastColumn(lastColumnInSheet); formula = FormulaRenderer.toFormulaString(evaluationWorkbook, ptgs); } –  May 03 '21 at 20:43