1

As the title states, I have a large Excel file (>200 sheets) that I need to add data to. I do not want to create new cells, I only want to modify existing ones.

I tried using Apache Poi but my application runs out of memory even with Xms and Xmx set to 8g. The only option for low-memory writing is seemingly with SXSSF. The problem is that it only works for creating new cells and does not allow modifying existing ones. I also tried using the event API in order to process the sheet's XML, but it only seems to work for read operations. I've been trying to use an XMLEventWriter but I can't find a way to access the sheets' XML data which works for writing. Is there a way to access an Excel file's XML data other than with XSSFReader?

jimturmoy
  • 33
  • 8
  • 3
    If the problem is only parsing all 200+ sheets to `XSSFSheet`s together, then you could overwriting `XSSFWorkbook.parseSheet` as shown in [HybridStreaming.java](https://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/xssf/streaming/examples/). Using this you could have only one or some of the sheets parsed to `XSSFSheet`. But of course sometimes even this is not possible because all of the sheets depend on each other. – Axel Richter Jun 24 '19 at 15:29
  • 2
    And if even only single sheets are too big for `XSSFSheet`, then using the high level `apache poi` classes is not possible. Then only using pure `XML` methods will be possible. I have shown how to do that for reading and writing [using StAX](https://stackoverflow.com/search?q=user%3A3915431+stax). – Axel Richter Jun 24 '19 at 15:29
  • The problem mainly comes from writing. With POI I'm able to open the workbook and modify the cells, but the application crashes on the final call to workbook.write(). I wrote some code with StAX but I can't figure out how to access the sheets for writing. The sheets are easily accessible for reading by using XSSFReader to generate an InputStream which is then used to create the XML reader. However there are no classes to generate an Outputstream for writing. Writing to a regular .xml file would be simple but I can't figure out to pass my Excel document to an XML writer. – jimturmoy Jun 25 '19 at 06:52
  • 1
    I have linked my examples using `StAX`. They contain code for reading as well as writing. But of course none of those is general usable for reading and writing all kinds of `Excel` files. Each `Excel` workbook needs it's own code dependent on it's structure and on what content shall be changed. – Axel Richter Jun 25 '19 at 08:57
  • The examples that feature writing only do so at the end of the document from what I can tell. I am trying to modify existing elements/cells. Is this possible with StAX? – jimturmoy Jun 25 '19 at 10:26

1 Answers1

2

As told in comments above there is no one fits all solution using pure XML reading and writing the Office Open XML spreadsheets. Each Excel workbook needs it's own code dependent on it's structure and on what content shall be changed.

This is because apache poi's high level classes provides a meta level to avoid this. But this needs memory to work. And for very big workbooks it needs much memory. To avoid memory consumption through manipulating the XML directly this meta level is not usable. So one must know the XML structure of a worksheet and the meaning of the XML elements used.

So if we have a Excel workbook having a first sheet having strings in column A and numbers in column B, then we could changing every fifth row using StAX for manipulating the XML directly using following code:

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackagePart;

import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;

import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTRst;

import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

import javax.xml.namespace.QName;

import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;

import java.util.regex.Pattern;

class StaxReadAndChangeTest {

 public static void main(String[] args) throws Exception {
  File file = new File("ReadAndWriteTest.xlsx");
  OPCPackage opcpackage = OPCPackage.open(file);

  //since there are strings in the sheet data, we need the SharedStringsTable
  PackagePart sharedstringstablepart = opcpackage.getPartsByName(Pattern.compile("/xl/sharedStrings.xml")).get(0);
  SharedStringsTable sharedstringstable = new SharedStringsTable();
  sharedstringstable.readFrom(sharedstringstablepart.getInputStream());

  //get first worksheet
  PackagePart sheetpart = opcpackage.getPartsByName(Pattern.compile("/xl/worksheets/sheet1.xml")).get(0);

  //get XML reader and writer            
  XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(sheetpart.getInputStream());
  XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(sheetpart.getOutputStream());

  XMLEventFactory eventFactory = XMLEventFactory.newInstance();

  int rowsCount = 0;
  int colsCount = 0;
  boolean cellAfound = false;
  boolean cellBfound = false;

  while(reader.hasNext()){ //loop over all XML in sheet1.xml
   XMLEvent event = (XMLEvent)reader.next();
   if(event.isStartElement()) {
    StartElement startElement = (StartElement)event;
    QName startElementName = startElement.getName();
    if(startElementName.getLocalPart().equalsIgnoreCase("row")) { //start element of row
     rowsCount++;
     colsCount = 0;
    } else if (startElementName.getLocalPart().equalsIgnoreCase("c")) { //start element of cell
     colsCount++;
     cellAfound = false;
     cellBfound = false;
     if (rowsCount % 5 == 0) { // every 5th row
      if (colsCount == 1) { // cell A
       cellAfound = true;
      } else if (colsCount == 2) { // cell B
       cellBfound = true;
      } 
     }
    } else if (startElementName.getLocalPart().equalsIgnoreCase("v")) { //start element of value
     if (cellAfound) {
      // create new rich text content for cell A
      CTRst ctstr = CTRst.Factory.newInstance();
      ctstr.setT("changed String Value A" + (rowsCount));
      //int sRef = sharedstringstable.addEntry(ctstr);
      int sRef = sharedstringstable.addSharedStringItem(new XSSFRichTextString(ctstr));
      // set the new characters for A's value in the XML
      if (reader.hasNext()) {
       writer.add(event); // write the old event
       event = (XMLEvent)reader.next(); // get next event - should be characters
       if (event.isCharacters()) {
        Characters value = eventFactory.createCharacters(Integer.toString(sRef));
        event = value;
       } 
      }        
     } else if (cellBfound) {
      // set the new characters for B's value in the XML
      if (reader.hasNext()) {
       writer.add(event); // write the old event
       event = (XMLEvent)reader.next(); // get next event - should be characters
       if(event.isCharacters()) { 
        double oldValue = Double.valueOf(((Characters)event).getData()); // old double value
        Characters value = eventFactory.createCharacters(Double.toString(oldValue * rowsCount));
        event = value;         
       }         
      }
     }
    }
   }
   writer.add(event); //by default write each read event
  }
  writer.flush();

  //write the SharedStringsTable
  OutputStream out = sharedstringstablepart.getOutputStream();
  sharedstringstable.writeTo(out);
  out.close();
  opcpackage.close();

 }
}

This will be much less memory consuming than apache poi's XSSF classes. But, as said, it only works exactly for this kind of Excel workbook having a first sheet having strings in column A and numbers in column B.

Axel Richter
  • 56,077
  • 6
  • 60
  • 87
  • I managed to make something that works using you example as inspiration, thanks for the help. The only thing I have left to do is figure out how to force the recalculation of the formulas in the sheet. Is there a flag in workbook.xml which does that? – jimturmoy Jun 26 '19 at 13:31
  • Found the answer to my previous question (forceFullCalc) here: https://c-rex.net/projects/samples/ooxml/e1/Part4/OOXML_P4_DOCX_calcPr_topic_ID0EWJ63.html . Hope this can help someone else – jimturmoy Jun 26 '19 at 14:23