0

I am trying to create a simple Java program to convert Excel files from inline strings to shared strings table to reduce the file size.

I know Apache POI has a SXSSFWorkbook class that can do the job, but to read large xlsx files with inline strings using the SAX XML parser can still blow up. For example 150,000 rows x 50 columns cells.

Is there a simple solution without using Apache POI libraries to do the simple job? Anyone knows?

Trevor
  • 218
  • 3
  • 15
  • "...to do the simple job": It is the opposite of a simple job ;-). Using inline strings all strings are in the cells of the sheet's XML file. To create a `sharedStrings.xml` the need is: Loop over all cells in the sheet's XML file to get the inline strings. Then lookup the `sharedStrings.xml` wheter the string is already there. If so, get the ID, else create a new string in `sharedStrings.xml` and get the ID. Then put the ID into the cell in the sheet's XML file instead of the inline string value. Then next cell in the sheet's XML file. – Axel Richter Dec 20 '17 at 06:20

2 Answers2

4

Although first creating sheets having inline strings in them and then replacing those inline strings with shared strings will be horrible inefficient, I will providing an answer to the question how it could be done.

The need is: Loop over all cells in the sheet's XML file to get the inline strings. Then lookup the sharedStrings.xml whether the string is already there. If so, get the ID, else create a new string in sharedStrings.xml and get the ID. Then put the ID into the cell in the sheet's XML file instead of the inline string value.

The following code is doing this. If TestInlineStrings.xlsx has inline strings in first sheet, then those inline strings will be replaced with shared strings after this code had run.

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackagePart;

import org.apache.poi.xssf.model.SharedStringsTable;

import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTRst;

import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;

import javax.xml.namespace.QName;

import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.Iterator;

class StaxReplaceInlineStrings {

 public static void main(String[] args) {
  try {

   File file = new File("TestInlineStrings.xlsx");
   OPCPackage opcpackage = OPCPackage.open(file);

   //if there are strings in the sheet data, we need the SharedStringsTable
   PackagePart sharedstringstablepart = opcpackage.getPartsByName(Pattern.compile("/xl/sharedStrings.xml")).get(0);
   SharedStringsTable sharedstringstable = new SharedStringsTable();
   sharedstringstable.readFrom(sharedstringstablepart.getInputStream());

   PackagePart sheetpart = opcpackage.getPartsByName(Pattern.compile("/xl/worksheets/sheet1.xml")).get(0);

   XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(sheetpart.getInputStream());
   XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(sheetpart.getOutputStream());

   XMLEventFactory eventFactory = XMLEventFactory.newInstance();


   while(reader.hasNext()){ //loop over all XML in sheet1.xml

    boolean cellReplaced = false; //marker whether cell having inline string was replaced by cell having shared string

    XMLEvent event = (XMLEvent)reader.next();
    if(event.isStartElement()){
     StartElement startElement = (StartElement)event;
     QName startElementName = startElement.getName();
     if (startElementName.getLocalPart().equalsIgnoreCase("c")) { //start element of cell
      Attribute attribute;
      StartElement cellStart = startElement; //remember cell start
      Iterator attributeIterator = cellStart.getAttributes(); //get cell's attributes
      while (attributeIterator.hasNext()) {
       attribute = (Attribute)attributeIterator.next();
       if ("t".equals(attribute.getName().getLocalPart())) { //cell has type attribute
        String tvalue = attribute.getValue();
        if ("inlineStr".equals(tvalue)) { //cell type is inline string
         String inlineString = "";
         startElement = (StartElement)(XMLEvent)reader.next(); //read next start element - error if is not a start element
         startElementName = startElement.getName();
         if (startElementName.getLocalPart().equalsIgnoreCase("is")) { //start element of inline string     
          startElement = (StartElement)(XMLEvent)reader.next(); //read next start element - error if is not a start element
          startElementName = startElement.getName();
          if (startElementName.getLocalPart().equalsIgnoreCase("t")) { //start element of text
           Characters characters = (Characters)(XMLEvent)reader.next(); //read next characters element - error if is not a characters element   
           inlineString = characters.getData(); //get text data  
System.out.println(inlineString); 
          }
         }

         //create shared string in shared strings table
         CTRst ctstr = CTRst.Factory.newInstance();
         ctstr.setT(inlineString);
         int sRef = sharedstringstable.addEntry(ctstr);

         //we are replacing the cell element so skip elements until end element of cell
         while(reader.hasNext()) {
          event = (XMLEvent)reader.next();
          if(event.isEndElement()){
           EndElement endElement = (EndElement)event;
           QName endElementName = endElement.getName();
           if (endElementName.getLocalPart().equalsIgnoreCase("c")) { //end element of cell 
            break;
           }
          }
         }

         //create the new cell element having the shared string
         Attribute r = cellStart.getAttributeByName(new QName("r"));
         Attribute s = cellStart.getAttributeByName(new QName("s"));
         Attribute t = eventFactory.createAttribute("t", "s");
         List attributeList = Arrays.asList(new Attribute[]{t});
         if (r != null && s != null) {
          attributeList = Arrays.asList(new Attribute[]{r, s, t});
         } else if (r != null) {
          attributeList = Arrays.asList(new Attribute[]{r, t});
         } else if (s != null) {
          attributeList = Arrays.asList(new Attribute[]{s, t});
         }
System.out.println(attributeList);
         StartElement newCellStart = eventFactory.createStartElement(new QName("c"), attributeList.iterator(), null);
         writer.add(newCellStart);
         StartElement newCellValue = eventFactory.createStartElement(new QName("v"), null, null);
         writer.add(newCellValue);
         Characters value = eventFactory.createCharacters(Integer.toString(sRef));
         writer.add(value);         
         EndElement newCellValueEnd = eventFactory.createEndElement(new QName("v"), null);
         writer.add(newCellValueEnd);
         EndElement newCellEnd = eventFactory.createEndElement(new QName("c"), null);
         writer.add(newCellEnd);

         cellReplaced = true; // mark that cell was replaced
         break;
        }
       } 
      }
     }
    }
    if (!cellReplaced) {
     writer.add(event); //by default write each read event, except cell was replaced
    }
   }
   writer.flush();

   //write the SharedStringsTable
   OutputStream out = sharedstringstablepart.getOutputStream();
   sharedstringstable.writeTo(out);
   out.close();

   opcpackage.close();

  } catch (Exception ex) {
     ex.printStackTrace();
  }
 }
}
Axel Richter
  • 56,077
  • 6
  • 60
  • 87
  • Thanks Axel. It is what I was looking for. Is it possible to know the number of elements in advance before calling the XMLEventReader iterator (yours is reader)? – Trevor Jan 01 '18 at 03:35
  • 1
    @Trevor: No, this is event based and this also is the reason for the very low memory consumption. And [StAX](https://docs.oracle.com/javase/tutorial/jaxp/stax/why.html) is forward only. So if one needs to know the count of events, then this one must iterate over all at least once. See also https://stackoverflow.com/questions/9720195/what-is-the-best-way-to-get-the-count-length-size-of-an-iterator. – Axel Richter Jan 01 '18 at 07:50
  • Axel: Yes that makes sense. Now this StAX solution works very well for my needs. Thank you. It takes 59 seconds (using Intel Core i5 CPU) to replace the inline strings in a large xlsx spreadsheet with 116575 rows x 46 columns (over 5.3 million cells). Not bad. – Trevor Jan 02 '18 at 04:32
  • @AxelRichter I have another doubt. http://poi.apache.org/components/spreadsheet/how-to.html#sxssf says "inline strings instead of a shared strings table" is more efficient but here it seems that shared sharing is better. Could you please explain both situations – nantitv Sep 28 '18 at 13:30
  • 1
    @nantitv: Using inline strings instead of shared strings will be better for streaming performance since then the additional shared strings table needs not to be used. But then each single string occurrence needs to be stored separately in the sheet even if it occurs 1000ds of times. Using the shared strings table avoids this. There a string is only stored once and only it's id then is stored 1000ds of times in the sheet if needed. – Axel Richter Sep 28 '18 at 13:51
  • @AxelRichter Thanks. Could you please please explain what is really streaming in SXSSF. The official doc never explain that. It will be great if you could give some details about SXSSF – nantitv Sep 28 '18 at 14:15
  • 1
    @nantitv: Well it is explained in https://poi.apache.org/components/spreadsheet/how-to.html#sxssf. "Older rows that are no longer in the window become inaccessible, as they are written to the disk. " means for each `SXSSFSheet` there is a temporary file in which the row data will be streamed in. At the end of the process all temporary files are combined in a workbook. – Axel Richter Sep 28 '18 at 14:36
1

In Adding a row to a large xlsx file (Out of Memory) I have provided an approach using StAX for writing rows into a Excel sheet without need to opening the whole workbook. But shared strings table is used.

So here is a slightly modified version.

You will start having a ReadAndWriteTest.xlsx like this:

enter image description here

And each time you run the code, 100,000 rows will be added having a random string in column A and a random double value in column B. The strings will be managed by a shared strings table. So there will be much less unique strings in this shared strings table than strings are in the sheet in sum.

I have this approach in productive usage, for sure more complex and more structured in code, since this code sample only shall show the approach in simple code. And it works well and is more performant than SXSSF and provides reading and writing.

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackagePart;

import org.apache.poi.xssf.model.SharedStringsTable;

import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTRst;

import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;

import javax.xml.namespace.QName;

import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

import java.util.concurrent.ThreadLocalRandom;

class StaxReadAndWriteTest {

 public static void main(String[] args) {
  try {

   String loremipsum = "Lorem ipsum dolor sit amet ne mei euismod interpretaris est te iusto causae doctus.";

   File file = new File("ReadAndWriteTest.xlsx");
   OPCPackage opcpackage = OPCPackage.open(file);

   //if there are strings in the sheet data, we need the SharedStringsTable
   PackagePart sharedstringstablepart = opcpackage.getPartsByName(Pattern.compile("/xl/sharedStrings.xml")).get(0);
   SharedStringsTable sharedstringstable = new SharedStringsTable();
   sharedstringstable.readFrom(sharedstringstablepart.getInputStream());

   PackagePart sheetpart = opcpackage.getPartsByName(Pattern.compile("/xl/worksheets/sheet1.xml")).get(0);

   XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(sheetpart.getInputStream());
   XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(sheetpart.getOutputStream());

   XMLEventFactory eventFactory = XMLEventFactory.newInstance();

   int rowsCount = 0;

   while(reader.hasNext()){ //loop over all XML in sheet1.xml
    XMLEvent event = (XMLEvent)reader.next();
    writer.add(event); //by default write each readed event

    if(event.isStartElement()){
     StartElement startElement = (StartElement)event;
     QName startElementName = startElement.getName();
     if(startElementName.getLocalPart().equalsIgnoreCase("row")) { //start element of row
      boolean rowStart = true;
      rowsCount++;
      do {
       event = (XMLEvent)reader.next(); //find this row's end
       writer.add(event); //by default write each readed event

       if(event.isEndElement()){
        EndElement endElement = (EndElement)event;
        QName endElementName = endElement.getName();
        if(endElementName.getLocalPart().equalsIgnoreCase("row")) { //end element of row
         rowStart = false;
         //we assume that there is nothing else (character data) between end element of row and next element 
         XMLEvent nextElement = (XMLEvent)reader.peek();
         QName nextElementName = null;
         if (nextElement.isStartElement()) nextElementName = ((StartElement)nextElement).getName();
         else if (nextElement.isEndElement()) nextElementName = ((EndElement)nextElement).getName();
         if(!nextElementName.getLocalPart().equalsIgnoreCase("row")) { //next is not start element of row
          //we have the last row, so we write new rows now 

          for (int i = 0; i < 100000; i++) {

           StartElement newRowStart = eventFactory.createStartElement(new QName("row"), null, null);
           writer.add(newRowStart);

//start cell A
           Attribute attribute = eventFactory.createAttribute("t", "s");
           List attributeList = Arrays.asList(attribute);
           StartElement newCellStart = eventFactory.createStartElement(new QName("c"), attributeList.iterator(), null);
           writer.add(newCellStart);

           CTRst ctstr = CTRst.Factory.newInstance();

           //create a random string from loremipsum
           int length = ThreadLocalRandom.current().nextInt(5, 20);
           int index = ThreadLocalRandom.current().nextInt(0, loremipsum.length() - length);
           //set randoom string in CTRst
           ctstr.setT(loremipsum.substring(index, index + length).trim());
           //update SharedStringsTable with CTRst and get sRef as the ID of this string
           int sRef = sharedstringstable.addEntry(ctstr);

           StartElement newCellValue = eventFactory.createStartElement(new QName("v"), null, null);
           writer.add(newCellValue);

           //set sRef of the string as content of cell A
           Characters value = eventFactory.createCharacters(Integer.toString(sRef));
           writer.add(value);         

           EndElement newCellValueEnd = eventFactory.createEndElement(new QName("v"), null);
           writer.add(newCellValueEnd);

           EndElement newCellEnd = eventFactory.createEndElement(new QName("c"), null);
           writer.add(newCellEnd);
//end cell A
//start cell B
           newCellStart = eventFactory.createStartElement(new QName("c"), null, null);
           writer.add(newCellStart);

           newCellValue = eventFactory.createStartElement(new QName("v"), null, null);
           writer.add(newCellValue);

           //set random double value as content of cell B
           value = eventFactory.createCharacters(""+ThreadLocalRandom.current().nextDouble((double)length));
           writer.add(value);         

           newCellValueEnd = eventFactory.createEndElement(new QName("v"), null);
           writer.add(newCellValueEnd);

           newCellEnd = eventFactory.createEndElement(new QName("c"), null);
           writer.add(newCellEnd);
//end cell B

           EndElement newRowEnd = eventFactory.createEndElement(new QName("row"), null);
           writer.add(newRowEnd);

           rowsCount++;
          }
         }
        }
       }
      } while (rowStart);
     }
    }
   }

   writer.flush();

   //write the SharedStringsTable
   OutputStream out = sharedstringstablepart.getOutputStream();
   sharedstringstable.writeTo(out);
   out.close();

   opcpackage.close();

  } catch (Exception ex) {
     ex.printStackTrace();
  }
 }
}
Axel Richter
  • 56,077
  • 6
  • 60
  • 87
  • Thanks Axel for the ideas. I noticed the code creates a series of random sub-strings that come from the `loremipsum` string. My aim is to just convert from the inline string based xlsx to the shared strings table based xlsx output where each word is not random. I think your code is just to show as an example how you can create the shared strings table? So how do you update `sharedStrings.xml` and `sheet1.xml` in one go? – Trevor Dec 24 '17 at 03:16
  • 1
    @Trevor: The code shows how writing rows into a Excel sheet without need to opening the whole workbook. But shared strings table is used. So memory usage is minimized without the disadvantage of inline strings. First creating sheets having inline strings in them and then replacing those inline strings with shared strings will be horrible inefficient. This is not was the code is for. But see my other answer. – Axel Richter Dec 24 '17 at 08:19