1

Introduction

First, let me explain that this is not as much a question as it is a useful tip through my short time using OpenXml. However, all comments are welcome. Also, this is posted as a question because I could not figure out how to post a "suggestion" or "best practice" post.

The code taken below is directly from the MSDN do-it-yourself documentation. MSDN Do-it-yourself

...

// If the cell does not exist, return an empty string:
if (theCell != null)
{
  value = theCell.InnerText;

  // If the cell represents a numeric value, you are done. 
  // For dates, this code returns the serialized value that 
  // represents the date. The code handles strings and Booleans
  // individually. For shared strings, the code looks up the 
  // corresponding value in the shared string table. For Booleans, 
  // the code converts the value into the words TRUE or FALSE.
  if (theCell.DataType != null)
  {
    switch (theCell.DataType.Value)
    {
      case CellValues.SharedString:
        // For shared strings, look up the value in the shared 
        // strings table.
        var stringTable = wbPart.
          GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
        // If the shared string table is missing, something is 
        // wrong. Return the index that you found in the cell.
        // Otherwise, look up the correct text in the table.
        if (stringTable != null)
        {
          value = stringTable.SharedStringTable.
            ElementAt(int.Parse(value)).InnerText;
        }
        break;

      case CellValues.Boolean:
        switch (value)
        {
          case "0":
            value = "FALSE";
            break;
          default:
            value = "TRUE";
            break;
        }
        break;
    }
  }

...

The Bottleneck

Now, lets look at this line in particular. If you are simply reading a very large excel document and extracting data from the cells in the document, then this line will cause a performance bottleneck since it will iterate the entire SharedStringTable every time you try to get the string value from a cell.

value = stringTable.SharedStringTable.
            ElementAt(int.Parse(value)).InnerText;

An Optimized Solution

Insert the following field in your class.

private OpenXmlElement[] sharedStringArray;

And once you have opened your XmlSpreadSheet, initialize the array once. Of course, this approach will use more memory, but I have found that the amount of memory used is far less valuable than the execution time with the MSDN approach.

var stringTable = wbPart.
          GetPartsOfType<SharedStringTablePart>().FirstOrDefault();

this.sharedStringArray = stringTable.SharedStringTable.ToArray();

Now we may reference directly to the array instead of iterating all the objects in the enumeration.

value = this.sharedStringArray[int.Parse(value)].InnerText;

From this approach, I am assuming that the shared string table will not be modified. I am also assuming that you may be getting the value of many million cells. This is a simple modification to help you achieve the execution time desired by forfeiting a small amount of memory.

I hope you enjoy!

Tanner Bragg
  • 21
  • 1
  • 3
  • This is not a question. See your own words "this is not as much a question as it is a useful tip". – djv Oct 07 '14 at 18:51
  • Joking aside, the standard way to do this is to pose a question, maybe give the community some time to answer, then add an answer yourself. Here's just one example of this: http://stackoverflow.com/questions/25969/sql-insert-into-values-select-from – djv Oct 07 '14 at 18:53
  • Thanks, Dan. I was not sure how people typically posted such suggestions. I felt compelled to share this since it was trivial, and it shortened my execution time from several hours to just over 5 minutes. I use a Simple Api Xml (SAX) approach with this solution, but this DOM approach is good enough to demonstrate the potential bottleneck. – Tanner Bragg Oct 07 '14 at 18:59

0 Answers0