Warning: long post due to examples and results
There have been threads here about how to read an Open XML spreadsheet row with null cells in between columns. I drew some of my answers from here reading Excel Open XML is ignoring blank cells
I'm able to read a table just fine from xlsx, but it is 10 times slower than reading from CSV, while the open XML structure should(?) yield superior results.
Here's what I got for testing code base:
foreach (Row r in sheetData.Descendants<Row>())
{
sw.Start();
//find a row marked as "header" and get list of columns that define width of table
if (!headerRowFound)
{
headerRowFound = CheckOXMLHeaderRow(r, workbookPart, out headerReferences);
if (!headerRowFound)
continue;
}
rowKey++;
//////////////////////////////////////////////////////////////////////
///////////////////here we are going to do work//////////////////////
////////////////////////////////////////////////////////////////////
AddRow(rowKey, cols);
sw.Stop();
Debug.WriteLine("XLSX Row added in \t" + sw.ElapsedTicks.ToString() + "\tticks");
sw.Reset();
}
In my data a row is 68 cells, with only 5-10 of them filled out
0. For comparison, going through CSV rows takes about 300 ticks (lightning fast). 5000 rows adds in 3ms
1. Code as is processes through row enumerators only in 1-4 ticks
2. This code simply grabs all cells sequentially and stores them in a row (column order is screwed up due to OXML nature)
Hashtable cols = new Hashtable();
foreach (Cell c in r.Descendants<Cell>())
{
colKey++;
cols.Add(colKey, c);
}
//this takes about 8-10 times longer - 10-30 ticks , still lightning fast
3. If we know where to look for based on column(header) name and row number, we can do this
Hashtable cols = new Hashtable();
foreach (string column in headerReferences.Values)
{
colKey++;
cols.Add(colKey, GetCellValue(workbookPart, worksheetPart, column + r.RowIndex.ToString()));
}
This is one of the MSDN examples and it's whooping 500,000 ticks per row. Took several minutes to parse a 5,000 row spreadsheet. Not acceptable. Here were were targeting EVERY cell in a row, existing or not
4. I decided to scale back and try to retrieve value from all incoming cells out of order into HashTable
Hashtable cols = new Hashtable();
foreach (Cell c in r.Descendants<Cell>())
{
colKey++;
cols.Add(colKey, GetValueFromCell(c, workbookPart));
}
This is now 500-1,500 ticks per row. Still, lightning fast if we just store the values without any order (not a solution yet)
5. To make sure i preserve the order of columns I make an empty clone of header row for every new row and after i parse through EXISTING cells, i decide where to put them based on Hashtable retrieval
Hashtable cols = (Hashtable)emptyNewRow.Clone();
foreach (Cell c in r.Descendants<Cell>())
{
colKey = headerReferences[GetColumnName(c.CellReference)]; //what # column is this?
cols[colKey] = GetValueFromCell(c, workbookPart); //put value in that column
}
Final result is 9,000-20,000 ticks per row. 30s for 5,000 spreadsheet. Doable, but not ideal.
Here's where I stopped. Any ideas how to make it faster? How can humongous xlsx spreadsheets load so lightning fast and best i can do here is 30s for 5k rows??
Dictionaries didn't do anything for me, not even 1% improvement. And I need result in Hashtables anyway for legacy retrofit
Appendix: referenced methods
public static string GetColumnName(string cellReference)
{
// Match the column name portion of the cell name.
Regex regex = new Regex("[A-Za-z]+");
Match match = regex.Match(cellReference);
return match.Value;
}
public static string GetValueFromCell(Cell cell, WorkbookPart workbookPart)
{
int id;
string cellValue = cell.InnerText;
if (cellValue.Trim().Length > 0)
{
if (cell.DataType != null)
{
switch (cell.DataType.Value)
{
case CellValues.SharedString:
Int32.TryParse(cellValue, out id);
SharedStringItem item = GetSharedStringItemById(workbookPart, id);
if (item.Text != null)
{
cellValue = item.Text.Text;
}
else if (item.InnerText != null)
{
cellValue = item.InnerText;
}
else if (item.InnerXml != null)
{
cellValue = item.InnerXml;
}
break;
case CellValues.Boolean:
switch (cellValue)
{
case "0":
cellValue = "FALSE";
break;
default:
cellValue = "TRUE";
break;
}
break;
}
}
else
{
int excelDate;
if (Int32.TryParse(cellValue, out excelDate))
{
var styleIndex = (int)cell.StyleIndex.Value;
var cellFormats = workbookPart.WorkbookStylesPart.Stylesheet.CellFormats;
var numberingFormats = workbookPart.WorkbookStylesPart.Stylesheet.NumberingFormats;
var cellFormat = (CellFormat)cellFormats.ElementAt(styleIndex);
if (cellFormat.NumberFormatId != null)
{
var numberFormatId = cellFormat.NumberFormatId.Value;
var numberingFormat = numberingFormats.Cast<NumberingFormat>().SingleOrDefault(f => f.NumberFormatId.Value == numberFormatId);
if (numberingFormat != null && numberingFormat.FormatCode.Value.Contains("/yy")) //TODO here i should think of locales
{
DateTime dt = DateTime.FromOADate(excelDate);
cellValue = dt.ToString("MM/dd/yyyy");
}
}
}
}
}
return cellValue;
}
public static string GetCellValue(WorkbookPart wbPart, WorksheetPart wsPart, string addressName)
{
string value = String.Empty; //code from microsoft prefers null, but null is tough to work with
// Use its Worksheet property to get a reference to the cell
// whose address matches the address you supplied.
Cell theCell = wsPart.Worksheet.Descendants<Cell>().
Where(c => c.CellReference == addressName).FirstOrDefault();
// If the cell does not exist, return an empty string.
if (theCell != null)
{
value = theCell.InnerText;
// If the cell represents an integer number, you are done.
// For dates, this code returns the serialized value that
// represents the date. The code handles strings and
// Booleans individually. For shared strings, the code
// looks up the corresponding value in the shared string
// table. For Booleans, the code converts the value into
// the words TRUE or FALSE.
if (theCell.DataType != null)
{
switch (theCell.DataType.Value)
{
case CellValues.SharedString:
// For shared strings, look up the value in the shared strings table.
var stringTable = wbPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
// If the shared string table is missing, something is wrong. Return the index that is in the cell.
//Otherwise, look up the correct text in the table.
if (stringTable != null)
{
value = stringTable.SharedStringTable.ElementAt(int.Parse(value)).InnerText;
}
break;
case CellValues.Boolean:
switch (value)
{
case "0":
value = "FALSE";
break;
default:
value = "TRUE";
break;
}
break;
}
}
}
return value;
}