1

I just read a big xlsx file using the DOM approach with Open XML SDK. It works fine; however, it takes forever to do so. So I wanted to do the same thing using the SAX approach. However, I am not getting anywhere on this one. What I do in the DOM approach is that for each worksheet in the workbook I get the name of the sheet. Then I assume that the first row has all of the column names. Next, I create a class on-the-fly that has all of the properties that are listed in the first row. After that, I read the rest of the rows. For each row, I create a new object with the custom class that I created on the fly. Then I iterate through each of the cell in the row to populate the object with the values that I get.

Here is the code that I am using to do the task that I just described using the DOM approach.

public static List<Object> ConvertExcelArchiveToListObjects(string filePath)
    {
        ... 
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false))
        {
            WorkbookPart wbPart = spreadsheetDocument.WorkbookPart;
            Sheets theSheets = wbPart.Workbook.Sheets;

            SharedStringTablePart sstPart = spreadsheetDocument.WorkbookPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
            ...
            var sheets = wbPart.Workbook.Sheets.Cast<Sheet>().ToList();

            foreach (WorksheetPart worksheetpart in wbPart.WorksheetParts)
            {
                Worksheet worksheet = worksheetpart.Worksheet;

                string partRelationshipId = wbPart.GetIdOfPart(worksheetpart);
                var correspondingSheet = sheets.FirstOrDefault(
                    s => s.Id.HasValue && s.Id.Value == partRelationshipId);
                Debug.Assert(correspondingSheet != null);
                // Grab the sheet name
                string sheetName = correspondingSheet.GetAttribute("name", "").Value;

                ...

                dynamic expandoObjectClass = new ExpandoObject();
                List<Object> listObjectsCustomClasses  = new List<Object>();
                foreach (var dataRow in rowContent)
                {
                    Type generatedType = typeBuilder.CreateType();
                    object generatedObject = Activator.CreateInstance(generatedType);

                    PropertyInfo[] properties = generatedType.GetProperties();

                    int propertiesCounter = 0;

                    // Loop over the values that we will assign to the properties

                    var rowCells = dataRow.Descendants<Cell>();
                    var value = string.Empty;
                    foreach (var rowCell in rowCells)
                    {
                        if (rowCell.DataType != null
                            && rowCell.DataType.HasValue
                            && rowCell.DataType == CellValues.SharedString
                            && int.Parse(rowCell.CellValue.InnerText) < ssTable.ChildElements.Count)
                        {
                            value = ssTable.ChildElements[int.Parse(rowCell.CellValue.InnerText)].InnerText ?? string.Empty;
                        }
                        else
                        {
                            if (rowCell.CellValue != null && rowCell.CellValue.InnerText != null)
                            {
                                value = rowCell.CellValue.InnerText;
                            }
                            else
                            {
                                value = string.Empty;
                            }
                        }
                        properties[propertiesCounter].SetValue(generatedObject, value, null);
                        propertiesCounter++;
                    }
                    listObjectsCustomClasses.Add(generatedObject);
                }
                listObjects.Add(listObjectsCustomClasses);
            }
        }
        DateTime end = DateTime.UtcNow;
        Console.WriteLine("Measured time: " + (end - begin).TotalMinutes + " minutes.");
        return listObjects;
    }

However, whenever I read big xlsx files (with a size bigger than 30 MB), the aforementioned method takes a lot of time to execute. I have written this code to at least get the rows without digging deeper to get the cells in each row.

public static List<Object> ConvertExcelArchiveToListObjectsSAXApproach(string filePath)
    {
        DateTime begin = DateTime.UtcNow;
        List<Object> listObjects = new List<Object>();
        using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false))
        {
            WorkbookPart wbPart = spreadsheetDocument.WorkbookPart;
            Sheets theSheets = wbPart.Workbook.Sheets;

            SharedStringTablePart sstPart = spreadsheetDocument.WorkbookPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();
            SharedStringTable ssTable = null;
            if (sstPart != null)
                ssTable = sstPart.SharedStringTable;

            // Get the CellFormats for cells without defined data types
            WorkbookStylesPart workbookStylesPart = spreadsheetDocument.WorkbookPart.GetPartsOfType<WorkbookStylesPart>().First();
            CellFormats cellFormats = (CellFormats)workbookStylesPart.Stylesheet.CellFormats;
            var sheets = wbPart.Workbook.Sheets.Cast<Sheet>().ToList();

            foreach (WorksheetPart worksheetpart in wbPart.WorksheetParts)
            {
                //Worksheet worksheet = worksheetpart.Worksheet;
                OpenXmlPartReader reader = new OpenXmlPartReader(worksheetpart);
                bool firstRow = false;

                while (reader.Read())
                {
                    if (reader.ElementType == typeof(Row))
                    {
                       ...
                    }

                    if (reader.ElementType != typeof(Worksheet)) // Dont' want to skip the contents of the worksheet
                        reader.Skip(); // Skip contents of any node before finding the first row.
                }    DateTime end = DateTime.UtcNow;
        Console.WriteLine("Measured time: " + (end - begin).TotalMinutes + " minutes.");
        return listObjects;
    }

However, the breakpoint that I set within

if (reader.ElementType == typeof(Row))
                    {
                       ...
                    } 

does not even get hit. Any ideas as to what am I missing? Thank you!

Amadeus Sanchez
  • 2,375
  • 2
  • 25
  • 31

1 Answers1

0

Have you seen the code in the thread Using OpenXmlReader. The code is doing exactly what you are trying to do.

Community
  • 1
  • 1
Woric
  • 26
  • 3