8

I need to export a large amount of data (115,000 rows × 30 columns) in the Excel OpenXML format (xlsx). I have tried several libraries, such as DocumentFormat.OpenXml, ClosedXML, and NPOI.

With each of them, an OutOfMemoryException is thrown because the in-memory representation of the sheet causes memory usage to grow enormously.

Even closing the document file every 1,000 rows (and releasing memory) does not help: the next load increases memory usage again.

Is there a more efficient way to export data to xlsx without using so much memory?

Christian.K
Gianluigi Liguori

3 Answers

27

The OpenXML SDK is the right tool for this job, but you need to be careful to use the SAX (Simple API for XML) approach rather than the DOM approach. From the linked Wikipedia article on SAX:

Where the DOM operates on the document as a whole, SAX parsers operate on each piece of the XML document sequentially

This vastly reduces the amount of memory consumed when handling large Excel files.

There's a good article on it here - http://polymathprogrammer.com/2012/08/06/how-to-properly-use-openxmlwriter-to-write-large-excel-files/

Adapted from that article, here's an example that outputs 115k rows with 30 columns:

public static void LargeExport(string filename)
{
    using (SpreadsheetDocument document = SpreadsheetDocument.Create(filename, SpreadsheetDocumentType.Workbook))
    {
        //this list of attributes will be used when writing a start element
        List<OpenXmlAttribute> attributes;
        OpenXmlWriter writer;

        document.AddWorkbookPart();
        WorksheetPart workSheetPart = document.WorkbookPart.AddNewPart<WorksheetPart>();

        writer = OpenXmlWriter.Create(workSheetPart);            
        writer.WriteStartElement(new Worksheet());
        writer.WriteStartElement(new SheetData());

        for (int rowNum = 1; rowNum <= 115000; ++rowNum)
        {
            //create a new list of attributes
            attributes = new List<OpenXmlAttribute>();
            // add the row index attribute to the list
            attributes.Add(new OpenXmlAttribute("r", null, rowNum.ToString()));

            //write the row start element with the row index attribute
            writer.WriteStartElement(new Row(), attributes);

            for (int columnNum = 1; columnNum <= 30; ++columnNum)
            {
                //reset the list of attributes
                attributes = new List<OpenXmlAttribute>();
                // add data type attribute - in this case inline string (you might want to look at the shared strings table)
                attributes.Add(new OpenXmlAttribute("t", null, "str"));
                //add the cell reference attribute
                attributes.Add(new OpenXmlAttribute("r", "", string.Format("{0}{1}", GetColumnName(columnNum), rowNum)));

                //write the cell start element with the type and reference attributes
                writer.WriteStartElement(new Cell(), attributes);
                //write the cell value
                writer.WriteElement(new CellValue(string.Format("This is Row {0}, Cell {1}", rowNum, columnNum)));

                // write the end cell element
                writer.WriteEndElement();
            }

            // write the end row element
            writer.WriteEndElement();
        }

        // write the end SheetData element
        writer.WriteEndElement();
        // write the end Worksheet element
        writer.WriteEndElement();
        writer.Close();

        writer = OpenXmlWriter.Create(document.WorkbookPart);
        writer.WriteStartElement(new Workbook());
        writer.WriteStartElement(new Sheets());

        writer.WriteElement(new Sheet()
        {
            Name = "Large Sheet",
            SheetId = 1,
            Id = document.WorkbookPart.GetIdOfPart(workSheetPart)
        });

        // End Sheets
        writer.WriteEndElement();
        // End Workbook
        writer.WriteEndElement();

        writer.Close();

        document.Close();
    }
}

//A simple helper to get the column name from the column index. This is not well tested!
private static string GetColumnName(int columnIndex)
{
    int dividend = columnIndex;
    string columnName = String.Empty;
    int modifier;

    while (dividend > 0)
    {
        modifier = (dividend - 1) % 26;
        columnName = Convert.ToChar(65 + modifier).ToString() + columnName;
        dividend = (int)((dividend - modifier) / 26);
    }

    return columnName;
}
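Since the helper above is flagged as "not well tested", here are a few quick spot checks against Excel's actual column lettering (column indices are 1-based, so `Z` is 26 and the next column wraps to `AA`):

```csharp
// Quick sanity checks for the GetColumnName helper defined above.
Console.WriteLine(GetColumnName(1));   // A
Console.WriteLine(GetColumnName(26));  // Z
Console.WriteLine(GetColumnName(27));  // AA
Console.WriteLine(GetColumnName(52));  // AZ
Console.WriteLine(GetColumnName(703)); // AAA
```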
petelids
  • This is the best solution ever!!!! Exporting a file of 500,000 rows × 400 columns uses about 60 MB of memory on average – Gianluigi Liguori Oct 13 '15 at 18:05
  • I'm glad I could help @GianluigiLiguori – petelids Oct 14 '15 at 12:52
  • @petelids I tried your code and it works! Libraries like EPPlus and CsvHelper+CsvHelper.Excel fail or have memory leaks. Do you have reference info for the GetColumnName method (why these numbers?...), or has the method been tested properly? Many thanks – Ricardo stands with Ukraine May 19 '17 at 15:51
  • I'm glad it works well for you @Riga. I'm pretty sure that `GetColumnName` is correct but I've not written a bunch of tests for it to confirm. – petelids May 19 '17 at 16:06
  • What is the 26 magic number? I can't follow the logic in GetColumnName, and it would be good if I need to explain it to my boss :) – Ricardo stands with Ukraine May 19 '17 at 16:16
  • @Riga - `26` is just the number of letters in the alphabet; when we get to `Z`, the next column becomes `AA`, then from `AZ` we go to `BA`, etc. `65` is the ASCII value for `A`... I don't have any other answers that explain that logic, but I do have an answer that explains it the other way around (i.e. converting a cell reference to a column index), which might help explain things a bit more. That can be found [here](http://stackoverflow.com/a/31035990/3791802). – petelids May 19 '17 at 16:28
  • Can we add a style index with this approach? – Justin Sam S Aug 29 '22 at 12:45
  • Can anyone show how to add multiple sheets to the Excel document using this technique? – Hugo Jun 12 '23 at 18:07
1

Excel is capable of opening pretty large files, as long as your computer has enough memory. That is, most of the time, the limiting factor...

99% of the libraries out there were not built to handle large data sets, and you will end up with out-of-memory errors if you try to throw too much data at them.

A few of them, like Spout, which I created, were built to solve exactly this problem. The trick is to stream the data and avoid storing things in memory. I am not sure which language you are using (not PHP, it seems), but there may be a similar library for your language. If not, you can still take a look at Spout - it's open source - and port it to your language.
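The same streaming idea can be sketched in .NET without any spreadsheet library at all, by writing the sheet XML directly into the zip package with `XmlWriter`. This is only an illustrative sketch of the technique: it writes just the worksheet part, while a valid xlsx package also needs `[Content_Types].xml`, `workbook.xml`, and the relationship parts, which are omitted here.

```csharp
using System.IO.Compression;
using System.Xml;

public static class StreamingSketch
{
    // Rough sketch of the streaming trick: each row is written straight
    // into the zip entry as it is produced, so memory use stays flat
    // regardless of the number of rows. Only the worksheet part is
    // written; a complete xlsx needs several more parts.
    public static void StreamSheetPart(string path, int rowCount)
    {
        using (ZipArchive zip = ZipFile.Open(path, ZipArchiveMode.Create))
        using (XmlWriter xml = XmlWriter.Create(
                   zip.CreateEntry("xl/worksheets/sheet1.xml").Open()))
        {
            xml.WriteStartElement("worksheet",
                "http://schemas.openxmlformats.org/spreadsheetml/2006/main");
            xml.WriteStartElement("sheetData");
            for (int r = 1; r <= rowCount; r++)
            {
                xml.WriteStartElement("row");
                xml.WriteStartElement("c");
                xml.WriteAttributeString("t", "str");     // inline string cell
                xml.WriteElementString("v", "Row " + r);  // cell value
                xml.WriteEndElement();                    // </c>
                xml.WriteEndElement();                    // </row>
            }
            xml.WriteEndElement(); // </sheetData>
            xml.WriteEndElement(); // </worksheet>
        }
    }
}
```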

Adrien
  • 1,929
  • 1
  • 13
  • 23
  • I agree with you that the solution is to stream the data (avoiding loading the full worksheet representation into memory). Your Spout is really interesting, but unfortunately the language I use is C#, and a porting effort would be too expensive for me – Gianluigi Liguori Sep 21 '15 at 17:36
  • @GianluigiLiguori – perhaps there is a way to install PHP and use the library directly. – miroxlav Sep 21 '15 at 18:40
  • Obviously, but I'm looking for a native .NET solution – Gianluigi Liguori Sep 21 '15 at 19:35
  • Have you tried this: http://www.microsoft.com/en-us/download/details.aspx?id=5124 ? It's built by Microsoft and seems to support large spreadsheets – Adrien Sep 21 '15 at 21:33
  • Yes, it causes the same problem: ClosedXML itself uses OpenXML under the hood – Gianluigi Liguori Sep 22 '15 at 14:39
-1

It looks like you are using a spreadsheet where a database should be used. Spreadsheets have their limitations, and this can easily be one of them. Read further only if you absolutely need to stick with your existing solution. However, I do not recommend it, because there is one more question: if Excel is unable to save such a large file, will it be able to open one?

So if you cannot switch to a database platform, and the standard libraries you mentioned above are internally incapable of processing such an amount of data, then perhaps you are on your own when creating a large XLSX. I mean, for example, this approach:

  1. export your data in batches (of 1,000 or 10,000 or whatever works) to separate files for each batch
  2. create a tool (in any language with solid XML libraries; C# would be closest to your existing solution) which joins the separate files into one. It involves:

    1. extracting the XML from each XLSX (typically file.xlsx\xl\worksheets\sheet1.xml and file.xlsx\xl\sharedStrings.xml)
    2. gluing these parts together with an XML manipulation library (this shouldn't crash with an OutOfMemoryException because you are no longer working with complex spreadsheet objects)
    3. repacking the result files back into the main XLSX (you can take the first batch's output file as the main XLSX)
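The "gluing" step could be sketched in C# roughly as follows. This is an illustrative sketch, not a tested implementation: it assumes each batch file keeps its sheet at `xl/worksheets/sheet1.xml`, and it does not renumber the `r` row-index attributes across batches, which a real merge would have to do.

```csharp
using System.IO.Compression;
using System.Xml;

public static class SheetMerger
{
    // Stream the <row> elements out of each batch file's sheet XML into
    // one combined writer, without ever building a spreadsheet object
    // model in memory. Row indices ("r" attributes) would still need
    // renumbering across batches - omitted here for brevity.
    public static void MergeSheetRows(string[] batchFiles, XmlWriter combined)
    {
        foreach (string batch in batchFiles)
        {
            using (ZipArchive zip = ZipFile.OpenRead(batch))
            using (XmlReader reader = XmlReader.Create(
                       zip.GetEntry("xl/worksheets/sheet1.xml").Open()))
            {
                while (reader.Read())
                {
                    // Copy each complete <row> subtree to the combined output.
                    if (reader.NodeType == XmlNodeType.Element &&
                        reader.LocalName == "row")
                    {
                        combined.WriteNode(reader.ReadSubtree(), false);
                    }
                }
            }
        }
    }
}
```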

I have shown you a possible way to accomplish the result, but I would avoid it. Excel was never a platform for storing large amounts of data. Compared to the above task, it might be easier to convince management that it is time to change the tools/processes in this area.

miroxlav