
I am using the following code to take some data (in an XML-like format, not well-formed) from a .txt file and then write it to an .xlsx file using EPPlus, after doing some processing. StreamElements is basically a modified XmlReader. My question is about performance: I have made a couple of changes, but I don't see what else I can do. I'm going to use this for large datasets, so I'm trying to make it as efficient and fast as possible. Any help will be appreciated!

I tried using p.SaveAs() to do the Excel writing (a sketch of that variant is shown below), but I did not really see a performance difference. Are there better, faster ways to do the writing? Any suggestions are welcome.

using (ExcelPackage p = new ExcelPackage())
{
    ExcelWorksheet ws = p.Workbook.Worksheets[1];
    ws.Name = "data1";
    int rowIndex = 1; int colIndex = 1;

    foreach (var element in StreamElements(pa, "XML"))
    {
        var values = element.DescendantNodes().OfType<XText>()
            .Select(v => Regex.Replace(v.Value, "\\s+", " "));
        string[] data = string.Join(",", values).Split(',');

        data[2] = toDateTime(data[2]);

        for (int i = 0; i < data.Count(); i++)
        {
            if (rowIndex < 1000000)
            {
                var cell1 = ws.Cells[rowIndex, colIndex];
                cell1.Value = data[i];
                colIndex++;
            }
        }
        rowIndex++;
    }

    ws.Cells[ws.Dimension.Address].AutoFitColumns();

    Byte[] bin = p.GetAsByteArray();
    using (FileStream fs = File.OpenWrite("C:\\test.xlsx"))
    {
        fs.Write(bin, 0, bin.Length);
    }
}

Currently, it takes roughly 30-35 minutes to do the processing and write 1 million lines into an Excel worksheet.
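For reference, the SaveAs() variant I tried simply replaces the GetAsByteArray()/FileStream block at the end of the using with a direct save (a minimal sketch; the path is just the same example as above):

//EPPlus can write the package straight to disk; output is identical to GetAsByteArray() + FileStream
p.SaveAs(new FileInfo("C:\\test.xlsx"));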

  • Run the profiler and see which lines are taking the most time. – Scott Chamberlain Jul 14 '15 at 19:20
  • Not that it'll improve performance, but all the write-out code could be rewritten as `File.WriteAllBytes(@"c:\test.xlsx", p.GetAsByteArray());`. Have you profiled this at all? Where is the most time taken, in the writing, or in the processing? Consider using a compiled regex (see the sketch after these comments): https://msdn.microsoft.com/en-us/library/gg578045(v=vs.110).aspx Also, what is the point of this line: `string.Join(",", values).Split(',')`? – Ron Beyer Jul 14 '15 at 19:22
  • I'm not sure **AT ALL** (hence writing it as a comment), but you could probably get much better speed and memory utilization by actually using a forward only reader rather than `element.DescendantNodes...` and then the line `string[] data = string.Join(",", values).Split(',');` should be able to be done directly in the creation of `values` and no need for the `data` array... Again, no clue if I'm even correct, just thoughts that come to mind and I hope may help... – John Bustos Jul 14 '15 at 19:31
  • @RonBeyer Agreed, this line doesn't look right... I wrote `string.Join(",", values).Split(',')` a long time ago, but what is another way of getting an array of the strings in `values`? – sparta93 Jul 14 '15 at 19:45
  • `values.ToArray()` or you can just append it to the end of your `.Select(...)` statement and store the array in the `values` variable. – Ron Beyer Jul 14 '15 at 19:51
  • `string.Join(",", values).Split(',');` will cause any individual `value` that contains a comma to be split. `values.ToArray()` will not. Which do you want? You could do `values.SelectMany(s => s.Split(',')).ToArray()` if you want to split each item and get back an array. – dbc Jul 14 '15 at 19:55
  • @dbc either will work for me.. but which one is faster? – sparta93 Jul 14 '15 at 20:18
  • `values.ToArray()` will be faster than `values.SelectMany(s => s.Split(',')).ToArray()` since it does strictly less work, but the difference may be minuscule; there's no way for me to tell without profiling a real case. The real question is, which is correct? As long as you are certain you don't want to split the inner strings (add some unit tests to make sure), you can use the former. – dbc Jul 14 '15 at 20:23
  • Removed the "xml" tag. The fact that your proprietary data format has similarities to XML does not make this an XML question. – Michael Kay Jul 14 '15 at 21:07
  • How many columns are there? I am curious because of the line `colIndex++;`. You never reset it, so it will continue to grow with `rowIndex` (up to 1 million), resulting in a diagonal pattern in the Excel sheet. Not to mention that an Excel sheet of potentially 1 million x 1 million cells is pretty massive and probably not even possible. But like @ScottChamberlain said, profile this or step through the code to find the bottleneck. – Ernie S Jul 14 '15 at 23:40
  • @sparta93 To answer which one is faster, [go race your horses](http://ericlippert.com/2012/12/17/performance-rant/) and find out. Start up the profiler built into Visual Studio and get actual numbers about where your code is taking the most time instead of guessing. Take special note of *"Part the third: Is that really the bottleneck?"*; you ***really*** need to identify your bottleneck before you start optimizing. – Scott Chamberlain Jul 15 '15 at 00:06
  • Why are you writing 1 million lines to Excel? What human can make sense of 1 million lines? If this is for the purposes of transferring data, would a CSV (text file) work just as well? – Nick.Mc Jul 15 '15 at 00:38
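A note on the compiled-regex suggestion above: a minimal sketch of what that change could look like, reusing the pattern from the question (the field would live at class level; everything else is illustrative and also folds in the ToArray() suggestion):

//Class-level field: the pattern is parsed and compiled once, not on every Replace call
private static readonly Regex WhitespaceRegex = new Regex(@"\s+", RegexOptions.Compiled);

//Inside the loop: reuse the compiled instance and take the values straight to an array
var values = element.DescendantNodes().OfType<XText>()
    .Select(v => WhitespaceRegex.Replace(v.Value, " "))
    .ToArray();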

1 Answer


I've run into this issue before; Excel has a huge overhead when you modify worksheet cells individually, one by one.

The solution to this is to build a 2D object array and write it to the worksheet in a single range assignment, instead of setting cells one at a time.

using(ExcelPackage p = new ExcelPackage()) {
    ExcelWorksheet ws = p.Workbook.Worksheets[1];
    ws.Name = "data1";

    //Starting cell
    int startRow = 1;
    int startCol = 1;

    //Needed for 2D object array later on
    int maxColCount = 0;
    int maxRowCount = 0;

    //Queue data
    Queue<string[]> dataQueue = new Queue<string[]>();

    //Tried not to touch this part
    foreach(var element in StreamElements(pa, "XML")) {
        var values = element.DescendantNodes().OfType<XText>()
            .Select(v => Regex.Replace(v.Value, "\\s+", " "));

        //Removed unnecessary split and join, use ToArray instead
        string[] eData = values.ToArray();
        eData[2] = toDateTime(eData[2]);

        //Push the data to queue and increment counters (if needed)
        dataQueue.Enqueue(eData);

        if(eData.Length > maxColCount)
            maxColCount = eData.Length;

        maxRowCount++;
    }

    //We now have the dimensions needed for our object array
    object[,] excelArr = new object[maxRowCount, maxColCount];

    //Dequeue data from Queue and populate object matrix
    int i = 0;
    while(dataQueue.Count > 0){
        string[] eData = dataQueue.Dequeue();

        for(int j = 0; j < eData.Length; j++){
            excelArr[i, j] = eData[j];
        }

        i++;
    }

    //Write the whole array to the worksheet in a single range assignment (EPPlus range, not Interop)
    ws.Cells[startRow, startCol, startRow + maxRowCount - 1, startCol + maxColCount - 1].Value = excelArr;

    //Tried not to touch this stuff
    ws.Cells[ws.Dimension.Address].AutoFitColumns();

    Byte[] bin = p.GetAsByteArray();
    using(FileStream fs = File.OpenWrite("C:\\test.xlsx")) {
        fs.Write(bin, 0, bin.Length);
    }
}

I didn't try compiling this code, so double-check the indexing used and watch for any small syntax errors.
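As an alternative to hand-building the 2D array, EPPlus also exposes a LoadFromArrays method on a range, which accepts a sequence of object[] rows; array covariance lets you pass the Queue<string[]> directly. A minimal sketch, assuming your EPPlus version includes it:

//Writes every queued row in one call, starting at A1; no need to precompute maxRowCount/maxColCount
ws.Cells["A1"].LoadFromArrays(dataQueue);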

A few extra pointers to consider for performance:

  • Try to parallelize the population of the object array, since it is primarily index-based (maybe keep a Dictionary<int, string[]> with an index tracker and look it up there for faster population). You would likely have to trade space for time; see the sketch after this list.
  • See if you are able to hardcode the column and row counts, or figure them out quickly. In my code fix, I've set counters to find the maximum rows and columns on the fly; I wouldn't recommend that as a permanent solution.
  • AutoFitColumns is very costly, especially if you're dealing with over a million rows.
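For the first pointer, here is a minimal sketch of parallelizing the array population; it assumes the rows from dataQueue have been copied into an index-addressable list and that maxRowCount/maxColCount are computed as above:

//Requires System.Linq and System.Threading.Tasks
List<string[]> rows = dataQueue.ToList();
object[,] excelArr = new object[maxRowCount, maxColCount];

//Each row writes to its own slice of the array, so no locking is needed
Parallel.For(0, rows.Count, i => {
    string[] eData = rows[i];
    for (int j = 0; j < eData.Length; j++) {
        excelArr[i, j] = eData[j];
    }
});

On the AutoFitColumns point: if fixed column widths are acceptable, setting ws.Column(i).Width explicitly skips the per-cell measurement pass entirely.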
  • I was trying this, but it cannot find a definition for `get_Range`. It seems like that is not something defined in EPPlus. – sparta93 Jul 15 '15 at 14:01
  • @sparta93 see my updated answer, also make sure you reference `Microsoft.Office.Tools.Excel` – Ramie Jul 15 '15 at 20:43