I'm writing a console app that extracts data from many xls files and lumps them into one large csv. I currently load and tear down the Excel application (including the GC calls) for each file, and I worry this is making the application take longer than it needs to.

Here is my code:

using System;
using System.IO;
using System.Runtime.InteropServices;
using Excel = Microsoft.Office.Interop.Excel;

namespace SubconDataExtractToCSV
{
    class Program
    {
        static void Main(string[] args)
        {
            StreamWriter outputCSV = new StreamWriter(@"C:\Users\mbelmer\Desktop\Helpful Files\Subcon Files\SubconEmailExtract\ExtractData\ExtractedData.csv", true);
            DirectoryInfo folder = new DirectoryInfo(@"C:\Users\mbelmer\Desktop\Helpful Files\Subcon Files\SubconEmailExtract\");
            FileInfo[] files = folder.GetFiles("*.xls");
            foreach (var file in files)
            {
                ExtractData(file, outputCSV);
            }
            outputCSV.Close();
        }

        static void ExtractData(FileInfo filename, StreamWriter output)
        {
            Excel.Application xlApp = new Excel.Application();
            Excel.Workbook xlWorkbook = xlApp.Workbooks.Open(filename.FullName);
            Excel._Worksheet xlWorksheet = (Excel._Worksheet)xlWorkbook.Sheets[1];
            Excel.Range xlRange = xlWorksheet.UsedRange;
            int rowCount = xlRange.Rows.Count;
            int colCount = xlRange.Columns.Count;
            string fileDate = filename.Name.Substring(0, 7);
            if (filename.Name.Contains("WIP"))
            {
                //EXTRACT MMT
                for(int i = 2; i <= rowCount; i++)
                {
                    string dataLine = fileDate + ",";
                    for(int j = 1; j <= 10; j++)
                    {
                        if (xlRange.Cells[i, j].Value2 != null)
                        {
                            dataLine += xlRange.Cells[i, j].Value2 + ",";
                        }
                        else
                        {
                            dataLine += ",";
                        }
                    }
                    output.WriteLine(dataLine);
                }

            }
            else
            {
                //EXTRACT AMKOR
                for(int i = 2; i <= rowCount; i++)
                {
                    string dataLine = fileDate + ",,,,,,,,,,,";
                    for(int j = 1; j <= colCount; j++)
                    {
                        if(xlRange.Cells[i,j].Value2 != null)
                        {
                            dataLine += xlRange.Cells[i, j].Value2 + ",";
                        }
                        else
                        {
                            dataLine += ",";
                        }
                    }
                    output.WriteLine(dataLine);
                }
            }
            GC.Collect();
            GC.WaitForPendingFinalizers();
            Marshal.ReleaseComObject(xlRange);
            Marshal.ReleaseComObject(xlWorksheet);
            xlWorkbook.Close();
            Marshal.ReleaseComObject(xlWorkbook);
            xlApp.Quit();
            Marshal.ReleaseComObject(xlApp);
        }
    }
}

Should I put the application open and close outside the initial foreach? It's currently going through 38 files and takes just over 6 minutes to do so. Thank you!

  • Read the Speed rant: https://ericlippert.com/2012/12/17/performance-rant/ | I added Office Interop to the tags, as this is an area where I just say "avoid Office Interop. Use OpenXML." – Christopher Nov 27 '19 at 20:32
  • Can I ask what you have used to come to the conclusion of `not efficiently`; for example: benchmarking or profiling? – Trevor Nov 27 '19 at 20:37
  • Ah I forgot to mention it's currently processing 38 files and takes a little over 6 minutes to do so. – Mitchel Belmer Nov 27 '19 at 20:50
  • Try it both ways and see which one is faster – robbpriestley Nov 27 '19 at 20:56
  • Tried OpenXML but it was picky about filetypes and didn't easily let me iterate through cells and rows. After research I found ExcelDataReader and ExcelDataReader.DataSet. I rewrote the application, and what would have taken over 10 minutes with Office Interop took legit under a second with ExcelDataReader. USE THIS – Mitchel Belmer Dec 06 '19 at 14:45
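
For reference, here is a minimal sketch of the ExcelDataReader approach mentioned in the last comment (assuming the ExcelDataReader and ExcelDataReader.DataSet NuGet packages; the paths and the first-sheet choice are illustrative, not from the original post):

using System.Data;
using System.IO;
using ExcelDataReader;

class Extractor
{
    static void Main()
    {
        // Needed on .NET Core / .NET 5+ so the legacy .xls code pages resolve
        // (from the System.Text.Encoding.CodePages package).
        System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);

        using (var output = new StreamWriter(@"C:\Temp\ExtractedData.csv", true))    // hypothetical path
        {
            foreach (var file in Directory.GetFiles(@"C:\Temp\Subcon", "*.xls"))     // hypothetical path
            {
                using (var stream = File.Open(file, FileMode.Open, FileAccess.Read))
                using (var reader = ExcelReaderFactory.CreateReader(stream))
                {
                    DataTable table = reader.AsDataSet().Tables[0];   // first worksheet
                    foreach (DataRow row in table.Rows)
                    {
                        output.WriteLine(string.Join(",", row.ItemArray));
                    }
                }
            }
        }
    }
}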

1 Answer

As others have mentioned, interop is quirky and usually very slow, especially with large files. However, I am aware that sometimes you have no other option.

One possible issue I have seen before is with the “UsedRange” property. Ranges are notoriously slow when used in a loop. I found this out when trying to remove the “apparently” empty rows and columns returned from a “UsedRange” on a large file.

When a user formats cells in Excel… say by selecting a whole row or column and setting some type of formatting for that selected range… then the “UsedRange” is going to pick this up. You may have seen this where “UsedRange” returns a bunch of empty rows at the bottom or empty columns on the right. This is because, even though a cell may be empty, any cells that HAVE some type of formatting will be picked up in the UsedRange property.
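
As a quick illustration (a sketch only; the workbook path is hypothetical, and it assumes using System; using Excel = Microsoft.Office.Interop.Excel;), you can compare what UsedRange reports against the last cell that actually holds a value:

// Compare UsedRange's row count with the last row that really contains data.
Excel.Application xlApp = new Excel.Application();
Excel.Workbook xlWorkbook = xlApp.Workbooks.Open(@"C:\Temp\SomeFile.xls");   // hypothetical path
Excel.Worksheet xlWorksheet = (Excel.Worksheet)xlWorkbook.Sheets[1];

Console.WriteLine("UsedRange rows: " + xlWorksheet.UsedRange.Rows.Count);

// Search backwards through the sheet for the last cell with actual content.
Excel.Range lastCell = xlWorksheet.Cells.Find("*",
    SearchOrder: Excel.XlSearchOrder.xlByRows,
    SearchDirection: Excel.XlSearchDirection.xlPrevious);
Console.WriteLine("Last row with data: " + (lastCell != null ? lastCell.Row.ToString() : "none"));

xlWorkbook.Close(false);
xlApp.Quit();

If formatting extends past the data, the first number will be larger than the second.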

It was these “apparent” empty rows and columns that revealed where Excel interop is slow. You can check my answer, and the additional help from other users, on Fastest method to remove Empty rows and Columns From Excel Files using Interop; it may help.

In my test, I tried to use the same strategy I used in the answer above. Basically, it goes like this…

If you use a Range in a loop, as your code does with the line…

dataLine += xlRange.Cells[i, j].Value2

…and in a few other places, it causes a big slowdown. Every cell access through the range (xlRange) is a separate call into Excel, which is notoriously slow.

Fortunately, there is a not-so-difficult solution. In my previous solution to removing the empty rows and columns, the problem was solved by NOT using “ranges” in any loops.

Instead of DIRECTLY using the Range from the UsedRange property… CAST that used range to a two (2) dimensional object array. Then loop through the object array instead of the Excel Range. This should at least speed things up considerably.

For example, change the line where you get the used range…

Excel.Range xlRange = xlWorksheet.UsedRange;

TO

Object[,] dataArray = (Object[,])xlWorksheet.UsedRange.Cells.Value2;

Then loop through the dataArray instead of the Excel Range. Note that, since the dataArray is coming from an Excel.Range, it will not be zero (0) based; the first index will be one (1). You will see this in my example.
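
A quick way to confirm this (a small sketch, continuing from the cast above): the array that comes back from Value2 reports a lower bound of one (1) on both dimensions:

Object[,] dataArray = (Object[,])xlWorksheet.UsedRange.Cells.Value2;
Console.WriteLine(dataArray.GetLowerBound(0));   // prints 1 (rows start at index 1)
Console.WriteLine(dataArray.GetLowerBound(1));   // prints 1 (columns start at index 1)
Console.WriteLine(dataArray.GetLength(0));       // number of rows in the used range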

Below is a small test using your code and comparing it, speed-wise, against my code using the object array as described above. In my test, using two Excel files each with 5 columns and 1000 rows, the stopwatch shows a significant difference: your code took approximately 51,000 milliseconds, while mine took about 2,100 milliseconds.

The code is similar to yours; however, you can see that in the loops I used the object array instead of the Excel Range.

Lastly, I could be mistaken; however, I would do more research on calling the Garbage Collector…

GC.Collect();
GC.WaitForPendingFinalizers();

I do not think you are going to GAIN anything by calling the GC.
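
If you do decide to keep those calls, a commonly suggested ordering (a sketch, not something from the original post) is to release every COM reference first and collect afterwards, so the runtime callable wrappers are actually eligible for cleanup:

// Release the most-derived objects first, then close and quit,
// and only then ask the GC to sweep up any remaining wrappers.
Marshal.ReleaseComObject(xlRange);
Marshal.ReleaseComObject(xlWorksheet);
xlWorkbook.Close(false);
Marshal.ReleaseComObject(xlWorkbook);
xlApp.Quit();
Marshal.ReleaseComObject(xlApp);

GC.Collect();
GC.WaitForPendingFinalizers();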

I hope this makes sense and helps.

I used a WinForms application, where the button click method takes the place of the Main method in your code.

// Requires: using System.IO; using System.Text; using System.Diagnostics;
// using System.Runtime.InteropServices; using Excel = Microsoft.Office.Interop.Excel;
private void button1_Click(object sender, EventArgs e) {
  StreamWriter outputCSV = new StreamWriter(@"D:\Test\Excel_Test\export.csv", true);
  DirectoryInfo folder = new DirectoryInfo(@"D:\Test\Excel_Test\");
  FileInfo[] files = folder.GetFiles("*.xlsx");
  int count = 1;
  Stopwatch sw = new Stopwatch();
  sw.Start();
  foreach (var file in files) {
    //ExtractData(file, outputCSV);
    ExtractData3(file, outputCSV);
    count++;
  }
  sw.Stop();
  MessageBox.Show("Done! It took: " + sw.ElapsedMilliseconds + " milliseconds");
  outputCSV.Close();
}

private void ExtractData3(FileInfo filename, StreamWriter output) {
  Excel.Application xlApp = new Excel.Application();
  Excel.Workbook xlWorkbook = xlApp.Workbooks.Open(filename.FullName);
  Excel._Worksheet xlWorksheet = (Excel._Worksheet)xlWorkbook.Sheets[1];
  // Note: if the used range is a single cell, Value2 returns a scalar and this cast will fail.
  Object[,] dataArray = (Object[,])xlWorksheet.UsedRange.Cells.Value2;
  string fileDate = filename.Name.Substring(0, 7);
  if (filename.Name.Contains("WIP")) {
    //EXTRACT MMT (note: the original code copied only the first 10 columns in this branch)
    fileDate = fileDate + ",";
  }
  else {
    //EXTRACT AMKOR
    fileDate = fileDate + ",,,,,,,,,,,";
  }
  WriteData(dataArray, fileDate, output);
  Marshal.ReleaseComObject(xlWorksheet);
  xlWorkbook.Close();
  Marshal.ReleaseComObject(xlWorkbook);
  xlApp.Quit();
  Marshal.ReleaseComObject(xlApp);
}

private void WriteData(Object[,] dataArray, string prefix, StreamWriter output) {
  int rowCount = dataArray.GetLength(0);
  int colCount = dataArray.GetLength(1);
  StringBuilder sb = new StringBuilder();
  // The array from Value2 is 1-based, so valid indexes run from 1 to GetLength(n).
  // Note: the original question skipped a header row by starting at row 2.
  for (int i = 1; i <= rowCount; i++) {
    sb.Append(prefix);   // prefix already ends with a comma (set in ExtractData3)
    for (int j = 1; j <= colCount; j++) {
      if (dataArray[i, j] != null) {
        sb.Append(dataArray[i, j]);
      }
      if (j < colCount) {
        sb.Append(",");
      }
    }
    output.WriteLine(sb.ToString());
    sb.Clear();
  }
}
JohnG