0

I have a very large xlsx file (6 sheets, 1 million rows on every sheet, 19 columns on each). The idea is to read all rows in all sheets and populate a database with Entity Framework.

I have tried the MS Open XML SDK, but it is really, really slow. Just to iterate over all rows and cells (in a SAX way), my program would probably run for a month or more.

I have also tried this library: https://github.com/ExcelDataReader/ExcelDataReader . Initially, reading with this library starts much faster (around 100 rows in 2 seconds with DB inserts), but then it gets really slow (after 50,000 rows, 1 minute and more). Maybe that speed decrease is caused by Entity Framework. I did not test that thoroughly (I only tested roughly the first 1000 rows without DB access), but even if it is, reading without anything else is just too slow.

Does anyone have any idea how I can speed up reading the xlsx? Also, should I abandon Entity Framework and do the inserts on my own?

Currently my code looks something like this (there are things like columnMapper.Populate(tmpColumns, columns); and columnMapper.GetColumnId because different sheets have different column orders and even different numbers of columns):

using (FileStream stream = File.Open(inputFilePath, FileMode.Open, FileAccess.Read))
{
    IExcelDataReader excelReader = ExcelReaderFactory.CreateOpenXmlReader(stream);
    DataSet result = excelReader.AsDataSet();

    int tableNo = 0;
    string[] tmpColumns = new string[20];
    Console.WriteLine("Start processing...");
    foreach (DataTable table in result.Tables)
    {
        // skip bad table
        if (++tableNo > 5)
        {
            continue;
        }

        for (int i = 0; i < tmpColumns.Length; ++i)
        {
            tmpColumns[i] = string.Empty;
        }

        int columns = 0;
        var rowEnumerator = table.Rows.GetEnumerator();
        rowEnumerator.MoveNext();
        foreach (object item in ((DataRow)rowEnumerator.Current).ItemArray)
        {
            // ItemArray is object[]; casting each cell to string directly would
            // throw for non-string cells, so convert explicitly
            tmpColumns[columns++] = item.ToString();
        }

        columnMapper.Populate(tmpColumns, columns);

        int rowNumber = 0;
        while (rowEnumerator.MoveNext())
        {
            var row = (DataRow)rowEnumerator.Current;
            int col = 0;
            foreach (object item in row.ItemArray)
            {
                tmpColumns[columnMapper.GetColumnId(col++)] = item.ToString();
            }

            var newBoxData = new BoxData()
            {
                // ...
                Year = tmpColumns[4],
                RetentionPeriod = tmpColumns[6],
                ContractNumber = tmpColumns[7],
                Mbr = tmpColumns[8],
                CardId = tmpColumns[10],
                Package = tmpColumns[12],
                UnitType = tmpColumns[13],

                MFCBox = tmpColumns[14],
                DescriptionNameAndSurname = tmpColumns[15],
                DateFromTo = tmpColumns[16],
                OrderNumberRange = tmpColumns[17],
                PartnerCode = tmpColumns[18],
            };

            db.BoxesDatas.Add(newBoxData);

            if (++rowNumber % SaveAfterCount == 0)
            {
                db.SaveChanges();
                Console.WriteLine(rowNumber);
            }
        }
    }

    db.SaveChanges();
}
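A side note on the reading half of the problem: AsDataSet() materializes all six sheets in memory before the loop even starts. Since IExcelDataReader implements IDataReader, rows can instead be streamed one at a time. A minimal sketch under that assumption (reusing inputFilePath from above; the header and row handling are left as placeholders):

```csharp
using (FileStream stream = File.Open(inputFilePath, FileMode.Open, FileAccess.Read))
using (IExcelDataReader reader = ExcelReaderFactory.CreateOpenXmlReader(stream))
{
    do
    {
        bool isHeaderRow = true;
        while (reader.Read())                     // one row at a time, no DataSet
        {
            if (isHeaderRow)
            {
                // read header cells with reader.GetValue(i) and feed columnMapper here
                isHeaderRow = false;
                continue;
            }

            for (int i = 0; i < reader.FieldCount; ++i)
            {
                object cell = reader.GetValue(i); // may be null for empty cells
                // map the cell through columnMapper and buffer the row here
            }
        }
    } while (reader.NextResult());                // NextResult() advances to the next sheet
}
```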

EDIT: The solution was to remove Entity Framework and to insert the data with plain SQL commands. After leaving the MS libraries (Entity Framework, Open XML SDK) and replacing them with ExcelDataReader and plain SQL inserts, everything worked much faster: ~2,000,000 rows were extracted and inserted into the DB in less than 20 minutes.
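For anyone landing here later, a rough sketch of what the plain-SQL path can look like: buffer the parsed rows in a DataTable and flush them with SqlBulkCopy. This is a hypothetical reconstruction, not the exact code; the table name BoxesDatas, connectionString, and the column subset are placeholders:

```csharp
// Hypothetical sketch: buffer parsed rows into a DataTable and bulk-insert them.
var buffer = new DataTable();
buffer.Columns.Add("Year", typeof(string));
buffer.Columns.Add("ContractNumber", typeof(string));
// ... add the remaining columns the same way ...

using (var connection = new SqlConnection(connectionString))   // placeholder connection string
{
    connection.Open();
    using (var bulk = new SqlBulkCopy(connection))
    {
        bulk.DestinationTableName = "BoxesDatas";              // placeholder table name
        bulk.BatchSize = 10000;

        // In the row loop, instead of db.BoxesDatas.Add(...):
        //     buffer.Rows.Add(tmpColumns[4], tmpColumns[7] /*, ... */);
        //     if (buffer.Rows.Count >= 10000) { bulk.WriteToServer(buffer); buffer.Clear(); }

        if (buffer.Rows.Count > 0)
        {
            bulk.WriteToServer(buffer);                        // flush the remainder
        }
    }
}
```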

Igor

2 Answers

2

Refer to these: Read Excel Sheet Data into DataTable (Code Project) and Best/Fastest way to read an Excel Sheet into a DataTable.

Jaimesh
  • Didn't check the OLEDB approach because that approach is limited by the necessity of an Office installation on the client (yes, I know, reading xlsx on a computer without Office does not make much sense). Also, your first link is not good. Nevertheless, upvoted – Igor May 17 '16 at 12:36
0

After solving the Excel performance you will also have EF performance issues.
The first step is to create a new DbContext from time to time (i.e. after each SaveChanges).
The second step is to use a different ORM.
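A minimal sketch of the first step, assuming EF6; MyDbContext and boxDataRows are placeholders, and SaveAfterCount is the batch size from the question:

```csharp
// Sketch: recreate the DbContext periodically so the change tracker stays small.
var db = new MyDbContext();
db.Configuration.AutoDetectChangesEnabled = false;   // EF6: skip change detection on every Add
int rowNumber = 0;

foreach (var newBoxData in boxDataRows)              // placeholder row source
{
    db.BoxesDatas.Add(newBoxData);

    if (++rowNumber % SaveAfterCount == 0)
    {
        db.SaveChanges();
        db.Dispose();                                // discard the tracked entities
        db = new MyDbContext();
        db.Configuration.AutoDetectChangesEnabled = false;
    }
}

db.SaveChanges();
db.Dispose();
```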

bubi
  • and the third step is to use plain SQL commands :) After leaving the MS libraries (Entity Framework, Open XML SDK), everything works really fast :) Thank you for the suggestion – Igor May 17 '16 at 10:12