I have very large xlsx file (6 sheets, 1 million rows on every sheet, 19 columns on each). The idea is to read all rows in all sheets and populate database with entity framework.
I have tried ms open xml sdk but its really really slow. Just to iterate over all rows and cells (in a sax way) I would probably run my program for one month or maybe more.
I have also tried this library https://github.com/ExcelDataReader/ExcelDataReader . Initially, reading with this library starts much faster (around 100 rows in 2 seconds with db inserts), but than it goes really slow (after 50000 rows, 1 minute and more). Maybe that speed decrease cause entity framework - i did not test that (I only test maybe first 1000 rows without db access), but even if it is, reading without anything is just too slow.
Does anyone have any idea how can I speed up reading of xlsx? Also, should I abandon entity framework and do inserts on my own?
Currently my code look something like (there are things like columnMapper.Populate(tmpColumns, columns);
and columnMapper.getCOlumnId
because different sheets have different order of columns and even different number of columns)
using (FileStream stream = File.Open(inputFilePath, FileMode.Open, FileAccess.Read))
{
IExcelDataReader excelReader = ExcelReaderFactory.CreateOpenXmlReader(stream);
DataSet result = excelReader.AsDataSet();
int tableNo = 0;
string[] tmpColumns = new string[20];
Console.WriteLine("Start processing...");
foreach (DataTable table in result.Tables)
{
// skip bad table
if (++tableNo > 5)
{
continue;
}
for (int i = 0; i < tmpColumns.Length; ++i)
{
tmpColumns[i] = string.Empty;
}
int columns = 0;
var rowEnumerator = table.Rows.GetEnumerator();
rowEnumerator.MoveNext();
foreach (string item in ((DataRow)rowEnumerator.Current).ItemArray)
{
tmpColumns[columns++] = item;
}
columnMapper.Populate(tmpColumns, columns);
int rowNumber = 0;
while (rowEnumerator.MoveNext())
{
var row = (DataRow)rowEnumerator.Current;
int col = 0;
foreach (object item in row.ItemArray)
{
tmpColumns[columnMapper.GetColumnId(col++)] = item.ToString();
}
var newBoxData = new BoxData()
{
...
Year = tmpColumns[4],
RetentionPeriod = tmpColumns[6],
ContractNumber = tmpColumns[7],
Mbr = tmpColumns[8],
CardId = tmpColumns[10],
Package = tmpColumns[12],
UnitType = tmpColumns[13],
MFCBox = tmpColumns[14],
DescriptionNameAndSurname = tmpColumns[15],
DateFromTo = tmpColumns[16],
OrderNumberRange = tmpColumns[17],
PartnerCode = tmpColumns[18],
};
db.BoxesDatas.Add(newBoxData);
if (++rowNumber % SaveAfterCount == 0)
{
db.SaveChanges();
Console.WriteLine(rowNumber);
}
}
}
db.SaveChanges();
EDIT: Solution was to remove entity fw and to insert data with plain sql commands. So, after leaving ms libraries (entity fw, open xml sdk) and replace them with ExcelDataReader and plain sql inserts everything worked much faster. ~2,000,000 rows was extracted and inserted in db for less than 20 minutes