3

I have a requirement where-in I have to fill dataTable from a sheet of Microsoft excel.

The sheet may have lots of data so the requirement is that when a foreach loop is iterated over the data table which is supposed to hold the data from Microsoft excel sheet should fill the table on demand.

Meaning if there are 1000000 records in the sheet the data table should fetch data in batches of 100 depending on the current position of the foreach current item in the loop.

Any pointer or suggestion will be appreciated.

Matt
  • 25,467
  • 18
  • 120
  • 187
user3048027
  • 387
  • 1
  • 5
  • 24
  • Take a look to NPOI libraries for read the excel file and try to do what you need. – Stefano Cavion Jun 04 '21 at 06:47
  • I have used NPOI library and ClosedXML to read excel file and load rows as batches rather in one shot. For this I have written custom enumerator,enumarable where in I have defined the batch size as configureable and in moveNext I am checking the batchsize and if required loading the next batch of rows. But the loading happens by iterating through each row from current position. This works but not great perfomance andy pointer how differently I can do it – user3048027 Jun 22 '21 at 07:44
  • 4
    throw in your sample code, someone will be able to help – Sekhar Jun 23 '21 at 03:49
  • What are your expectations? What do you consider "not great performance"? – Tu deschizi eu inchid Jun 27 '21 at 20:13

6 Answers6

0

I would suggest you to use OpenXML to parse and read your excel data from file. This will also allow you to read out specific sections/regions from your workbook.

You will find more information and also an example at this link: Microsoft Docs - Parse and read a large spreadsheet document (Open XML SDK)

This will be more efficiently and easier to develop than use the official microsoft office excel interop.

Robert Wolf
  • 191
  • 1
  • 11
  • Hi. Thanks for the input I read the link provided but the example shows how to read cell by cell. I tried to find how to read specific sections/regions or a few specific rows could not find them. Any pointer to such examples or docs – user3048027 Jun 24 '21 at 00:50
0

**I am not near a PC with Visual stuido, so this code is untested, and may have syntax errors until I can test it later.

It will still give you the main idea of what needs to be done.

private void ExcelDataPages(int firstRecord, int numberOfRecords)
{
    
    Excel.Application dataApp = new Excel.Application(); 
    Excel.Workbook dataWorkbook = new Excel.Workbook();
    int x = 0;
    
    dataWorkbook.DisplayAlerts = false;
    dataWorkbook.Visible = false;
    dataWorkbook.AutomationSecurity = Microsoft.Office.Core.MsoAutomationSecurity.msoAutomationSecurityLow;
    dataWorkbook = dataApp.Open(@"C:\Test\YourWorkbook.xlsx");
    
    try
    {
        Excel.Worksheet dataSheet = dataWorkbook.Sheet("Name of Sheet");
        
        while (x < numberOfRecords)
        {
            Range currentRange = dataSheet.Rows[firstRecord + x]; //For all columns in row 
    

            foreach (Range r in currentRange.Cells) //currentRange represents all the columns in the row
            {
                // do what you need to with the Data here.
            }
             x++;
        }
    }
    catch (Exception ex)
    {
        //Enter in Error handling
    }

    dataWorkbook.Close(false); //Depending on how quick you will access the next batch of data, you may not want to close the Workbook, reducing load time each time.  This may also mean you need to move the open of the workbook to a higher level in your class, or if this is the main process of the app, make it static, stopping the garbage collector from destroying the connection.
    dataApp.Quit();

}
Luke Attard
  • 161
  • 1
  • 4
0

Give the following a try--it uses NuGet package DocumentFormat.OpenXml The code is from Using OpenXmlReader. However, I modified it to add data to a DataTable. Since you're reading data from the same Excel file multiple times, it's faster to open the Excel file once using an instance of SpreadSheetDocument and dispose of it when finished. Since the instance of SpreedSheetDocument needs to be disposed of before your application exits, IDisposable is used.

Where it says "ToDo", you'll need to replace the code that creates the DataTable columns with your own code to create the correct columns for your project.

I tested the code below with an Excel file containing approximately 15,000 rows. When reading 100 rows at a time, the first read took approximately 500 ms - 800 ms, whereas subsequent reads took approximately 100 ms - 400 ms.

Create a class (name: HelperOpenXml)

HelperOpenXml.cs

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;
using System.Data;
using System.Diagnostics;

namespace ExcelReadSpecifiedRowsUsingOpenXml
{
    public class HelperOpenXml : IDisposable
    {
        public string Filename { get; private set; } = string.Empty;
        public int RowCount { get; private set; } = 0;

        private SpreadsheetDocument spreadsheetDocument = null;

        private DataTable dt = null;
        

        public HelperOpenXml(string filename)
        {
            this.Filename = filename;
        }

        public void Dispose()
        {
            if (spreadsheetDocument != null)
            {
                try
                {
                    spreadsheetDocument.Dispose();
                    dt.Clear();
                }
                catch(Exception ex)
                {
                    throw ex;
                }
            }
        }

        public DataTable GetRowsSax(int startRow, int endRow, bool firstRowIsHeader = false)
        {
            int startIndex = startRow;
            int endIndex = endRow;

            if (firstRowIsHeader)
            {
                //if first row is header, increment by 1
                startIndex = startRow + 1;
                endIndex = endRow + 1;
            }

            if (spreadsheetDocument == null)
            {
                //create new instance
                spreadsheetDocument = SpreadsheetDocument.Open(Filename, false);

                //create new instance
                dt = new DataTable();

                //ToDo: replace 'dt.Columns.Add(...)' below with your code to create the DataTable columns
                //add columns to DataTable
                dt.Columns.Add("A");
                dt.Columns.Add("B");
                dt.Columns.Add("C");
                dt.Columns.Add("D");
                dt.Columns.Add("E");
                dt.Columns.Add("F");
                dt.Columns.Add("G");
                dt.Columns.Add("H");
                dt.Columns.Add("I");
                dt.Columns.Add("J");
                dt.Columns.Add("K");

            }
            else
            {
                //remove existing data from DataTable
                dt.Rows.Clear(); 

            }

            WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;

            int numWorkSheetParts = 0;

            foreach (WorksheetPart worksheetPart in workbookPart.WorksheetParts)
            {
                using (OpenXmlReader reader = OpenXmlReader.Create(worksheetPart))
                {
                    int rowIndex = 0;

                    //use the reader to read the XML
                    while (reader.Read())
                    {
                        if (reader.ElementType == typeof(Row))
                        {
                            reader.ReadFirstChild();

                            List<string> cValues = new List<string>();
                            int colIndex = 0;
                            do
                            {
                                //only get data from desired rows
                                if ((rowIndex > 0 && rowIndex >= startIndex && rowIndex <= endIndex) ||
                                (rowIndex == 0 && !firstRowIsHeader && rowIndex >= startIndex && rowIndex <= endIndex))
                                {

                                    if (reader.ElementType == typeof(Cell))
                                    {
                                        Cell c = (Cell)reader.LoadCurrentElement();

                                        string cellRef = c.CellReference; //ex: A1, B1, ..., A2, B2

                                        string cellValue = string.Empty;

                                        //string/text data is stored in SharedString
                                        if (c.DataType != null && c.DataType == CellValues.SharedString)
                                        {
                                            SharedStringItem ssi = workbookPart.SharedStringTablePart.SharedStringTable.Elements<SharedStringItem>().ElementAt(int.Parse(c.CellValue.InnerText));

                                            cellValue = ssi.Text.Text;
                                        }
                                        else
                                        {
                                            cellValue = c.CellValue.InnerText;
                                        }

                                        //Debug.WriteLine("{0}: {1} ", c.CellReference, cellValue);

                                        //add value to List which is used to add a row to the DataTable
                                        cValues.Add(cellValue);
                                    }
                                }

                                colIndex += 1; //increment

                            } while (reader.ReadNextSibling());

                            if (cValues.Count > 0)
                            {
                                //if List contains data, use it to add row to DataTable
                                dt.Rows.Add(cValues.ToArray()); 
                            }

                            rowIndex += 1; //increment

                            if (rowIndex > endIndex)
                            {
                                break; //exit loop
                            }
                        }
                    }
                }

                numWorkSheetParts += 1; //increment
            }

            DisplayDataTableData(dt); //display data in DataTable

            return dt;
        }

        
        private void DisplayDataTableData(DataTable dt)
        {
            foreach (DataColumn dc in dt.Columns)
            {
                Debug.WriteLine("colName: " + dc.ColumnName);
            }

            foreach (DataRow r in dt.Rows)
            {
                Debug.WriteLine(r[0].ToString() + " " + r[1].ToString());
            }
        }

    }
}

Usage:

private string excelFilename = @"C:\Temp\Test.xlsx";
private HelperOpenXml helperOpenXml = null;

            ...

private void GetData(int startIndex, int endIndex, bool firstRowIsHeader)
{
    helperOpenXml.GetRowsSax(startIndex, endIndex, firstRowIsHeader);
}

Note: Make sure to call Dispose() (ex: helperOpenXml.Dispose();) before your application exits.

Update:

OpenXML stores dates as the number of days since 01 Jan 1900. For dates prior to 01 Jan 1900, they are stored in SharedString. For more info see Reading a date from xlsx using open xml sdk

Here's a code snippet:

Cell c = (Cell)reader.LoadCurrentElement();
             ...
string cellValue = string.Empty
             ...
cellValue = c.CellValue.InnerText;

double dateCellValue = 0;
Double.TryParse(cellValue, out dateCellValue);

DateTime dt = DateTime.FromOADate(dateCellValue);

cellValue = dt.ToString("yyyy/MM/dd");
Tu deschizi eu inchid
  • 4,117
  • 3
  • 13
  • 24
  • This approach I am facing issue with cells which have date value and cells which have a large number and stored in format 1.71E + 15. Can anyone help – user3048027 Jun 28 '21 at 10:34
  • @user3048027: You haven't provided any sample data. I added a code snippet to the end of the post above to show how to convert from an int value to a Date value when the cell contains a Date value. Not sure what issue you're facing with "1.71E+15". If desired one can use `Decimal.TryParse` to convert the string value `1.71E+15` to Decimal. Then use `Decimal.ToString(...)` to convert it to the desired string format. – Tu deschizi eu inchid Jun 28 '21 at 19:52
0

I use this code with EPPlus DLL, Don't forget to add reference. But should check to match with your requirement.

public DataTable ReadExcelDatatable(bool hasHeader = true)
{
    using (var pck = new OfficeOpenXml.ExcelPackage())
    {
        using (var stream = File.OpenRead(this._fullPath))
        {
            pck.Load(stream);
        }

        var ws = pck.Workbook.Worksheets.First();

        DataTable tbl = new DataTable();

        int i = 1;
        foreach (var firstRowCell in ws.Cells[1, 1, 1, ws.Dimension.End.Column])
        {
            //table head
            tbl.Columns.Add(hasHeader ? firstRowCell.Text : string.Format("Column {0}", firstRowCell.Start.Column));

            tbl.Columns.Add(_tableHead[i]);
            i++;
        }

        var startRow = hasHeader ? 2 : 1;
        
        for (int rowNum = startRow; rowNum <= ws.Dimension.End.Row; rowNum++)
        {
            var wsRow = ws.Cells[rowNum, 1, rowNum, ws.Dimension.End.Column];
            DataRow row = tbl.Rows.Add();
            foreach (var cell in wsRow)
            {
                row[cell.Start.Column - 1] = cell.Text;
            }
        }

        return tbl;
    }
}
0

Another simple alternative is this: Take a look at the NUGET package ExcelDataReader, with additional information on https://github.com/ExcelDataReader/ExcelDataReader

Usage example:

[Fact] 
void Test_ExcelDataReader() 
{
    
    System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
    var scriptPath = Path.GetDirectoryName(Util.CurrentQueryPath); // LinqPad script path
    var filePath = $@"{scriptPath}\TestExcel.xlsx";
    using (var stream = File.Open(filePath, FileMode.Open, FileAccess.Read))
    {
        // Auto-detect format, supports:
        //  - Binary Excel files (2.0-2003 format; *.xls)
        //  - OpenXml Excel files (2007 format; *.xlsx, *.xlsb)
        using (var reader = ExcelDataReader.ExcelReaderFactory.CreateReader(stream))
        {
            var result = reader.AsDataSet();
            // The result of each spreadsheet is in result.Tables
            var t0 = result.Tables[0];
            Assert.True(t0.Rows[0][0].Dump("R0C0").ToString()=="Hello", "Expected 'Hello'");
            Assert.True(t0.Rows[0][1].Dump("R0C1").ToString()=="World!", "Expected 'World!'");          
        } // using
    } // using
} // fact

Before you start reading, you need to set and encoding provider as follows:

 System.Text.Encoding.RegisterProvider(
      System.Text.CodePagesEncodingProvider.Instance);

The cells are addressed the following way:

 var t0 = result.Tables[0]; // table 0 is the first worksheet
 var cell = t0.Rows[0][0];  // on table t0, read cell row 0 column 0

And you can easily loop through the rows and columns in a for loop as follows:

for (int r = 0; r < t0.Rows.Count; r++)
{
    var row = t0.Rows[r];
    var columns = row.ItemArray;
    for (int c = 0; c < columns.Length; c++)
    {
        var cell = columns[c];
        cell.Dump();
    }
}
Matt
  • 25,467
  • 18
  • 120
  • 187
0

I'm going to give you a different answer. If the performance is bad loading a million rows into a DataTable resort to using a Driver to load the data: How to open a huge excel file efficiently

DataSet excelDataSet = new DataSet();

string filePath = @"c:\temp\BigBook.xlsx";

// For .XLSXs we use =Microsoft.ACE.OLEDB.12.0;, for .XLS we'd use Microsoft.Jet.OLEDB.4.0; with  "';Extended Properties=\"Excel 8.0;HDR=YES;\"";
string connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + filePath + "';Extended Properties=\"Excel 12.0;HDR=YES;\"";

using (OleDbConnection conn = new OleDbConnection(connectionString))
{
    conn.Open();
    OleDbDataAdapter objDA = new System.Data.OleDb.OleDbDataAdapter
    ("select * from [Sheet1$]", conn);
    objDA.Fill(excelDataSet);
    //dataGridView1.DataSource = excelDataSet.Tables[0];
}

Next filter the DataSet's DataTable using a DataView. Using a DataView's RowFilter property you can specify subsets of rows based on their column values.

DataView prodView = new DataView(excelDataSet.Tables[0],  
"UnitsInStock <= ReorderLevel",  
"SupplierID, ProductName",  
DataViewRowState.CurrentRows); 

Ref: https://www.c-sharpcorner.com/article/dataview-in-C-Sharp/

Or you could use the DataTables' DefaultView RowFilter directly:

excelDataSet.Tables[0].DefaultView.RowFilter = "Amount >= 5000 and Amount <= 5999 and Name = 'StackOverflow'";
Jeremy Thompson
  • 61,933
  • 36
  • 195
  • 321