0

I am using the below code to trim all cells in my DataTable.

The problem is, that I am doing it through a loop, and depending on what I fill the DataTable with, if it has 1500 rows and 20 columns, the loop takes a really, really long time.

DataColumn[] stringColumns = dtDataTable.Columns.Cast<DataColumn>().Where(c => c.DataType == typeof(string)).ToArray();
foreach (DataRow row in dtDataTable.Rows)
{
    foreach (DataColumn col in stringColumns)
    {
        if (row[col] != DBNull.Value)
        {
            row.SetField<string>(col, row.Field<string>(col).Trim());
        }
    }
}

And here is how I am importing my Excel sheet to the DataTable:

using (OpenFileDialog ofd = new OpenFileDialog() { Title = "Select File", Filter = "Excel WorkBook|*.xlsx|Excel WorkBook 97-2003|*.xls|All Files(*.*)|*.*", Multiselect = false, ValidateNames = true })
{               
    if (ofd.ShowDialog() == DialogResult.OK)
    {
        String PathName = ofd.FileName;
        FileName = System.IO.Path.GetFileNameWithoutExtension(ofd.FileName);
            
        strConn = string.Empty;

        FileInfo file = new FileInfo(PathName);
        if (!file.Exists) { throw new Exception("Error, file doesn't exists!"); }
        string extension = file.Extension;
        switch (extension)
        {
            case ".xls":
                strConn = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + PathName + ";Extended Properties='Excel 8.0;HDR=Yes;IMEX=1;'";
            case ".xlsx":
                strConn = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + PathName + ";Extended Properties='Excel 12.0;HDR=Yes;IMEX=1;'";
            default:
                strConn = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + PathName + ";Extended Properties='Excel 8.0;HDR=Yes;IMEX=1;'";
        }
    }
    else
    {
        return;
    }
}
    
using (OleDbConnection cnnxls = new OleDbConnection(strConn))
{
    using (OleDbDataAdapter oda = new OleDbDataAdapter(string.Format("select * from [{0}$]", "Sheet1"), cnnxls))
    {
        oda.Fill(dtDataTableInitial);
    }
}


//Clone dtDataTableInitial so that I can have the new DataTable in String Type
dtDataTable = dtDataImportInitial.Clone();
foreach (DataColumn col in dtDataTable.Columns)
{
    col.DataType = typeof(string);
}
foreach (DataRow row in dtDataImportInitial.Rows)
{
    dtDataTable.ImportRow(row);
}

Is there a more efficient way of accomplishing this?





EDIT: As per JQSOFT's suggestion, I am using OleDbDataReader now, but am still running two issues:

One: SELECT RTRIM(LTRIM(*)) FROM [Sheet1$] doesn't seem to work.

I know that it is possible to select each column one by one, but the number of and header of the columns in the excel sheet is random, and I am not sure how to adjust my SELECT string to account for this.

Two: A column whose rows are mostly populated with numbers, but have a few rows with letters seem to have those rows with letters omitted. For example:
Col1
1
2
3
4
5
6
a
b

Becomes:
Col1
1
2
3
4
5
6

However, I have discovered that if I manually go into the excel sheet and convert the entire table cell format to "Text", this issue is resolved. However, doing this converts any dates in that excel sheet into unrecognizable strings of numbers, so I want to avoid doing this if at all possible.
For example: 7/2/2020 becomes 44014 if converted to "Text".



Here is my new code:

private void Something()
{
    if (ofd.ShowDialog() == DialogResult.OK)
    {
        PathName = ofd.FileName;
        FileName = System.IO.Path.GetFileNameWithoutExtension(ofd.FileName);

        strConn = string.Empty;

        FileInfo file = new FileInfo(PathName);
        if (!file.Exists) { throw new Exception("Error, file doesn't exists!"); }
    }

    using (OleDbConnection cn = new OleDbConnection { ConnectionString = ConnectionString(PathName, "No") })
    {
        using (OleDbCommand cmd = new OleDbCommand { CommandText = query, Connection = cn })
        {
            cn.Open();
            OleDbDataReader dr = cmd.ExecuteReader();
            dtDataTable.Load(dr);
        }
    }

    dataGridView1.DataSource = dtDataTable;
}               

public string ConnectionString(string FileName, string Header)
{
    OleDbConnectionStringBuilder Builder = new OleDbConnectionStringBuilder();
    if (Path.GetExtension(FileName).ToUpper() == ".XLS")
    {
        Builder.Provider = "Microsoft.Jet.OLEDB.4.0";
        Builder.Add("Extended Properties", string.Format("Excel 8.0;IMEX=1;HDR=Yes;", Header));
    }
    else
    {
        Builder.Provider = "Microsoft.ACE.OLEDB.12.0";
        Builder.Add("Extended Properties", string.Format("Excel 12.0;IMEX=1;HDR=Yes;", Header));
    }

    Builder.DataSource = FileName;

    return Builder.ConnectionString;
}
lolikols
  • 87
  • 1
  • 5
  • 24
  • Why do you trim them in the first place – Legacy Code Jul 01 '20 at 17:46
  • I would implement the trimming in SQL. Set based operations will always be faster than loops. You are doing a loop in a loop, so performance will degrade geometrically with number rows/columns. – AaronLS Jul 01 '20 at 17:47
  • @LegacyCode Well, I'm importing an excel sheet into my application, and sometimes there are leading and trailing whitespaces that shouldn't be there. – lolikols Jul 01 '20 at 17:51
  • @AaronLS My apologies - I have almost no experience with SQL - I am importing an Excel sheet to my DataTable. – lolikols Jul 01 '20 at 17:52
  • @lolikols Please [edit] the question and include the code that you use to import the Excel sheet into a DataTable. – 41686d6564 stands w. Palestine Jul 01 '20 at 17:55
  • Consider using `Parallel.ForEach`. https://learn.microsoft.com/en-ca/dotnet/api/system.threading.tasks.parallel.foreach?view=netcore-3.1 – Legacy Code Jul 01 '20 at 17:56
  • @AhmedAbdelhameed Done! – lolikols Jul 01 '20 at 18:04
  • @LegacyCode Hey - just tested Parallel.ForEach. Thanks for that! It's a bit faster, but it is still very time consuming :(. – lolikols Jul 01 '20 at 19:10
  • You could use the `OleDbDataReader` instead of `OleDbDataAdapter` to read from the Excel file, create and add columns and rows to the `DataTable` as you read. This way, you will be able to modify the values as you wish, save sometime, and get rid of the first code block. –  Jul 01 '20 at 23:35
  • @JQSOFT Hey, it's you again! How would this impact import speeds? I'll have to do some reading on this - I've never used Reader, but it looks like it requires a query string, is that right? – lolikols Jul 02 '20 at 00:54
  • Yes it will speed things up because you don't need to double loop again 1500x20 times to trim the string values. Use the reader to 1) Get the fields names and create DataColumn objects. 2) Get the records and create DataRow objects. Something like [this](https://stackoverflow.com/a/62226247/10216583). –  Jul 02 '20 at 07:15
  • 1
    @JQSOFT Thanks! I'll give that a read :). Much appreciated! – lolikols Jul 02 '20 at 14:07
  • So, I gave that a read. To the best of my understanding, I would need to use something like: "SELECT RTRIM(LTRIM(ColumnName)) AS TrimmedColumn FROM TableName" to trim each column - however, the number of columns in my table are random, so there is no way for me to trim them this way :(. – lolikols Jul 02 '20 at 14:12
  • Ah - also noticed when populating the DataTable using "SELECT * FROM [Sheet1$]"...if a column has mostly rows with only numbers, and then 3 of the rows in that column (for example) have letters, the letters are omitted and this is causing an issue for me. – lolikols Jul 02 '20 at 20:04
  • 1
    Updated original post. – lolikols Jul 02 '20 at 20:41
  • Often. using of SQL Server commands may be better than using linq statements. – isaeid Jul 02 '20 at 20:53
  • How often is row.Field(col).Trim() the same as row.Field(col)? – Ian Ringrose Jul 03 '20 at 08:29

1 Answers1

1

OleDb Objects


Actually what I meant is, to get formatted/trimmed string values from the Excel Sheet and create a DataTable with DataColumn objects of string type only, use the forward-only OleDbDataReader to create both, DataColumn and DataRow objects as it reads. Doing so, the data will be modified and filled in one step hence no need to call another routine to loop again and waste some more time. Also, consider using asynchronous calls to speed up the process and avoid freezing the UI while executing the lengthy task.

Something might help you to go:

private async void TheCaller()
{
    using (var ofd = new OpenFileDialog
    {
        Title = "Select File",
        Filter = "Excel WorkBook|*.xlsx|Excel WorkBook 97 - 2003|*.xls|All Files(*.*)|*.*",
        AutoUpgradeEnabled = true,
    })
    {
        if (ofd.ShowDialog() != DialogResult.OK) return;

        var conString = string.Empty;
        var msg = "Loading... Please wait.";

        try
        {
            switch (ofd.FilterIndex)
            {
                case 1: //xlsx
                    conString = $"Provider=Microsoft.ACE.OLEDB.12.0;Data Source={ofd.FileName};Extended Properties='Excel 12.0;HDR=Yes;IMEX=1;'";                            
                    break;
                case 2: //xls
                    conString = $"Provider=Microsoft.Jet.OLEDB.4.0;Data Source={ofd.FileName};Extended Properties='Excel 8.0;HDR=Yes;IMEX=1;'";
                    break;
                default:
                    throw new FileFormatException();
            }

            var sheetName = "sheet1";
            var dt = new DataTable();

            //Optional: a label to show the current status
            //or maybe show a ProgressBar with ProgressBarStyle = Marquee
            lblStatus.Text = msg;

            await Task.Run(() =>
            {
                using (var con = new OleDbConnection(conString))
                using (var cmd = new OleDbCommand($"SELECT * From [{sheetName}$]", con))
                {
                    con.Open();

                    using (var r = cmd.ExecuteReader())
                        while (r.Read())
                        {
                            if (dt.Columns.Count == 0)
                                for (var i = 0; i < r.FieldCount; i++)
                                    dt.Columns.Add(r.GetName(i).Trim(), typeof(string));

                            object[] values = new object[r.FieldCount];

                            r.GetValues(values);
                            dt.Rows.Add(values.Select(x => x?.ToString().Trim()).ToArray());
                        }
                }
            });

            //If you want...
            dataGridView1.DataSource = null;
            dataGridView1.DataSource = dt;

            msg = "Loading Completed";
        }
        catch (FileFormatException)
        {
            msg = "Unknown Excel file!";
        }
        catch (Exception ex)
        {
            msg = ex.Message;
        }
        finally
        {
            lblStatus.Text = msg;
        }
    }
}

Here's a demo, reading sheets with 8 columns and 5000 rows from both xls and xlsx files:

SOA62682518

Less than a second. Not bad.

However, this will not work correctly if the Sheet has mixed-types columns like your case where the third column has string and int values in different rows. That because the data type of a column is guessed in Excel by examining the first 8 rows by default. Changing this behavior requires changing the registry value of TypeGuessRows in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\x.0\Engines\Excel from 8 to 0 to force checking all the rows instead of just the first 8. This action will dramatically slow down the performance.

Office Interop Objects


Alternatively, you could use the Microsoft.Office.Interop.Excel objects to read the Excel Sheet, get and format the values of the cells regardless of their types.

using Excel = Microsoft.Office.Interop.Excel;
//...

private async void TheCaller()
{
    using (var ofd = new OpenFileDialog
    {
        Title = "Select File",
        Filter = "Excel WorkBook|*.xlsx|Excel WorkBook 97 - 2003|*.xls|All Files(*.*)|*.*",
        AutoUpgradeEnabled = true,
    })
    {
        if (ofd.ShowDialog() != DialogResult.OK) return;

        var msg = "Loading... Please wait.";
        Excel.Application xlApp = null;
        Excel.Workbook xlWorkBook = null;

        try
        {
            var dt = new DataTable();

            lblStatus.Text = msg;

            await Task.Run(() =>
            {
                xlApp = new Excel.Application();
                xlWorkBook = xlApp.Workbooks.Open(ofd.FileName, Type.Missing, true);

                var xlSheet = xlWorkBook.Sheets[1] as Excel.Worksheet;
                var xlRange = xlSheet.UsedRange;

                dt.Columns.AddRange((xlRange.Rows[xlRange.Row] as Excel.Range)
                .Cells.Cast<Excel.Range>()
                .Where(h => h.Value2 != null)
                .Select(h => new DataColumn(h.Value2.ToString()
                .Trim(), typeof(string))).ToArray());

                foreach (var r in xlRange.Rows.Cast<Excel.Range>().Skip(1))
                    dt.Rows.Add(r.Cells.Cast<Excel.Range>()
                        .Take(dt.Columns.Count)
                        .Select(v => v.Value2 is null
                        ? string.Empty
                        : v.Value2.ToString().Trim()).ToArray());
            });

            (dataGridView1.DataSource as DataTable)?.Dispose();
            dataGridView1.DataSource = null;
            dataGridView1.DataSource = dt;

            msg = "Loading Completed";
        }
        catch (FileFormatException)
        {
            msg = "Unknown Excel file!";
        }
        catch (Exception ex)
        {
            msg = ex.Message;
        }
        finally
        {
            xlWorkBook?.Close(false);
            xlApp?.Quit();

            Marshal.FinalReleaseComObject(xlWorkBook);
            Marshal.FinalReleaseComObject(xlApp);

            xlWorkBook = null;
            xlApp = null;

            GC.Collect();
            GC.WaitForPendingFinalizers();

            lblStatus.Text = msg;
        }
    }
}

Note: You need to add reference to the mentioned library.

SOA62682518B

Not fast especially with a big number of cells but it gets the desired output.

  • Ah - this works like magic! However, I noticed that if the first row (or maybe it's the column) contains objects like a radio button) in the excel sheet, the code doesn't process and just hangs. Is there a way to adjust that code to ignore the first row if there are objects? – lolikols Jul 06 '20 at 13:40
  • Sometimes the first cell in the excel sheet contains the words "Junk Data", if that helps - and that row contains objects like radio buttons. – lolikols Jul 06 '20 at 14:22
  • @lolikols Can't reproduce the problem. The code works for me unless its a Macro-Enabled Sheet. (xlsm). Can you share the file with some fake data to test it? –  Jul 06 '20 at 14:59
  • Nevermind, so sorry - it turns out it was a line of code that I forgot to comment out. I'm still running into an issue where if a column has mostly numbers, and there are a few letters - the letters are ignored completely. Here's a sample - in the last 3 rows there are letters in column 3 that will be omitted by the import process: https://www.dropbox.com/s/nxey2zl6jyvmrz3/SS%20TEST.xlsx?dl=0 – lolikols Jul 06 '20 at 15:08
  • @lolikols The problem is that you have a mixed-type column, character/string and integer types. The Provider uses the first 8 rows to _guess_ the columns data type. Therefore, in our case here, the provider defines the type of `Col3` as Integer thus the last values with string values are omitted. You need to edit some registry keys to change this behavior and force the provider check all the rows to identify the type. Read [this](https://stackoverflow.com/q/4551203/10216583) post for details. –  Jul 06 '20 at 15:52
  • @lolikols Also check [Excel connection strings](https://www.connectionstrings.com/excel/) has more about this issue. –  Jul 06 '20 at 16:13
  • Thank you very much! – lolikols Jul 06 '20 at 16:27
  • Hey, wow thanks! I actually figured out a workaround: if I left HDR=No in my connection string, then the entire sheet gets imported in string form. Then I would loop through the columns and assign headers. But your method is a LOT less goofy than my workaround! I wish you had a patreon or something - because this helped me out quite a bit. Thank you so much! – lolikols Jul 07 '20 at 18:01
  • @lolikols Welcome buddy. It just the _problem solving and learning_ addiction. See you. –  Jul 07 '20 at 18:01