4

Setup Environment:

I'm developing an Excel 2010 Application Level Add-in using vb.net with .NET framework 4.


My goal:

  1. Have user type in multiple names to search for
  2. Use the list of names to execute a SQL query on a LARGE spreadsheet (30,000+ rows)
  3. Return the recordset and paste into new worksheet

Performance is my priority. I'd like to know the fastest way to do this by leveraging the .NET framework.

Using an ADO Connection Object in my code works, but the process takes too long (5 - 8 seconds).


This is the SQL query I'm using on the table named wells:

    SELECT * 
    FROM wells 
    WHERE padgroup in 

    (SELECT padgroup 
     FROM wells 
     WHERE name LIKE 'TOMCHUCK 21-30'
             OR name LIKE 'FEDERAL 41-25PH')


Here's a portion of what the table looks like:

Excel Table


I'm using this code right now to create an ADO Connection Object to retrieve my results:

    'Create Recordset Object
    rsCon = CreateObject("ADODB.Connection")
    rsData = CreateObject("ADODB.Recordset")

    rsCon.Open(szConnect)
    rsData.Open(mySQLQueryToExecute, rsCon, 0, 1, 1)

    'Check to make sure data is received, then copy the data
    If Not rsData.EOF Then

        TargetRange.Cells(1, 1).CopyFromRecordset(rsData)

    Else

        MsgBox("No records returned from : " & SourceFile, vbCritical)

    End If

    'Clean up the Recordset object
    rsData.Close()
    rsData = Nothing
    rsCon.Close()
    rsCon = Nothing


Based on what I know, Excel spreadsheets are stored in the Open XML format and the .NET framework includes native support for parsing XML.

After researching it, I came across a few different options:


Could somebody provide a pointer on what would be the best method to use? I'd really appreciate it.


Additional Notes:

  • All queries need to be able to perform without connectivity to an online database
  • I only need access to the spreadsheet once to extract the raw data from rows


Right now I just embed the spreadsheet as a project resource.

Then, at run-time I create the file, run the query, store the results in-memory, and delete the file.

   'Create temp file path in the commonapplicationdata folder
    Dim excelsheetpath As StringBuilder

    excelsheetpath = New StringBuilder(Environment.GetFolderPath(Environment.SpecialFolder.CommonApplicationData))

    excelsheetpath.Append("\MasterList.xlsm")

    'Save resources into temp location in HD
    System.IO.File.WriteAllBytes(excelsheetpath.ToString, My.Resources.MasterList)

    'Now call the function to use ADO to get records from the MasterList.xlsm file just created
    GetData(excelsheetpath.ToString, "Sheet1", "A1:S40000", True, False)

    'Store the results in-memory and display by adding to a datagridview control (in a custom task pane)

    'Delete the spreadsheet
    System.IO.File.Delete(excelsheetpath.ToString())
Brock Gion
  • 926
  • 2
  • 9
  • 21
  • 1
    is this part of a bigger application? is it an open type name and close kind of thing? they way you load the data and when to load it plays big based on the type of app, Although its messy I like to use the Excel Interop, it seems to be the most feature rich option, although if you only need the data and not information about the cell, you might be best off with Linq you can [Get Some More Information on Linq and Excel Here](http://blogs.msdn.com/b/ericwhite/archive/2008/07/09/open-xml-sdk-and-linq-to-xml.aspx) – user2140261 Mar 14 '14 at 16:26
  • @user2140261 The link looks promising, I'll give it a try and report back. Right now, to load the data, I just embed the excel sheet as a project resource. Then, in the code, I temporarily create file, run the query, delete the file. I updated my question to include this code. – Brock Gion Mar 14 '14 at 17:03
  • This may be evaluating that subquery every row. I would run the query independently and use it to manually build up the first. Use distinct. – paparazzo Mar 14 '14 at 17:07
  • Run the sub-query. Use the results from the subquery to build up a hard coded WHERE padgroup in ('3492','3494'). You might have a situation where the sub-query is run for every row. Pretty simple - test a WHERE padgroup in ('3492','3494') – paparazzo Mar 14 '14 at 17:14
  • @Blam I put `SELECT * FROM wells WHERE padgroup in (3492, 3494)` Tested it out and the difference is minimal. Still averages about 5-8 seconds. Also, side not, I just ran this query using VBA (without any extra code) and it still takes that 5-8 seconds. I think it's something to do with the time to setup the ADO connection and not the query itself – Brock Gion Mar 14 '14 at 18:25
  • It was worth a try. Select top 1 with no where would tell you more about the setup time. – paparazzo Mar 14 '14 at 18:45

3 Answers3

2

You are doing VSTO the wrong way ;) Don't use SQL with Excel. If you have the need for speed, exploit VSTO and the native Excel APIs. You can skip the overhead of the ADODB/OLEDB layers and go straight to the Excel object model to use the blazing fast Autofilter in Excel, the SpecialCells method to get only visible cells into a multi-area range, and the Value method to quickly copy a range to an array.

Here is a sample VSTO 2010 customized workbook that quickly searches against a list of 58k words for words containing "aba", "cat", or "zon".

using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Xml.Linq;
using Microsoft.Office.Tools.Excel;
using Microsoft.VisualStudio.Tools.Applications.Runtime;
using Excel = Microsoft.Office.Interop.Excel;
using Office = Microsoft.Office.Core;

namespace ExcelWorkbook1
{
    public partial class ThisWorkbook
    {
        private void ThisWorkbook_Startup(object sender, System.EventArgs e)
        {
            const int Sheet1 = 1; // you can use Linq to find a sheet by name if needed
            const int ColumnB = 2;
            List<List<object>> results = Query(Sheet1, ColumnB, "aba", "cat", "zon");

            foreach (List<object> record in results)
            {
                System.Diagnostics.Debug.Print("{0,-10} {1,30} {2}", record[0], record[1], record[2]);
            }
        }

        private void ThisWorkbook_Shutdown(object sender, System.EventArgs e)
        {
        }

        /// <summary>
        ///     Removes any existing Excel autofilters from the worksheet
        /// </summary>
        private void ClearFilter(Microsoft.Office.Interop.Excel._Worksheet worksheet)
        {
            if (worksheet.AutoFilter != null)
            {
                worksheet.Cells.AutoFilter();
            }
        }

        /// <summary>
        ///     Applies an Excel Autofilter to the worksheet for search for an array of substring predicates
        /// </summary>
        private void ApplyFilter(Microsoft.Office.Interop.Excel._Worksheet worksheet, int column, params string[] predicates)
        {
            string[] criteria = new string[predicates.Length];
            int i = 0;

            ClearFilter(worksheet);

            foreach (string value in predicates)
            {
                criteria[i++] = String.Concat("=*", value, "*");
            }

            worksheet.Cells.AutoFilter(column, criteria, Excel.XlAutoFilterOperator.xlOr); 
        }

        /// <summary>
        ///     Returns a list of rows that are hits on a search for an array of substrings in Column B of Sheet1
        /// </summary>
        private List<List<object>> Query(int sheetIndex, int columnIndex, params string[] words)
        {
            Microsoft.Office.Interop.Excel._Worksheet worksheet;
            Excel.Range range;
            List<List<object>> records = new List<List<object>>();
            List<object> record;
            object[,] cells;
            object value;
            int row, column, rows, columns;
            bool hit;

            try
            {
                worksheet = (Microsoft.Office.Interop.Excel._Worksheet)Globals.ThisWorkbook.Sheets[sheetIndex];
                if (null == worksheet)
                {
                    return null;
                }

                // apply the autofilter
                ApplyFilter(worksheet, columnIndex, words);

                // get the 
                range = worksheet.Range["$A:$C"].SpecialCells(Excel.XlCellType.xlCellTypeVisible);
                foreach (Excel.Range subrange in range.Areas)
                {
                    // copy the cells to a multidimensional array for perfomance
                    cells = subrange.Value;

                    // transform the multidimensional array to a List
                    for (row = cells.GetLowerBound(0), rows = cells.GetUpperBound(0); row <= rows; row++)
                    {
                        record = new List<object>();
                        hit = false;

                        for (column = cells.GetLowerBound(1), columns = cells.GetUpperBound(1); column <= columns; column++)
                        {
                            value = cells[row, column];
                            hit = hit || (null != value);

                            if (hit)
                            {
                                record.Add(cells[row, column]);
                            }
                        }

                        if (hit)
                        {
                            records.Add(record);
                        }
                    }
                }
            }
            catch { }
            finally
            {
                // use GC.Collect over Marshal.ReleaseComObject() to release all RCWs per http://stackoverflow.com/a/17131389/1995977 and more
                cells = null;
                GC.Collect();
                GC.WaitForPendingFinalizers();
            }

            return records;
        }

        #region VSTO Designer generated code

        /// <summary>
        /// Required method for Designer support - do not modify
        /// the contents of this method with the code editor.
        /// </summary>
        private void InternalStartup()
        {
            this.Startup += new System.EventHandler(ThisWorkbook_Startup);
            this.Shutdown += new System.EventHandler(ThisWorkbook_Shutdown);
        }

        #endregion

    }
}
Steve Jansen
  • 9,398
  • 2
  • 29
  • 34
0

Excel 2010 files are not quite XML. Take an XLSX (or XMSM) file, and rename it with the .zip extension. Then extract it to a new folder. The files in the sub-folders will be XML files, yes, but the actual XLSX file is a zip file containing a collection of folders containing XML files.

Your best bet, I think, is to use the ACE drivers (JET is not supported any more) and access it as via ODBC. If this is not fast enough, you may be able to take extracts at certain times and upload them to a Database which you can run your queries from; the queries should be faster, but will be potentially out of date.

simon at rcl
  • 7,326
  • 1
  • 17
  • 24
  • Thanks for the idea - I'll try it and report back. Also, I forgot to specify this in the question - all queries need to be able to perform without connectivity to an online database. – Brock Gion Mar 14 '14 at 16:43
0

My Solution:

I tried three different approaches:

  • ADO Connection Object with SQL (slowest)
  • VSTO and Excel's Autofilter (reliable)
  • LINQ to XML (fastest)

LINQ to XML offered the best performance. I converted my table into an XML file:

XML Table


Then, in my code, I used StringReader to bring in the XMLwellData file (which is saved as a project resource).

    'welldoc will be the file to do queries on using LINQ to XML
    Dim stream As System.IO.StringReader
    stream = New StringReader(My.Resources.XMLwellData)

    welldoc = XDocument.Load(stream)

    'clean up stream now that it's no longer needed
    stream.Close()
    stream.Dispose()


    '***** later in the code perform my query on XML file *********

        Dim query = _
     From well In welldoc.<wellList>.<well> _
     Where well.<name>.Value Like "TOMCHUCK 21-30" _
     Select well

     For Each well in query

        MessageBox.Show(well.<padgroup>.value)

     Next

This was amazingly simple to do what I wanted and best of all it was FAST.

Thank you for all the help and suggestions. It made all the difference for me figuring this out.


Alternative Method using Excel's Autofilter


If you're trying to use the code suggested in the other answer, this will only filter on two values:

    worksheet.Cells.AutoFilter(column, criteria, Excel.XlAutoFilterOperator.xlOr);


So, to filter with multiple criteria using Excel's Auotfilter, you must pass your arguments as an array and filter on xlFilterValues.


    Dim wrkbk As Excel.Workbook
    Dim wrksht As Excel.Worksheet
    Dim myRange As Excel.Range
    Dim cell As Excel.Range

    'You would add all of your wellnames to search to this List
    Dim wellNames As New List(Of String)

    wrksht = wrkbk.Sheets(1)

    'In my excel file there is a Named Range which includes the all the information
    myRange = wrksht.Range("WellLookUpTable")

    'Notice, for `Criteria1:=` you MUST convert the List to an array
    With wrksht.Range("WellLookUpTable")
        .AutoFilter(Field:=2, Criteria1:=wellNames.ToArray, Operator:=Excel.XlAutoFilterOperator.xlFilterValues)
    End With

    myRange = wrksht.Range("A2", wrksht.Range("A2").End(Excel.XlDirection.xlDown)).Cells.SpecialCells(Excel.XlCellType.xlCellTypeVisible)

    For Each cell In myRange

        'column 11 is padgroup
        MessageBox.Show(cell.Offset(0, 11).Value)

    Next
Brock Gion
  • 926
  • 2
  • 9
  • 21
  • Interesting solution, but, it's apples to oranges. ADO and VSTO are both querying data stored in an Excel worksheet, while XML-LINQ is querying an XML file compiled into a resource. The first two allow creating/modifying/deleting data in the Excel UI, the latter doesn't. The fastest option would be an indexed array of string constants compiled into your .Net source code ;) Glad it worked out for you. – Steve Jansen Mar 26 '14 at 22:39
  • @SteveJansen I agree - it's hard for me to say that one is really better than the other. I just know for my application the XML solution provided exactly what I needed. Thank you again for your answer. Going through the code taught me a few things – Brock Gion Mar 26 '14 at 22:46