-1

(Alternate Title: ReadAllLines Analogue for Excel?: What's the Best Way to Load and Manipulate Excel Data?)

I would like to quickly crack open an excel sheet and perform text manipulations. I want the operation to work like ReadAllLines (https://msdn.microsoft.com/en-us/library/s2tte0y1(v=vs.110).aspx) but for Excel.

I found the following question which is on point but seven years old. Reading Excel files from C# (Furthermore, it is an historically significant question that is frozen. Moreover, I do not have 50 points so I would not be able to comment were it open.) I cut and pasted Robin Robinson's answer into Visual Studio, changing only the path:

var fileName = string.Format("{0}\\fileNameHere", Directory.GetCurrentDirectory());
var connectionString = string.Format("Provider=Microsoft.Jet.OLEDB.4.0; data source={0}; Extended Properties=Excel 8.0;", fileName);

var adapter = new OleDbDataAdapter("SELECT * FROM [workSheetNameHere$]", connectionString);
var ds = new DataSet();

adapter.Fill(ds, "anyNameHere");

var data = ds.Tables["anyNameHere"].AsEnumerable();

var query = data.Where(x => x.Field<string>("id") != string.Empty).Select(x =>
            new 
            {
                id= x.Field<string>("id"),
            });

That fails on the Fill method with 'External table is not in the expected format.'

Are good programmers doing it this way today? i.e., should I investigate Jet and see if there are updates for the latest Excel, or is there a new and improved way?

Community
  • 1
  • 1
jacoblambert
  • 787
  • 1
  • 11
  • 18
  • You've given no indication what happened other than "does not seem to work". It's entirely possible that it *will* work but you've got a configuration failure. Rather than ask a new and vague question, you'd have been better off adding a comment to the existing answer. I understand you don't have enough rep to comment yet - but that doesn't mean that asking a bad question is the right way forward. – Jon Skeet Mar 03 '15 at 11:34
  • Thanks - but I'm not looking for you to correct the problem. I'm asking whether this same practice is recommended nearly seven years later. If it is, then I'll worry about why it doesn't work. I'll work on refining the question through these comments. – jacoblambert Mar 03 '15 at 11:49
  • I think it's unlikely that this will really come out as a good, on-topic question to be honest. – Jon Skeet Mar 03 '15 at 11:53
  • Thanks, Jon. Here's what I'm driving at: I want to quickly pull in an Excel file to be manipulated by a Console app. There's LINQ to XML, LINQ to SQL, I kind of expected a LINQ to Excel. Is OLEDB the best way? And DataSets? It would seem there's something more modern. – jacoblambert Mar 03 '15 at 12:01
  • I'd probably use the Excel interop API: https://msdn.microsoft.com/en-us/library/microsoft.office.interop.excel.applicationclass – Jon Skeet Mar 03 '15 at 12:02
  • Great, I'll try to use that and post something this week. Thanks. – jacoblambert Mar 03 '15 at 12:23
  • @JonSkeet - I updated the question. Please let me know if that is more clear and if you can help. Perhaps I can't get much better than the attempt I posted and the answer is as simple as - there is no streaming an Excel file. But I don't know. Thanks again. – jacoblambert Mar 04 '15 at 14:12
  • Well I've just searched for the error message, and found http://stackoverflow.com/questions/1139390/excel-external-table-is-not-in-the-expected-format which sounds like it's relevant... – Jon Skeet Mar 04 '15 at 14:19
  • That works, once I install the Access Database Engine rovsen mentions in a comment on the answer to that link. To my surprise, even with Excel 2013. (I searched and there is no download on that page for 2013.) Still, it's very slow because I'm loading the entire file into memory. It looks like I have two viable answers to my question, and like I will convert to csv and use ReadAllLines whenever possible. Thank you, @JonSkeet. – jacoblambert Mar 04 '15 at 20:54

2 Answers2

0

I cobbled together a prototype based on Jon Skeet's recommendation to use Interop, and an answer by juliusz on How do I import from Excel to a DataSet using Microsoft.Office.Interop.Excel?

However, there are two things I really don't like about this: First, it seems I have to type in all the column names. Second, it loads the entire spreadsheet into memory rather than reading as a stream (takes a minute even for relatively small spreadsheets.

private static void Main(string[] args)
{
    var dataT = Import(@"C:\Users\jlambert\Desktop\dSmall_encrypted.xlsx");

    var data = dataT.AsEnumerable();
...
}
public static System.Data.DataTable Import(String path)
{
    var app = new Application();
    Workbook workBook = app.Workbooks.Open(path, 0, true, 5, "", "", true, XlPlatform.xlWindows, "\t", false, false, 0, true, 1, 0);

    Worksheet workSheet = (Worksheet)workBook.ActiveSheet;

    int index = 0;
    object rowIndex = 2;

    System.Data.DataTable dt = new System.Data.DataTable();
    dt.Columns.Add("Facility_code");                    
    dt.Columns.Add("MRN");
    dt.Columns.Add("first_name");
    dt.Columns.Add("middle_name");
    dt.Columns.Add("last_name");
    dt.Columns.Add("address_line_1");
    dt.Columns.Add("address_line_2");
    dt.Columns.Add("city");
    dt.Columns.Add("state");
    dt.Columns.Add("zip");
    dt.Columns.Add("date_of_birth");
    dt.Columns.Add("gender");
    dt.Columns.Add("ssn");
    dt.Columns.Add("home_phone");
    dt.Columns.Add("work_phone");
    dt.Columns.Add("cell_phone");
    dt.Columns.Add("PCP");
    dt.Columns.Add("Practice Location");

    DataRow row;

    while (((Range)workSheet.Cells[rowIndex, 1]).Value2 != null)
    {
        rowIndex = 2 + index;
        row = dt.NewRow();
        row[0] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 1]).Value2);
        row[1] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 2]).Value2);
        row[2] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 3]).Value2);
        row[3] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 4]).Value2);
        row[4] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 5]).Value2);
        row[5] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 6]).Value2);
        row[6] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 7]).Value2);
        row[7] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 8]).Value2);
        row[8] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 9]).Value2);
        row[9] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 10]).Value2);
        row[10] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 11]).Value2);
        row[11] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 12]).Value2);
        row[12] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 13]).Value2);
        row[13] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 14]).Value2);
        row[14] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 15]).Value2);
        row[15] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 16]).Value2);
        row[16] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 17]).Value2);
        row[17] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 18]).Value2);
        index++;
        dt.Rows.Add(row);
    }
    app.Workbooks.Close();
    return dt;
}
Community
  • 1
  • 1
jacoblambert
  • 787
  • 1
  • 11
  • 18
0

This is lightening fast; I can't tell I'm not doing text manipulations with ReadAllLines and Regex. (Removed the text manipulation details).

How I got it to work is explained in the comments to the original question.

Not sure why it's so much faster than Interop.Excel. Inefficient coding? More efficient API?

Any insight is appreciated!

        var path = string.Format(@"C:\Users\jlambert\Desktop\encryptedSSNs.xlsx");
        var connStr = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + path + ";Extended Properties=Excel 12.0;";


        var adapter = new OleDbDataAdapter("SELECT * FROM [sheetName$]", connStr);
        var ds = new DataSet();

        adapter.Fill(ds, "anyNameHere");

        var data = ds.Tables["anyNameHere"].AsEnumerable();

        var query = data.Where(x => x.Field<string>("MRN") != string.Empty).Select(x =>
            new 
            {
                mrn = x.Field<string>("MRN"),
                ssn = x.Field<string>("ssn"),
            });

        foreach (var q in query)
        {
            Console.WriteLine(q);    
        }
        Console.ReadLine();
jacoblambert
  • 787
  • 1
  • 11
  • 18