6

I've read a bunch of stuff on the web about how to get at cell data using the OpenXML API. But there's really not much out there that's particularly straightforward. Most seems to be about writing to SpreadsheetML, not reading... but even that doesn't help much. I've got a spreadsheet that has a table in it. I know what the table name is, and I can find out what sheet it's on, and what columns are in the table. But I can't figure out how to get a collection of rows back that contain the data in the table.

I've got this to load the document and get a handle to the workbook:

SpreadsheetDocument document = SpreadsheetDocument.Open("file.xlsx", false);
WorkbookPart workbook = document.WorkbookPart;

I've got this to find the table/sheet:

Table table = null;
foreach (Sheet sheet in workbook.Workbook.GetFirstChild<Sheets>())
{
    WorksheetPart worksheetPart = (WorksheetPart)document.WorkbookPart.GetPartById(sheet.Id);
    foreach (TableDefinitionPart tableDefinitionPart in worksheetPart.TableDefinitionParts)
    {
        if (tableDefinitionPart.Table.DisplayName == this._tableName)
        {
            table = tableDefinitionPart.Table;
            break;
        }
    }
}

And I can iterate over the columns in the table by foreaching over table.TableColumns.

Todd McDermid
  • 1,650
  • 11
  • 12

2 Answers2

4

To read an Excel 2007/2010 spreadsheet with OpenXML API is really easy. Somehow even simpler than using OleDB as we always did as quick & dirty solution. Moreover it's not just simple but verbose, I think to put all the code here isn't useful if it has to be commented and explained too so I'll write just a summary and I'll link a good article. Read this article on MSDN, it explain how to read XLSX documents in a very easy way.

Just to summarize you'll do this:

  • Open the SpreadsheetDocument with SpreadsheetDocument.Open.
  • Get the Sheet you need with a LINQ query from the WorkbookPart of the document.
  • Get (finally!) the WorksheetPart (the object you need) using the Id of the Sheet.

In code, stripping comments and error handling:

using (SpreadsheetDocument document = SpreadsheetDocument.Open(fileName, false))
{
   Sheet sheet = document.WorkbookPart.Workbook
      .Descendants<Sheet>()
      .Where(s => s.Name == sheetName)
      .FirstOrDefault();

   WorksheetPart sheetPart = 
      (WorksheetPart)(document.WorkbookPart.GetPartById(theSheet.Id));
}

Now (but inside the using!) what you have to do is just to read a cell value:

Cell cell = sheetPart.Worksheet.Descendants<Cell>().
    Where(c => c.CellReference == addressName).FirstOrDefault();

If you have to enumerate the rows (and they are a lot) you have first to obtain a reference to the SheetData object:

SheetData sheetData = sheetPart.Worksheet.Elements<SheetData>().First();

Now you can ask for all the rows and cells:

foreach (Row row in sheetData.Elements<Row>())
{
   foreach (Cell cell in row.Elements<Cell>())
   {
      string text = cell.CellValue.Text;
      // Do something with the cell value
   }
}

To simply enumerate a normal spreadsheet you can use Descendants<Row>() of the WorksheetPart object.

If you need more resources about OpenXML take a look at OpenXML Developer, it contains a lot of good tutorials.

Adriano Repetti
  • 65,416
  • 20
  • 137
  • 208
  • Thanks, Adriano - but how does the SheetData element relate to the Table element? I'd like to be able to know which Cells relate to which Columns... – Todd McDermid Mar 20 '12 at 23:18
  • Use the CellReference property of Cell. It contains the reference name of the cell (A1, for example). Do not stop with the few lines I wrote, I **really** suggest you take a look at the link on MSDN. There are many methods to read data: faster but harder, simpler but slower, easy but with some limitations... – Adriano Repetti Mar 20 '12 at 23:24
  • I'll read through it again, but I didn't see anything that related the Table construct with Cells on a sheet... that's what I'm looking for. I'd seen that article (and other interpretations of it) before, but I'm looking to have the Table object tell me where its cells are, not assume that the spreadsheet's cells are in a particular location. – Todd McDermid Mar 20 '12 at 23:34
  • I don't think I'm being clear. Unfortunately, Excel/SpreadsheetML is a "table", so it's confusing (and very, very hard to search for). What I'm talking about is the Excel (and SpreadsheetML) feature of converting ranges of cells to "tables" (http://msdn.microsoft.com/en-us/library/gg278320.aspx) that have more features than regular ranges. How do I relate those Tables with Cells/Sheets? – Todd McDermid Mar 21 '12 at 00:52
  • Ok, finally I understood what you mean! The **Table** element (http://msdn.microsoft.com/en-us/library/documentformat.openxml.spreadsheet.table.aspx). In the WorksheetPart you can enumerate the TableDefinitionPart object: foreach (TableDefinitionPart tablePart in sheet.TableDefinitionParts). – Adriano Repetti Mar 21 '12 at 08:56
  • Question: do you use the **OpenXML SDK 2.0 Productivity Tool** to inspect the XML? Usually relation XML <-> Objects is very strong. – Adriano Repetti Mar 21 '12 at 08:57
  • 1
    The tools helped some - but there's a lot of gaps in the documentation (IMO). Almost all of the documentation appears to be geared towards writing spreadsheets, not consuming data from them. I can understand that, but it's a pity... For those of you coming to this question later, I should have a post or two on my blog (eventually) on this topic (http://toddmcdermid.blogspot.com) under the "Microsoft Excel" tag. – Todd McDermid Mar 22 '12 at 17:26
  • You're right. The format is pretty complex and we would need a **lot** of documentation to use it well. It's a pity 'cause it's very good. I saw a tool to generate the C# code from an OpenXML file used as template, you may try to use it to understand how it's structured (but I don't remember the name!). – Adriano Repetti Mar 22 '12 at 20:51
  • 1
    I'm not quite sure why this is marked as an answer as it doesn't actually have anything about how to deal with Tables. – Chris Rae Mar 09 '17 at 20:03
  • @Chris I admit I agree, I see after few years that here I've not been clear at all. To read a table go through SheetData using cell references from your table but I stopped just after the first step (I don't know, maybe on that time I thought OP knew how to do it). Definitely need to be updated. – Adriano Repetti Mar 10 '17 at 08:55
1

There are probably many better ways to code this up, but I slapped this together because I needed it, so hopefully it will help some others.

using DocumentFormat.OpenXml.Spreadsheet;
using DocumentFormat.OpenXml.Packaging;

    private static DataTable genericExcelTable(FileInfo fileName)
    {
        DataTable dataTable = new DataTable();
        try
        {
            using (SpreadsheetDocument doc = SpreadsheetDocument.Open(fileName.FullName, false))
            {
                Workbook wkb = doc.WorkbookPart.Workbook;
                Sheet wks = wkb.Descendants<Sheet>().FirstOrDefault();
                SharedStringTable sst = wkb.WorkbookPart.SharedStringTablePart.SharedStringTable;
                List<SharedStringItem> allSSI = sst.Descendants<SharedStringItem>().ToList<SharedStringItem>();
                WorksheetPart wksp = (WorksheetPart)doc.WorkbookPart.GetPartById(wks.Id);

                foreach (TableDefinitionPart tdp in wksp.TableDefinitionParts)
                {
                    QueryTablePart qtp = tdp.QueryTableParts.FirstOrDefault<QueryTablePart>();
                    Table excelTable = tdp.Table;
                    int colcounter = 0;
                    foreach (TableColumn col in excelTable.TableColumns)
                    {
                        DataColumn dcol = dataTable.Columns.Add(col.Name);
                        dcol.SetOrdinal(colcounter);
                        colcounter++;
                    }

                    SheetData data = wksp.Worksheet.Elements<SheetData>().First();

                    foreach (DocumentFormat.OpenXml.Spreadsheet.Row row in data)
                    {
                        if (isInTable(row.Descendants<Cell>().FirstOrDefault(), excelTable.Reference, true))
                        {
                            int cellcount = 0;
                            DataRow dataRow = dataTable.NewRow();
                            foreach (Cell cell in row.Elements<Cell>())
                            {

                                if (cell.DataType != null && cell.DataType.InnerText == "s")
                                {
                                    dataRow[cellcount] = allSSI[int.Parse(cell.CellValue.InnerText)].InnerText;
                                }
                                else
                                {
                                    dataRow[cellcount] = cell.CellValue.Text;
                                }
                                cellcount++;
                            }
                            dataTable.Rows.Add(dataRow);
                        }
                    }
                }
            }
            //do whatever you want with the DataTable
            return dataTable;
        }
        catch (Exception ex)
        {
            //handle an error
            return dataTable;
        }
    }
    private static Tuple<int, int> returnCellReference(string cellRef)
    {
        int startIndex = cellRef.IndexOfAny("0123456789".ToCharArray());
        string column = cellRef.Substring(0, startIndex);
        int row = Int32.Parse(cellRef.Substring(startIndex));
        return new Tuple<int,int>(TextToNumber(column), row);
    }
    private static int TextToNumber(string text)
    {
        return text
            .Select(c => c - 'A' + 1)
            .Aggregate((sum, next) => sum * 26 + next);
    }
    private static bool isInTable(Cell testCell, string tableRef, bool headerRow){
        Tuple<int, int> cellRef = returnCellReference(testCell.CellReference.ToString());
        if (tableRef.Contains(":"))
        {
            int header = 0;
            if (headerRow)
            {
                header = 1;
            }
            string[] tableExtremes = tableRef.Split(':');
            Tuple<int, int> startCell = returnCellReference(tableExtremes[0]);
            Tuple<int, int> endCell = returnCellReference(tableExtremes[1]);
            if (cellRef.Item1 >= startCell.Item1
                && cellRef.Item1 <= endCell.Item1
                && cellRef.Item2 >= startCell.Item2 + header
                && cellRef.Item2 <= endCell.Item2) { return true; }
            else { return false; }
        }
        else if (cellRef.Equals(returnCellReference(tableRef)))
        {
            return true;
        } 
        else 
        {
            return false;
        }
    }
Kevin Pope
  • 2,964
  • 5
  • 32
  • 47