2

Following are the approaches I tried:

A) I tried to delete rows from an excel sheet using Microsoft.Office.Interop.Excel. I'm doing this in a script task within a SSIS package.

I added the library to the GAC, since it was raising an error : Could not load Library.

Now it's raises this error saying : Retrieving the COM class factory for component with CLSID {00024500-0000-0000-C000-000000000046} failed due to the following error: 80040154.

Googling this tells me I need MS Office installed for it to work, which I don't want coz the server I deploy this solution on is definitely not going to have MS Office installed on it. I'm no expert, but I would like to know why such operations are not possible, by simply adding reference to a dll? Why is it mandatory to install MS Office.

B) I also tried Oledb jet provider, but this one doesn't allow deleting of rows. The only operations it supports is Insert, Update and Select.

Things I have come across on the web:

A) A SO Questions' answer suggests to use Npoi, but I can't totally rely on that, because what's free library today can become paid in future.

B) Also I have come across EPP Plus library. I have used it and understand that it's based on a GNU public license, but I'm apprehensive on using it because it may become a paid tool in future.

C) I have also come across people using Open XML SDK by Microsoft. Before I get my hands dirty in this, I would love if someone up front tells me whether I should be using this. Not that I'm lazy to try it out myself but what what would be helpful to me before I start is, does this SDK need any external programs installed on the machine. Coz it requires me to install an msi to be able to us it.

Is there a work around to do this using Microsoft COM components? I'm not asking a subjective question here. I want to know technical obstacles, if any when I use the above three researched tools.

Thanks in advance

Community
  • 1
  • 1
Pavitar
  • 4,282
  • 10
  • 50
  • 82

4 Answers4

0

The point is with Interop that you indeed must have office installed. So bluntly said, you cannot use Interop. If you only need to support xlsx files, you can do it in xml.

See this and this link for more details about unpacking xlsx files, editing and repacking. The only thing you need than is something to unzip it and your own xml handling code.

If the requirement is to also support xls files you have a bit of a problem. I tried this in the past without any additional installations but did not succeed, so I decided to only support xlsx. I either needed some .msi files or office installed on the server.

Community
  • 1
  • 1
Mixxiphoid
  • 1,044
  • 6
  • 26
  • 46
0

You're saying that you are using a script task in SSIS; then why not import the excel file you want to delete the values from it (preferably into a database or keep it cached into a datatable) and then generate a new xls file with just the data you want to keep.

OR don't use the script task at all and use, inside a data flow, a configured excel source combined with a script component (which is basically the same thing as a script task just that you can use this one only in a data flow) and do all your work there. If you have a dynamic connection to the excel file, you can always use variables (parameters if you're on DataTools) to configure such a connection.

Good luck!

Sergiu
  • 432
  • 6
  • 18
0

If you want to use Microsoft.Office.Interop.Excel then, yes, you do need Excel on the server. Therefore, so long as you only want to deal with xlsx based workbooks / 2007+ then I would suggest that OpenXML is the way to go. It's a bit of a learning curve and you get to realise how much work Excel does for you in the background but is not too bad once you get used to it.

A very quick sample knocked up in LINQPad:

void Main()
{
    string fileName = @"c:\temp\delete-row-openxml.xlsx";

    using (SpreadsheetDocument doc = SpreadsheetDocument.Open(fileName, true))
    {
        // Get the necessary bits of the doc
        WorkbookPart workbookPart = doc.WorkbookPart;
        SharedStringTablePart sstpart = workbookPart.GetPartsOfType<SharedStringTablePart>().First();
        SharedStringTable sst = sstpart.SharedStringTable;

        // Get the first worksheet
        WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
        Worksheet sheet = worksheetPart.Worksheet;

        var rows = sheet.Descendants<Row>();

        foreach (Row row in rows.Where(r => ShouldDeleteRow(r, sst)))
        {
            row.Remove();
        }
    }
}

private bool ShouldDeleteRow(Row row, SharedStringTable sst)
{
    // Whatever logic to apply to decide whether to remove a row or not
    string txt = GetCellText(row.Elements<Cell>().FirstOrDefault(), sst);
    return (txt == "Row 3");
}

// Basic way to get the text of a cell - need to use the SharedStringTable
private string GetCellText(Cell cell, SharedStringTable sst)
{
    if (cell == null)
        return "";

    if ((cell.DataType != null) && (cell.DataType == CellValues.SharedString))
    {
        int ssid = int.Parse(cell.CellValue.Text);
        string str = sst.ChildElements[ssid].InnerText;
        return str;
    }
    else if (cell.CellValue != null)
    {
        return cell.CellValue.Text;
    }
    return "";
}

Note that this will clear the row not shuffle up all the other rows. To do that you'd need to provide some logic to adjust row indexes of the remaining rows.

To answer a little more of the OP question - the OpenXML msi is all that is needed apart from the standard .Net framework. The sample needs a reference to WindowsBase.dll for the packaging API and using statements for DocumentFormat.OpenXml.Packaging and DocumentFormat.OpenXml.Spreadsheet. The OpenXML API package can be referenced in VS via Nuget too so you don't even need to install the msi if you don't want. But it makes sense to do so IMHO.

One other item that you will find VERY useful is the OpenXML tools msi. This lets you open a Word or Excel doc and see the XML layout inside - most helpful.

shunty
  • 3,699
  • 1
  • 22
  • 27
  • this sounds interesting... can you really open an Excel file without having Excel and work with it.. I really want to try this –  May 06 '14 at 08:52
  • @me-how Yes, that's the point of OpenXML. The xlsx format is essentially a zipped up bunch of xml docs. OpenXML is not REQIURED but is a helpful layer over the top of standard xml. – shunty May 06 '14 at 09:23
  • @shunty I get This error -> 'System.Collections.Generic.IEnumerable' does not contain a definition for 'First' and no extension method 'First' accepting a first argument of type 'System.Collections.Generic.IEnumerable' could be found (are you missing a using directive or an assembly reference?) – Pavitar May 06 '14 at 11:09
  • @Pavitar It's a LINQ extension method. Add "using System.Linq;" to the source file (automatically included if you use LINQPad). Alternatively you can use "SharedStringTablePart sstpart = workbookPart.SharedStringTablePart;" in place of the offending line. – shunty May 06 '14 at 11:26
  • @Shunty Thanks - The problem here was that my project was targeting .Net 2.0 where as OpenXml needs .Net 3.5+ which also has linq. I want to delete a row at particular index, can you help me with some sample code.. – Pavitar May 06 '14 at 12:07
  • @Pavitar - Just use the RowIndex property of the row. eg: row = rows.FirstOrDefault(r => r.RowIndex == 11); Be aware, however, that the OpenXML format only stores information for elements that have data. In other words if the row is already empty and has not been used before then the row variable may well be null and so FirstOrDefault will return null. Just because there's a row with index 10 and index 14 doesn't mean there'll be one with row index = 12! – shunty May 06 '14 at 12:42
  • @shunty I have tried the above code. I have a strange problem. When my excel file is small in size, it successfully removes the row data, but when it's large it throws this error -> A first chance exception of type 'System.IO.IsolatedStorage.IsolatedStorageException' occurred in System.Xml.dll – Pavitar May 06 '14 at 13:54
  • No idea, sorry. Like I said, this is a quick and dirty sample to show the basic idea. Not meant to be production ready. But it could be something to do with file sharing / locking or perhaps multi-threading. – shunty May 07 '14 at 07:59
0

This is how I managed to remove rows in excel and move up the data

using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;


using (SpreadsheetDocument document = SpreadsheetDocument.Open(pathToFile, true))
{
   WorkbookPart wbPart = document.WorkbookPart;

   var worksheet = wbPart.WorksheetParts.First().Worksheet;
   var rows = worksheet.GetFirstChild<SheetData>().Elements<Row>();

   
   // Skip headers 
   foreach (var row in rows.Skip(1))
   {
      if (/* some condition on which rows to delete*/)
      {
         row.Remove();
      }
   }

   // Fix all row indexes
   string cr;
   for (int i = 2; i < rows.Count(); i++)
   {
      var newCurrentRowIndex = rows.ElementAt(i - 1).RowIndex.Value + 1;
      var currentRow = rows.ElementAt(i);

      currentRow.RowIndex.Value = updatedRowIndex;
      IEnumerable<Cell> cells = currentRow.Elements<Cell>().ToList();

      if (cells != null)
      {
         foreach (Cell cell in cells)
         {
            cr = cell.CellReference.Value;
            cr = Regex.Replace(cell.CellReference.Value, @"[\d-]", "");
            cell.CellReference.Value = $"{cr}{updatedRowIndex}";
         }
      }
   }

   worksheet.Save();
}
Tomaž Šuen
  • 131
  • 5