What architecture to use to address this SystemOutOfMemoryException while allowing me to instantiate the cells of a sheet?

Question

Summary

This question is the follow-up of a a desire to architect a simple spreadsheet API while keeping it user-friendly to those who know Excel well.

To sum it up, this question is related to these below two:
1. How to implement column self-naming from its index?;
2. How to make this custom worksheet initialization faster?.

Objective

To provide a simplified Excel API used as a wrapper over the nevralgic components such as the Application, the Workbook, the Worksheet and the Range classes/interfaces while exposing only the most commonly used object properties for each of these.

Usage example

This usage example is inspired from the unit tests that allowed me to bring this solution up to where it stands now.

Dim file as String = "C:\Temp\WriteTest.xls"

Using mgr As ISpreadsheetManager = New SpreadsheetManager()
    Dim wb as IWorkbook = mgr.CreateWorkbook()
    wb.Sheets("Sheet1").Cells("A1").Value = 3.1415926
    wb.SaveAs(file)
End Using

And now we open it:

Dim file as String = "C:\Temp\WriteTest.xls"

Using mgr As ISpreadsheetManager = New SpreadsheetManager()
    Dim wb as IWorkbook = mgr.OpenWorkbook(file)
    // Working with workbook here...
End Using

Discussion

While instantiating an Excel Workbook:

An instance of a Worksheet is automatically initialized in the Workbook.Sheets collection;
Upon initialization, a Worksheet initializes its Cells through the Range object that can represent one or multiple cells.

These Cells are immediately accessible with all their properties as soon as the Worksheet exists.

My wish is to reproduce this behaviour so that

The Workbook class constructor initializes the Workbook.Sheets collection property with the native sheets;
The Worksheet class constructor initializes the Worksheet.Cells collection property with the native cells.

My problem comes from the Worksheet class constructor while initializing the Worksheet.Cells collection property illustrated at #2.

Thoughts

Following these above-linked questions encountered issues, I wish to figure out another architecture that would allow me to:

Access specific feature of a cell Range when required;
Deliver most commonly used properties through my ICell interface;
Having access to all of the Range cells of a worksheet from its initialization.

While keeping in mind that accessing a Range.Value property is the fastest interaction possible with the underlying Excel application instance using the Interop.

So, I thought of initializing my ReadonlyOnlyDictionary(Of String, ICell) with the name of the cells without immediately wrapping an instance of the Range interface so that I would simply generate the row and column indexes along with the cell's name to index my dictionary, then, assigning the Cell.NativeCell property only when one wants to access or format a specific cell or cell range.

That way, the data in the dictionary would be indexed with the name of the cells obtained from the column indexes generated in the Worksheet class constructor. Then, when one would do this:

Using mgr As ISpreadsheetManager = New SpreadsheetManager()
    Dim wb As IWorkbook = mgr.CreateWorkbook()
    wb.Sheet(1).Cells("A1").Value = 3.1415926 // #1:
End Using

#1: This would allow me to use the indexes from my Cell class to write the given value to the specific cell, which is faster then using its name directly against the Range.

Questions and Concerns

Besides, when working with UsedRange.get_Value() or Cells.get_Value(), this returns Object(,) arrays.

1. So should I just be happy with working with Object(,) arrays for cells, without having the possibility to format it somehow?

2. How to architect these Worksheet and Cell classes so that I have the best performance offered while working with Object(,) arrays, while keeping the possibility that a Cell instance may represent or wrap a single cell Range?

Thanks to any of you who takes the time to read my post and my sincerest thanks to those who answer.

Why "Spreadsheet Manager"? Why not just "Spreadsheet"? If the class' role is so poorly defined that you can't think of a name for it other than "manage", then the class is likely doing to much. — Billy ONeal, Mar 03 '11 at 16:58
For that matter, what's wrong with the plain objects generated by COM interop? — Billy ONeal, Mar 03 '11 at 17:04
My `SpreadsheetManager` is responsible for launching and quitting the underlying instance of Excel which make it invisible to the user. Plus, it is used as a `Workbook` factory, as you can see when the `CreateWorkbook()` and `OpenWorkbook()` methods are called. — Will Marcouiller, Mar 03 '11 at 17:04
**What's wrong with the plain objects!?...** +1:Good question. I suppose I could simply expose the `Microsoft.Office.Interop.Excel.Worksheet.Cells` property as my `IWorksheet.Cells` property. The objective is to allow the user to address a cell with its address ("A1"), while keeping the performance of working with the row and column indexes under the hood. That is why I wanted to use a dictionary and index it with the cell's addresses, since I already expose the row and column indexes from within my Cell class. — Will Marcouiller, Mar 03 '11 at 17:07
Another thought is to use "batch-processing" when possible through the fastest access of the `Microsoft.Office.Interop.Excel.Worksheet.UsedRange.Value(XlRangeDataValue.XlRangeDataValueDefault)` method, the same with initialization or when I wish to write to the Worksheet object. To say the truth, I'm a bit confused that deep in my project. That is the reason for my consulting with the SO community so that I can breath out of the water a little and have the pulse of others. =) — Will Marcouiller, Mar 03 '11 at 17:15
I don't quite get what the API would be buying me. Is it just about simplifying the signatures, or are you trying to achieve something else? I vaguely get the sense that you want to create your cells dictionary to be able to access and iterate over individual cells fast. — Mathias, Mar 04 '11 at 18:32
@Mathias: The thing is simple. The API handles the underlying instance of the Excel Spreadsheet application so that you don't have to bother instantiating and release resources as it is wrapped into a disposable class that is also used as a workbook factory. So yes, it simplifies the signatures, plus, it also frees the user's mind to use the spreadsheet so that one can concentrate onto the essential. Hence, that is the expected purpose too as for the cells of a sheet, though I shall handle this differently now. I think of a CellCollection class that would provide the desired behaviour. — Will Marcouiller, Mar 07 '11 at 14:15

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

The used architecture has gone through an object class that I named CellCollection. Here's what it does:

Based on these hypothesis:

Given that an Excel worksheet has 256 columns and 65536 lines;

Given that 16,777,216 (256 * 65536) cells needed to be instantiated at a time;

Given that the most common use of a worksheet takes less then 1,000 lines and less than 100 columns;

Given that I needed it to be able to refer to the cells with their addresses ("A1"); and

Given that it is benchmarked that accessing all the values at once and load them into a object[,] in memory as being the fastest way to work with an underlying Excel worksheet,*

I have considered not to instantiate any of the cells, letting my CellCollection property within my IWorksheet interface initialized and empty upon instantiation, except for an existing workbook. So, when opening a workbook, I verify that NativeSheet.UsedRange is empty or return null (Nothing in Visual Basic), otherwise, I have already gotten the used "native cells" in memory so that only remains to add them in my internal CellCollection dictionary while indexing them with their respective address.

Finally, Lazy Initialization Design Pattern to the rescue! =)

public class Sheet : ISheet {
    public Worksheet(Microsoft.Office.Interop.Excel.Worksheet nativeSheet) {
        NativeSheet = nativeSheet;
        Cells = new CellCollection(this);
    }

    public Microsoft.Office.Interop.Excel.Worksheet NativeSheet { get; private set; }

    public CellCollection Cells { get; private set; }
}

public sealed class CellCollection {
    private IDictionary<string, ICell> _cells;
    private ReadOnlyDictionary<string, ICell> _readonlyCells;

    public CellCollection(ISheet sheet) {
        _cells = new Dictionary<string, ICell>();
        _readonlyCells = new ReadonlyDictionary<string, ICell>(_cells);
        Sheet = sheet;
    }

    public readonly ReadOnlyDictionary<string, ICell> Cells(string addresses) {
        get {
            if (string.IsNullOrEmpty(addresses) || 0 = address.Trim().Length)
                throw new ArgumentNullException("addresses");

            if (!Regex.IsMatch(addresses, "(([A-Za-z]{1,2,3}[0-9]*)[:,]*)"))
                throw new FormatException("addresses");

            foreach(string address in addresses.Split(",") {
                Microsoft.Office.Interop.Excel.Range range = Sheet.NativeSheet.Range(address)

                foreach(Microsoft.Office.Interop.Excel.Range cell in range) {
                    ICell c = null;
                    if (!_cells.TryGetValue(cell.Address(false, false), c)) { 
                        c = new Cell(cell);
                        _cells.Add(c.Name, c);
                    }
                }
            }

            return _readonlyCells;
        }
    }

    public readonly ISheet Sheet { get; private set; }
}

Obviously, this is a first try shot, and it works just fine so far, with more than acceptable performance. Humbly though, I feel like it could use some optimizations, though I will use it this way for now, and optimize it later if needed.

After having written this collection, I was able to come to the expected behaviour. Now, I shall try to implement some of the .NET interfaces to make it useable against some IEnumerable, IEnumerable<T>, ICollection, ICollection<T>, etc. so that it may respectively be considered as a true .NET collection.

Feel free to comment and bring constructive alternatives and/or changes to this code so that it may become even greater than it currently is.

I DO hope this will serve one's purpose someday.

Thanks for reading! =)

What architecture to use to address this SystemOutOfMemoryException while allowing me to instantiate the cells of a sheet?

1 Answers1

Linked