
We need to find formula errors in a large workbook of about 3 million cells, and performance is critical. We have implemented a wrapper around SpecialCells that catches the exception thrown when no error cells are found on a worksheet, similar to:

Handling "No cells were found." Error in Excel

The problem is that the check becomes painfully slow when the COMException is thrown. More concretely, when there is even one error on a worksheet, we've recorded execution times of between 1 and 4ms. When the exception is thrown, we're looking at around 80ms.

Does anyone have a creative, fast alternative for checking the non-existence of errors?

Our code:

try
{
  // SpecialCells throws a COMException when no matching cells exist on the sheet
  dynamic cellsWithErrors = cells.SpecialCells(cellType, valueType);
  return cellsWithErrors;
}
catch (COMException)
{
  // No error cells on this worksheet
  return null;
}

Notes:

  • Finding the errors when they exist is fine, so no need to focus on that.

  • I already found a moderate speedup by using dynamic types instead of Range types.

  • Our check iterates through all sheets in the entire workbook, so we'd be happy with any solution that's workbook-wide too, but we doubt one exists.

  • We're writing this as part of a VSTO AddIn and using the Interop layer.

Solutions Already Considered

We could read in the entire range as values and iterate through it in memory, but I was hoping for something else, because in the case where errors do exist this will be slower than SpecialCells.

Current Idea

So the conundrum boils down to this. On the one hand, SpecialCells is exceptionally fast when it does find results, but the thrown exception can slow it down by a factor of 20 to 80. On the other hand, reading the entire used range of a sheet is pretty consistent in terms of performance: I'd say it's linear in the number of cells with a small one-off constant cost. Now, by inspecting cell values it may be possible to find errors, using Mike Rosenblum's answer here: How to know if a cell has an error in the formula in C#. If so, then we'd have an alternative way to check for errors.
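To make that concrete, here is a minimal sketch (not our production code) of a read-and-scan check. It assumes the usual Interop behaviour that Range.Value2 returns numbers as doubles, text as strings, and error values as boxed Int32 CVErr codes, which is the property Mike Rosenblum's answer relies on; the method name is illustrative.

// Requires: using Excel = Microsoft.Office.Interop.Excel;
private static bool UsedRangeHasErrors(Excel.Worksheet sheet)
{
  Excel.Range usedRange = sheet.UsedRange;
  object value = usedRange.Value2;

  if (value is object[,] values)
  {
    // Multi-cell used range: Value2 gives a 2D array we can scan in memory
    foreach (object cell in values)
    {
      if (cell is int)  // error values come back as boxed Int32 CVErr codes
        return true;
    }
    return false;
  }

  // Single-cell (or empty) used range
  return value is int;
}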

The idea then is to write a master function that pivots on the size of the sheet in question. If it's a very large sheet (we have one with around 2 million cells, I think), then we might as well just use SpecialCells and accept the cost of an exception if it occurs. But for smaller sheets, where the cost of an exception outweighs the cost of reading in the entire range and iterating through it in memory, we should just do the latter.

The key thing we'd need to determine for that approach is what the crossing point is: where does SpecialCells begin to outperform read & iterate? We may also want to use our own knowledge of our workbook to heuristically guess which sheets are likely to have errors and which aren't. I'll post an answer if we develop one and I have time.

Colm Bhandal

1 Answer


It turns out that the performance statistics in the question were wrong: they were recorded running in debug mode. So the real answer to this question is: don't do your perf testing in debug mode. Duh. There must be some huge cost to exception handling in debug mode, though we noticed vast differences between my machine and my teammate's machine, so maybe it's down to some local debug settings of mine. I'm sure others would be able to comment more on this.

The true performance of the SpecialCells call was 4ms on average when running outside of debug mode. This was perfectly acceptable for our use case. However, we had already refactored our code by the time this was discovered, and in fact got a further performance increase...

Silver Lining

In the end, we went with a hybrid solution. If the sheet in question had more than 200,000 cells, we just used the SpecialCells call, costing 4ms on average. But if the sheet had fewer cells than that, we read in the entire used range and iterated through all its cells, checking for error types by looking at the values, in a similar vein to Mike Rosenblum's answer: How to know if a cell has an error in the formula in C#. The cost of looping through those < 200,000 cells was, on average, slightly cheaper for us than the call to SpecialCells. It brought our average down from 4ms to 3ms. Modest, but a speedup nonetheless.
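For illustration, here is a rough sketch of that hybrid check, assuming the SpecialCells arguments are xlCellTypeFormulas and xlErrors; the threshold constant and method name are ours, not part of any API.

// Requires: using System.Runtime.InteropServices;
//           using Excel = Microsoft.Office.Interop.Excel;
private const int SpecialCellsThreshold = 200_000;

private static bool SheetHasErrors(Excel.Worksheet sheet)
{
  Excel.Range usedRange = sheet.UsedRange;

  if (usedRange.Count > SpecialCellsThreshold)
  {
    // Large sheet: SpecialCells is fast, but throws when no error cells exist
    try
    {
      usedRange.SpecialCells(
        Excel.XlCellType.xlCellTypeFormulas,
        Excel.XlSpecialCellsValue.xlErrors);
      return true;
    }
    catch (COMException)
    {
      return false;
    }
  }

  // Small sheet: read the whole used range once and scan for Int32 CVErr codes
  object value = usedRange.Value2;
  if (value is object[,] values)
  {
    foreach (object cell in values)
    {
      if (cell is int)
        return true;
    }
    return false;
  }
  return value is int;
}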

Colm Bhandal