4

I am transforming an Excel spreadsheet into a list of "Elements" (this is a domain term). During this transformation, I need to skip the header rows and throw out malformed rows that cannot be transformed.

Now comes the fun part. I need to capture those malformed records so that I can report on them. I constructed a crazy LINQ statement (below). These are extension methods hiding the messy LINQ operations on the types from the OpenXml library.

var elements = sheet
    .Rows()                          <-- BEGIN sheet data transform
    .SkipColumnHeaders()
    .ToRowLookup()
    .ToCellLookup()
    .SkipEmptyRows()                 <-- END sheet data transform
    .ToElements(strings)             <-- BEGIN domain transform
    .RemoveBadRecords(out discard)
    .OrderByCompositeKey();

The interesting part starts at ToElements, where I transform the row lookup to my domain object list (details: it's called an ElementRow, which is later transformed into an Element). Bad records are created with just a key (the Excel row index) and are uniquely identifiable vs. a real element.

public static IEnumerable<ElementRow> ToElements(this IEnumerable<KeyValuePair<UInt32Value, Cell[]>> map)
{
    return map.Select(pair =>
    {
        try
        {
            return ElementRow.FromCells(pair.Key, pair.Value);
        }
        catch (Exception)
        {
            return ElementRow.BadRecord(pair.Key);
        }
    });
}

Then, I want to remove those bad records (it's easier to collect all of them before filtering). That method is RemoveBadRecords, which started like this...

public static IEnumerable<ElementRow> RemoveBadRecords(this IEnumerable<ElementRow> elements)
{
    return elements.Where(el => el.FormatId != 0);
}

However, I need to report the discarded elements! And I don't want to muddy my transform extension method with reporting. So, I went to the out parameter (taking into account the difficulties of using an out param in an anonymous block)

public static IEnumerable<ElementRow> RemoveBadRecords(this IEnumerable<ElementRow> elements, out List<ElementRow> discard)
{
    var temp = new List<ElementRow>();
    var filtered = elements.Where(el =>
    {
        if (el.FormatId == 0) temp.Add(el);
        return el.FormatId != 0;
    });

    discard = temp;
    return filtered;
}

And, lo! I thought I was hardcore and would have this working in one shot...

var discard = new List<ElementRow>();
var elements = data
    /* snipped long LINQ statement */
    .RemoveBadRecords(out discard)
    /* snipped long LINQ statement */

discard.ForEach(el => failures.Add(el));

foreach(var el in elements) 
{ 
    /* do more work, maybe add more failures */ 
}

return new Result(elements, failures);

But, nothing was in my discard list at the time I looped through it! I stepped through the code and realized that I successfully created a fully-streaming LINQ statement.

  1. The temp list was created
  2. The Where filter was assigned (but not yet run)
  3. And the discard list was assigned
  4. Then the streaming thing was returned

When discard was iterated, it contained no elements, because the elements weren't iterated over yet.

Is there a way to fix this problem using the thing I constructed? Do I have to force an iteration of the data before or during the bad record filter? Is there another construction that I've missed?

Some Commentary

Jon mentioned that the assignment /was/ happening. I simply wasn't waiting for it. If I check the contents of discard after the iteration of elements, it is, in fact, full! So, I don't actually have an assignment problem. Unless I take Jon's advice on what's good/bad to have in a LINQ statement.

Community
  • 1
  • 1
Anthony Mastrean
  • 21,850
  • 21
  • 110
  • 188

1 Answers1

7

When the statement was actually iterated, the Where clause ran and temp filled up, but discard was never assigned again!

It doesn't need to be assigned again - the existing list which will have been assigned to discard in the calling code will be populated.

However, I'd strongly recommend against this approach. Using an out parameter here is really against the spirit of LINQ. (If you iterate over your results twice, you'll end up with a list which contains all the bad elements twice. Ick!)

I'd suggest materializing the query before removing the bad records - and then you can run separate queries:

var allElements = sheet
    .Rows()
    .SkipColumnHeaders()
    .ToRowLookup()
    .ToCellLookup()
    .SkipEmptyRows()
    .ToElements(strings) 
    .ToList();

var goodElements = allElements.Where(el => el.FormatId != 0)
                              .OrderByCompositeKey();

var badElements = allElements.Where(el => el.FormatId == 0);

By materializing the query in a List<>, you only process each row once in terms of ToRowLookup, ToCellLookup etc. It does mean you need to have enough memory to keep all the elements at a time, of course. There are alternative approaches (such as taking an action on each bad element while filtering it) but they're still likely to end up being fairly fragile.

EDIT: Another option as mentioned by Servy is to use ToLookup, which will materialize and group in one go:

var lookup = sheet
    .Rows()
    .SkipColumnHeaders()
    .ToRowLookup()
    .ToCellLookup()
    .SkipEmptyRows()
    .ToElements(strings) 
    .OrderByCompositeKey()
    .ToLookup(el => el.FormatId == 0);

Then you can use:

foreach (var goodElement in lookup[false])
{
    ...
}

and

foreach (var badElement in lookup[true])
{
    ...
}

Note that this performs the ordering on all elements, good and bad. An alternative is to remove the ordering from the original query and use:

foreach (var goodElement in lookup[false].OrderByCompositeKey())
{
    ...
}

I'm not personally wild about grouping by true/false - it feels like a bit of an abuse of what's normally meant to be a key-based lookup - but it would certainly work.

Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • 3
    Another option would be `allElements.ToLookup(el => el.FormatId == 0);` Then you can just get the true/false value from the lookup for the good/bad items, but that would also materialize the whole query. – Servy Apr 05 '13 at 14:35
  • Hey, you're right about the assignment working! The problem was where I was checking the `discard` list. I'm updating the question to that effect. Thx for the answer, lots to chew on. – Anthony Mastrean Apr 05 '13 at 14:43
  • @Servy, please write that up as an answer with some more sample code. I'm interested. – Anthony Mastrean Apr 05 '13 at 14:43
  • @AnthonyMastrean I fail to see what other code would be needed beyond what I've already posted; and it's not going to be functionally different from Jon's code (unless the `Where` predicate were expensive, as I avoid computing it twice, but in this case it's not,). – Servy Apr 05 '13 at 14:45
  • @Servy, it's ok, I guess you have enough rep? I just saw your comment as an answer and thought you'd like to promote it. Thanks, Jon, for writing it up. – Anthony Mastrean Apr 05 '13 at 14:52
  • It's funny, Jon Skeet answers your question and no one else even bothers ;) – Anthony Mastrean Aug 13 '13 at 14:25