1

I have a collection of objects IEnumerable<object> obs. I have another collection of objects IEnumerable<object> data.

For each ob in obs I need to find the first item in data that has the same value in a certain property as ob. For example I could be looking for the first item in data that has the same ToString() value as ob. When the first item where the property values match is found, I do something with the found data item and then I check the next ob in obs. If none is found, I throw an error.

Here is a naive approach:

foreach (object ob in obs)
{
    foreach (object dataOb in data)
        if (ob.ToString() == dataOb.ToString())
        {
            ... // do something with dataOb
            goto ContinueOuter;
        }
    throw new Exception("No matching data found.");

    ContinueOuter: ;
}

The disadvantage is that I calculate dataOb.ToString() every time, which is unnecessary. I could cache it:

IDictionary<object, string> dataToDataStr = new Dictionary<object, string>();
foreach (object dataObj in data) // collect all ToString values in advance
    dataToDataStr.Add(dataObj, dataObj.ToString());

foreach (object ob in obs)
{
    foreach (object dataOb in dataToDataStr.Keys)
        if (ob.ToString() == dataToDataStr[dataOb])
        {
            ... // do something with dataOb
            goto ContinueOuter;
        }
    throw new Exception("No matching data found.");

    ContinueOuter: ;
}

The disadvantage is that I calculate all ToString() values even though it might not be necessary. I might find all matching data objects in the first half of the data collection.

How can I build up the dataToDataStr dictionary (or any other enumerable data structure that lets me retrieve both the object and its only-once-calculated ToString value) lazily?

Here is code (mixed with pseudocode) of what I have in mind:

IDictionary<object, string> dataToDataStr = new Dictionary<object, string>();
object lastProcessedDataOb = null;

foreach (object ob in obs)
{
    foreach (object dataOb in dataToDataStr.Keys)
        if (ob.ToString() == dataToDataStr[dataOb])
        {
            ... // do something with dataOb
            goto ContinueOuter;
        }

    foreach (object dataOb in data STARTING AFTER lastProcessedDataOb)
    // if lastProcessedDataOb == null, start with the first entry of data
    {
        dataToDataStr.Add(dataOb, dataOb.ToString();
        lastProcessedDataOb = dataOb;

        if (ob.ToString() == dataToDataStr[dataOb])
        {
            ... // do something with dataOb
            goto ContinueOuter;
        }
    }
    throw new Exception("No matching data found.");

    ContinueOuter: ;
}

I know it is easy if data was a LinkedList or any collection with indexed access (then I could store a linked list node or an index as lastProcessedDataOb), but it isn't - it is an IEnumerable. Maybe yield return can be used here?

Kjara
  • 2,504
  • 15
  • 42
  • Maybe you should describe why you compare your objects via ToString in the first place. That feels wrong. Why don't you use the standard way of implementing Equals/GetHashcode or a proper implementation of the IEquatable/IComparable Interfaces? The implementation might than contain a caching mechanisms if calculating is expensive. That would make life easier for the users of such classes. – Ralf Nov 15 '17 at 10:19
  • Your comment has nothing to do with the point of the question (building up enumerables lazily). Also, I did write "I need to find the first item in data that has the same value in a certain property as ob. FOR EXAMPLE [...] `ToString` [...]". – Kjara Nov 15 '17 at 10:25

2 Answers2

1

If your collections are really large and you really don't want to evaluate ToString for each item of data, you could use the following approach:

  1. Create cache of already calculated items
  2. If certain item is found i cache - that's great, we have a match.
  3. Otherwise - continue populating cache by iterating over data collection until we find a match. This can be efficiently be done with manually controlling **Enumerator** of data collection (instead of using foreach).

    IEnumerable<object> obs;
    IEnumerable<object> data;
    Dictionary<string, object> dataCache = new Dictionary<string, object>();
    
    var dataIterator = data.GetEnumerator();
    foreach (var ob in obs)
    {
        var obText = ob.ToString();
        object matchingDataItem = null;
        if (!dataCache.TryGetValue(obText, out matchingDataItem))
        {
            while (dataIterator.MoveNext())
            {
                var currentData = dataIterator.Current;
                var currentDataText = currentData.ToString();
                if (!dataCache.ContainsKey(currentDataText)) // Handle the case when data collection contains duplicates
                {
                    dataCache.Add(currentDataText, currentData);
                    if (currentDataText == obText)
                    {
                        matchingDataItem = currentData;
                        break;
                    }
                }
            }
        }
        if (matchingDataItem != null)
        {
            Console.WriteLine("Matching item found for " + obText);
        }
        else
        {
            throw new Exception("No matching data found.");
        }
    }
    

This way you can guarantee to iterate over data collection only to the point when all obs items are found and you won't evaluate ToString for each item more then once.

PS: I hope "ToString" is just for example and you have some complex calculations there, which is worth such complexities...

Sasha
  • 8,537
  • 4
  • 49
  • 76
  • There should be a check if `dataCache` already contains the key `currentDataText` before the `Add` call. In case the key exists, do a `continue`. – Kjara Nov 15 '17 at 12:33
  • You still check for string equality (`currentDataText == obText`) in case the key is a duplicate. This is not necessary as it has already been checked via`dataCache.TryGetValue(...)` earlier. – Kjara Nov 15 '17 at 13:09
  • Added that optimization as well, although it looks tiny. – Sasha Nov 15 '17 at 14:01
-1

Totally forgot that LINQ uses lazy evaluation... This should work (I use the new value tuple notation of C# 7):

IEnumerable<(object, string)> objStrPairs = data.Select(o => (o, o.ToString()));
foreach (object ob in obs)
{
    foreach ((object, string) dataPair in objStrPairs)
        if (ob.ToString() == objStrPairs.Item2)
        {
            ... // do something with objStrPairs.Item1
            goto ContinueOuter;
        }
    throw new Exception("No matching data found.");

    ContinueOuter: ;
}
Kjara
  • 2,504
  • 15
  • 42
  • Why are you even using `goto`? That's not deprecated only for compatibility reasons... – Camilo Terevinto Nov 15 '17 at 10:16
  • @CamiloTerevinto Java has the possibility to name a loop - e.g. the outer loop could be named "outer" - and then, deep inside a nested loop, call `continue outer`. Since C# doesn't have this feature, `goto` is the way to go. – Kjara Nov 15 '17 at 10:22
  • The common way to go is understanding that c# is not java. Normally, one would add a boolean variable to the outer loop to know what happened in the inner loop. You could also even use a single LINQ statement for this entire answer. – Camilo Terevinto Nov 15 '17 at 10:24
  • @CamiloTerevinto "You could also even use a single LINQ statement for this entire answer." Please share your answer then! – Kjara Nov 15 '17 at 10:26
  • @CamiloTerevinto Regarding your comment about `goto`, please have a look at https://stackoverflow.com/a/11913707/5333340 – Kjara Nov 15 '17 at 10:29
  • 1
    If you are afraid of calculating redundant items in collection (O(N)) complexity, why do you iterate over data collection again and again for each item, making it O(N*N) complexity even not getting to the end of collection... – Sasha Nov 15 '17 at 10:52