0

The following code fails multiple enumeration because the existingNames hash set still contains the results of the last enumeration, thus the numeric suffixes are advanced more than is correct. What's an elegant way to soup up this method so that it works correctly upon multiple enumeration?

public static IEnumerable<TOutput> UniquifyNames<TSource, TOutput>(
   this IEnumerable<TSource> source,
   Func<TSource, string> nameSelector,
   Func<TSource, string, TOutput> resultProjection
) {
   HashSet<string> existingNames = new HashSet<string>();
   return source
      .Select(item => {
         string name = nameSelector(item);
         return resultProjection(
            item,
            Enumerable.Range(1, int.MaxValue)
               .Select(i => {
                  string suffix = i == 1
                     ? ""
                     : (name.EndsWithDigit() ? "-" : "") + i.ToString();
                  return $@"{name}{suffix}";
               })
               .First(candidateName => existingNames.Add(candidateName))
         );
      });
}

private static bool EndsWithDigit(this string value)
   => !string.IsNullOrEmpty(value) && "0123456789".Contains(value[value.Length - 1]);

I thought about creating an extension method such as UponEnumeration to wrap the outer enumerable, which would take a callback Action to run when enumeration began again (and which could be used to reset the HashSet). Is that a good idea?

I just realized it's not a good idea as stated, because the same resulting IEnumerable could be enumerated by different classes at the same time (begin enumerating in one place, while the other was still partway through enumeration, so things would break after resuming enumeration, because the HashSet got cleared). It sounds like the best thing to do is simply ToList() but I really would like to preserve lazy evaluation if possible.

ErikE
  • 48,881
  • 23
  • 151
  • 196
  • Small mistake, "s" does not exist, I bet you meant "item" as the item is of TSource. – ipavlu Feb 01 '16 at 22:42
  • Also, the whole algorithm based on hashset and then issues with Enumerable.Range.Select.First is really overcomplicated, not to mention making that for each element of the sequence? Try instead to use Dictionary, testing if the key, the name, exist. If not, then first, and must be added. If yes, incrementing stored value under the key. It would cut off all the fun with Enumerable.Range.Select.First(hashset.add), improve speed and created less garbage for the GC. – ipavlu Feb 01 '16 at 23:21
  • @ipavlu First, a `Dictionary` is a heavier-weight object than a `HashSet`. Feel free to answer with an alternate implementation, and we can speed test it. Second, your described implementation would fail unless you first stripped the numbers off the end of the names, as the set of keys `"Value", "Value2", "Value"` would blow up (my implementation would generate "`Value", "Value2", "Value3"` whereas yours would create the incorrect `"Value", "Value2", "Value2"` (since on the second `Value` you only checked `Value`). – ErikE Feb 01 '16 at 23:37
  • @ipavlu I did mean `item`, not `s`, thanks. I originally had `s` and changed it at the last minute... – ErikE Feb 01 '16 at 23:37
  • @ipavlu If `HashSet` and `Dictionary` aren't that much different, then your original point is completely lost! Using a `HashSet` is perfectly reasonable. Why dummy up a fake `Value` for the `KeyValuePair` when you don't even need that? All you need to know is if the item exists in the `HashSet`, or doesn't exist. Your advice is way off, in my opinion. If and when I have a performance problem, I'll test, locate the issue, and rewrite. In the meantime, you're suggesting doing contortions for imagined performance problems you don't have any knowledge of. I write for style, not the GC. – ErikE Feb 02 '16 at 00:33
  • @ipavlu Put your money where your mouth is and write an implementation with a dictionary and we'll see who is "still not getting it". (Hint: you!) Words are cheap. Action is called for. – ErikE Feb 03 '16 at 01:37
  • And for the record and future readers, ipavlu didn't understand my code and wrote an incorrect implementation... which I had *tried* to help him avoid with my *first* comment here in reply to his comments. – ErikE Feb 06 '16 at 02:05

2 Answers2

2

By making your code a deferred IEnumerable itself when other people run it multiple times it will also be run multiple times.

public static IEnumerable<TOutput> UniquifyNames<TSource, TOutput>(
   this IEnumerable<TSource> source,
   Func<TSource, string> nameSelector,
   Func<TSource, string, TOutput> resultProjection
) {
   HashSet<string> existingNames = new HashSet<string>();
   var items = source
      .Select(item => {
         string name = nameSelector(item);
         return resultProjection(
            item,
            Enumerable.Range(1, int.MaxValue)
               .Select(i => {
                  string suffix = i == 1
                     ? ""
                     : (name.EndsWithDigit() ? "-" : "") + i.ToString();
                  return $@"{name}{suffix}";
               })
               .First(candidateName => existingNames.Add(candidateName))
         );
      });
    foreach(TOutput item in items)
    {
        yield return item;
    }
}

Me personally, if I was doing this for real, I would "unroll" the LINQ queries and do their equivalents myself inside the foreach loop. Here is my first quick stab at changing it over.

public static IEnumerable<TOutput> UniquifyNames<TSource, TOutput>(
    this IEnumerable<TSource> source,
    Func<TSource, string> nameSelector,
    Func<TSource, string, TOutput> resultProjection
    )
{
    HashSet<string> existingNames = new HashSet<string>();
    foreach (TSource item in source)
    {
        string name = nameSelector(item);
        yield return resultProjection(item, GenerateName(name, existingNames));
    }
}

private static string GenerateName(string name, HashSet<string> existingNames)
{
    return Enumerable.Range(1, int.MaxValue)
        .Select(i =>
        {
            string suffix = i == 1
                ? ""
                : (name.EndsWithDigit() ? "-" : "") + i.ToString();
            return $@"{name}{suffix}";
        }).First(existingNames.Add);
}

Note that best practice for yielding/deferred IEnumerables is to check for null parameters in one method, and then return the result of the actual private implementation. This is so that the IEnumerable in the error case will throw immediately upon invocation/creation, instead of later once it is enumerated (possibly in code far away from the code that created it).

public static IEnumerable<TOutput> UniquifyNames<TSource, TOutput>(
   this IEnumerable<TSource> source,
   Func<TSource, string> nameSelector,
   Func<TSource, string, TOutput> resultProjection
) {
   if (source == null) {
      throw new ArgumentNullException(nameof(source));
   }
   if (nameSelector == null) {
      throw new ArgumentNullException(nameof(nameSelector));
   }
   if (resultProjection == null) {
      throw new ArgumentNullException(nameof(resultProjection));
   }
   return UniquifyNamesImpl(source, nameSelector, resultProjection);
}
ErikE
  • 48,881
  • 23
  • 151
  • 196
Scott Chamberlain
  • 124,994
  • 33
  • 282
  • 431
  • That's indeed a very simple way to make it multiple-enumerable! I learned something good today. This makes the answer I did come up with probably too heavy. – ErikE Feb 01 '16 at 22:23
  • @ErikE make sure you refresh, I tweaked "my version" a little more to make UniquifyNames simpler by moving the `.First()` in to `GenerateName` – Scott Chamberlain Feb 01 '16 at 22:27
0

I did come up with a way that works, but I don't know if it's good:

public class ResettingEnumerable<T> : IEnumerable<T> {
    private readonly Func<IEnumerable<T>> _enumerableFetcher;

    public ResettingEnumerable(Func<IEnumerable<T>> enumerableFetcher) {
        _enumerableFetcher = enumerableFetcher;
    }

    public IEnumerator<T> GetEnumerator() => _enumerableFetcher().GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

The body of UniquifyNames then turns into this:

return new ResettingEnumerable<TOutput>(() => {
   /* old body here */
};

After seeing Scott Chamberlain's answer, though, I think his idea is probably better: just write it as a yielding IEnumerable which has the property of being run anew each time GetEnumerator is called. This is a great general solution to the issue, where when multiple enumeration can't be tolerated, switch to a deferred IEnumerable.


For the record, I selected a slightly different final implementation. Initially, I wanted to preserve the lazy-evaluation aspect of an IEnumerable, where the set could be less than fully-enumerated and produce useful results. However, I realized that my goal of mutating any existing names as little as possible led me to choose a different algorithm that requires fully enumerating the list (to take all names as-is that can be taken, before beginning any numeric incrementing). Here is that solution for you:

private class NamedItem<TSource> {
   public TSource Item { get; set; }
   public string Name { get; set; }
}

private static bool EndsWithADigit(this string value) =>
   !string.IsNullOrEmpty(value) && "0123456789".Contains(value[value.Length - 1]);

private static string GetNumberedName(string name, int index) =>
   name + (index == 1 ? "" : name.EndsWithADigit() ? $"-{index}" : $"{index}");

private static bool ConditionalSetName<T>(
   NamedItem<T> namedItem, string name, HashSet<string> hashset
) {
   bool isNew = hashset.Add(name);
   if (isNew) { namedItem.Name = name; }
   return !isNew;
}

public static IEnumerable<TOutput> UniquifyNames<TSource, TOutput>(
   this IEnumerable<TSource> source,
   Func<TSource, string> nameSelector,
   Func<TSource, string, TOutput> resultProjection
) {
   var seen = new HashSet<string>();
   var result = source.Select((item, seq) => new NamedItem<TSource>{
      Item = item, Name = nameSelector(item)
   }).ToList();
   var remaining = result;
   int i = 1;
   do {
      remaining = remaining.Where(namedItem =>
         ConditionalSetName(namedItem, GetNumberedName(namedItem.Name, i++), seen)
      ).ToList();
   } while (remaining.Any());
   return result.Select(namedItem => resultProjection(namedItem.Item, namedItem.Name));
}

With this input:

"String2", "String", "String", "String3", "String3"

This gives the result:

"String2", "String", "String4", "String3", "String3-2"

This is better because the name String3 is left untouched.

My original implementation gave this result:

"String2", "String", "String3", "String3-2", "String3-3"

This is worse because it mutates the first String3 unnecessarily.

Community
  • 1
  • 1
ErikE
  • 48,881
  • 23
  • 151
  • 196
  • I'm holding my breath! Hurry up! – ErikE Feb 08 '16 at 03:17
  • (1) I spent some time on this and you change it, but ok, I proved that your first implementation can be written in a way that will be fast for few items or tens of thousands unique/nonunique items without burning the CPU for seconds/minutes:). Unfortunately, I do not have too much time, I am somehwat waiting on "GO" from the new employer, including moving to different city etc, so there is not much time to play with new implementation... – ipavlu Feb 08 '16 at 03:17
  • Regarding 1) did you performance-test Scott Chamberlain's improvement (two functions) with a simple for loop instead of the `Enumerable.Range`? Also, I still disagree. Clarity and conciseness win out over performance when the performance characteristics are not part of correctness. Correctness, here, does not require tens of thousands of names with thousands of duplicates, so the performance is totally adequate (a few milliseconds typically). A comment can be left on the method, even an xml comment, that indicates its performance should be checked under these type of high-number conditions. – ErikE Feb 08 '16 at 03:18
  • (2) The power of the enumerables are in the ability to skip some items, like with First(x => something(x)), but that is not most likely your case, just a note for fully enumerated sequences... – ipavlu Feb 08 '16 at 03:20
  • Regarding 2) Agreed, and while I wanted to maintain the lazy evaluation if possible, the correctness of the implementation won out and I chose to fully evaluate, like some LINQ methods do (such as `GroupBy` or `OrderBy`). – ErikE Feb 08 '16 at 03:22
  • (3) If you are fully enumerating the sequence, then it begs for parallelization at least as a choice for n > 100 for example or more like anything that takes more than 10ms for computation. That goes to ConcurrentDictionary, would be more efficient, but sure, it creates complexity:). – ipavlu Feb 08 '16 at 03:25