1

I have a C# Array of objects that needs to stay in order that I'm trying to filter duplicates out of (not same object refference just same attribute values). The catch is the duplicate that has to go is the first one and the oldest needs to stay.

Current algorithm (semi pseudo code renamed everything) using IEnumerable

        object[] filter(object[] array)
        {
            var set = new HashSet<Guid>();
            var filtered = new List<object>();

            for (int i = array.Length; i-- > 0;)
            {
                var item = array[i];

                if (!set.Contains(item.ID))
                {
                    set.Add(item.ID);
                    filtered = new List<object>(filtered.Prepend(item));
                }
            }

            return filtered.ToArray();
        }

I know it is currently O(n) but I am looking for a very efficient way of doing this. If possible with just arrays so I don't need to use .ToArray() and iterate again.

I could just make filtered an array of size array.length and put it in backwards i.e. "filtered[array.length-i] = item" but I don't want to have empty values.

canton7
  • 37,633
  • 3
  • 64
  • 77
JasperDaDolphin
  • 172
  • 1
  • 9
  • 5
    You should just check if `set.Add(item.ID)` returns true, rather than checking `Contains` first. [Docs](https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.hashset-1.add?view=net-5.0): _"Returns boolean: true if the element is added to the HashSet object; false if the element is already present."_ – ProgrammingLlama Mar 29 '21 at 08:46
  • 2
    Sounds like you need a stack... – Sweeper Mar 29 '21 at 08:47
  • 1
    Given that you don't know how large the output will be, either you need to iterate twice, or you can't go straight for an array. Pick one. – canton7 Mar 29 '21 at 08:55

2 Answers2

3

Pushing to a stack can be thought of as adding to the start of a list, and popping from a stack can be thought of as removing an item from the start of a list.

Stack<T>.Push is a constant time operation as long as the stack has enough capacity, as the documentation says, so you can use a stack instead:

// using object[] doesn't make sense here as it doesn't have an ID property,
// so I have taken the liberty to create my interface
IHasID[] Filter(IHasID[] array)
{
    var set = new HashSet<Guid>();
    // if not many elements are expected to be filtered, giving the stack a initial capacity might be better
    var filtered = new Stack<IHasID>(/*array.Length*/);

    for (int i = array.Length; i-- > 0;)
    {
        var item = array[i];

        if (set.Add(item.ID))
        {

            filtered.Push(item);
        }
    }

    // ToArray creates an array in the pop order, O(n)
    // https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.stack-1.toarray?view=net-5.0#remarks
    return filtered.ToArray();
}

interface IHasID
{
    Guid ID { get; }
}
Sweeper
  • 213,210
  • 22
  • 193
  • 313
  • 1
    (This doesn't meet OP's requirement of "No `.ToArray()` if possible", but I don't think it is possible to meet that) – canton7 Mar 29 '21 at 08:58
  • @canton7 OP said "if possible", so it's not exactly a "requirement" :-) – Sweeper Mar 29 '21 at 08:59
  • Thank you Sweeper, I'll tick this as the answer. As I think about it I cant really get any better than O(n) * 2. I was trying to see if I could do this with just an array but I don't see another way. – JasperDaDolphin Mar 29 '21 at 09:00
  • Do you know what the max capacity of a stack is? – JasperDaDolphin Mar 29 '21 at 09:12
  • 1
    @AlexMika See the documentation of [`Stack`](https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.stack-1?view=net-5.0). You can specify a capacity when you create the stack. If you don't, some default, undocumented capacity is used. When that is full, it will increase the capacity automatically, by creating a bigger array and copy all its elements to it. This is similar to how `List` works. – Sweeper Mar 29 '21 at 09:18
1

Just use LINQ and it will be single O(n) CPU, O(n) RAM passthrough iterator without any further allocations:

var result = input.Reverse().DistinctBy(x=> x.YourKey);

Sample of implementation is here - LINQ's Distinct() on a particular property

You can also do same thing like this, cause all it does is just create group iterators:

var result = input.Reverse().GroupBy(x=> x.YourKey).Select(x=> x.First());
eocron
  • 6,885
  • 1
  • 21
  • 50
  • To meet OP's requirements, you need a `.Reverse().ToArray()` at the end. That's going to encure more allocations. – canton7 Mar 29 '21 at 10:32