
Edit: I will add some benchmark results. Up to about 1,000 to 5,000 items in the list, IList with RemoveAt beats ISet with Remove, but that's nothing to worry about since the differences are marginal. The real fun begins when the collection size extends to 10,000 and more, so I'm posting only those numbers.

I was answering a question here last night and ran into a bizarre situation.

First, a set of simple helper methods:

static Random rnd = new Random();
public static int GetRandomIndex<T>(this ICollection<T> source)
{
    return rnd.Next(source.Count);
}

public static T GetRandom<T>(this IList<T> source)
{
    return source[source.GetRandomIndex()];
}
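
These are extension methods, so they need to live in a static class. A quick usage sketch (the values here are just illustrative):

var numbers = new List<int> { 10, 20, 30, 40 };

int index = numbers.GetRandomIndex(); // some index in 0..3
int item = numbers.GetRandom();       // one of 10, 20, 30, 40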

---

Let's say I'm removing N items from a collection at random. I would write this function:

public static void RemoveRandomly1<T>(this ISet<T> source, int countToRemove)
{
    int countToRemain = source.Count - countToRemove; 
    var inList = source.ToList();

    int i = 0;
    while (source.Count > countToRemain)
    {
        source.Remove(inList.GetRandom()); 
        i++;
    }
}

or

public static void RemoveRandomly2<T>(this IList<T> source, int countToRemove)
{
    int countToRemain = source.Count - countToRemove;

    int j = 0;
    while (source.Count > countToRemain)
    {
        source.RemoveAt(source.GetRandomIndex()); 
        j++; 
    }
}

As you can see, the first function is written for an ISet and the second for a normal IList. In the first function I'm removing by item from the ISet, and in the second by index from the IList, both of which I believe are O(1) operations. So why does the second function perform so much worse than the first, especially when the lists get bigger?

My take on the differences:

1) In the first function the ISet is converted to an IList (to get a random item from the IList), whereas no such conversion is performed in the second function.

Advantage IList.

2) In the first function a call to GetRandom is made (which in turn calls GetRandomIndex), whereas in the second, GetRandomIndex is called directly; that's one step fewer.

Though trivial, advantage IList.

3) In the first function, the random item is taken from a separate list, so the item obtained might already have been removed from the ISet. This leads to extra iterations of the while loop in the first function, whereas in the second function the random index is taken from the source being mutated, so there are never repeated iterations. I have tested and verified this; a sketch for checking it follows this list.

i > j always, advantage IList.
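
To check point 3 directly, the loop counters can be returned instead of discarded. A minimal instrumented sketch (these Counted variants are hypothetical, not the code I benchmarked):

public static int RemoveRandomly1Counted<T>(this ISet<T> source, int countToRemove)
{
    int countToRemain = source.Count - countToRemove;
    var inList = source.ToList();

    int i = 0;
    while (source.Count > countToRemain)
    {
        // Remove returns false when the randomly picked item was already
        // removed from the set; those draws are the wasted iterations.
        source.Remove(inList.GetRandom());
        i++;
    }
    return i; // >= countToRemove because of wasted draws
}

public static int RemoveRandomly2Counted<T>(this IList<T> source, int countToRemove)
{
    int countToRemain = source.Count - countToRemove;

    int j = 0;
    while (source.Count > countToRemain)
    {
        source.RemoveAt(source.GetRandomIndex()); // the index is always valid
        j++;
    }
    return j; // always exactly countToRemove
}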

I thought the reason for this behaviour was that a List needs constant resizing when items are added or removed. But apparently not, going by some other testing. I ran:

public static void Remove1(this ISet<int> set)
{
    int count = set.Count;
    for (int i = 0; i < count; i++)
    {
        set.Remove(i + 1);
    }
}

public static void Remove2(this IList<int> lst)
{
    for (int i = lst.Count - 1; i >= 0; i--)
    {
        lst.RemoveAt(i);
    }
}

and found that the second function runs faster.

Test bed:

var f = Enumerable.Range(1, 100000);

var s = new HashSet<int>(f);
var l = new List<int>(f);

Benchmark(() =>
{
    //some examples...

    s.RemoveRandomly1(2500);
    l.RemoveRandomly2(2500);

    s.Remove1();
    l.Remove2();

}, 1);

public static void Benchmark(Action method, int iterations = 10000)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < iterations; i++)
        method();

    sw.Stop();
    Console.WriteLine(sw.Elapsed.TotalMilliseconds); // report the elapsed time
}
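
As pointed out in the comments, this harness doesn't warm up the JIT, so the first iteration includes compilation cost. A minimal warmed-up variant looks like this (just a sketch; note that if the measured action mutates shared state, as the removal tests above do, the warm-up run will already have changed the data):

public static void BenchmarkWarm(Action method, int iterations = 10000)
{
    method(); // warm-up run so JIT compilation isn't included in the timing

    var sw = Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
        method();

    sw.Stop();
    Console.WriteLine(sw.Elapsed.TotalMilliseconds);
}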

Just trying to understand what's going on with the two structures. Thanks!

Results:

var f = Enumerable.Range(1, 10000);

s.RemoveRandomly1(7500); => 5ms
l.RemoveRandomly2(7500); => 20ms


var f = Enumerable.Range(1, 100000);

s.RemoveRandomly1(7500); => 7ms
l.RemoveRandomly2(7500); => 275ms


var f = Enumerable.Range(1, 1000000);

s.RemoveRandomly1(75000); => 50ms
l.RemoveRandomly2(75000); => 925000ms

For most typical needs, though, a list would do!

  • You don't quantify your results. We have no idea what "faster" means without some numbers. (For example, you don't warm up the VM, so it's hard to know whether that's the reason for your observed behavior.) – Kirk Woll Nov 24 '12 at 18:18
  • @KirkWoll +1, I will add a couple shortly. – nawfal Nov 24 '12 at 18:23
  • @KirkWoll I added a few results to the question. – nawfal Nov 24 '12 at 19:19

1 Answer


First off, IList and ISet aren't implementations of anything. I can write an IList or an ISet implementation that runs very differently, so the concrete implementations are what matter (List<T> and HashSet<T> in your case).

Accessing a List item by index is O(1), but removing it with RemoveAt is O(n).

Removing from the end of a List is fast because nothing has to be copied: the list just decrements the internal count that stores how many items it holds, so unused slots at the end of the underlying array appear as if they aren't there. When you hit the capacity of that array, the list allocates a new array of double the size and copies the elements over. Note that removals never shrink the array automatically; you have to call TrimExcess if you want the unused capacity back.
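
A quick way to see the count-versus-capacity distinction (a sketch; the exact capacity values depend on the runtime's growth policy):

var list = new List<int>(Enumerable.Range(1, 100000));
Console.WriteLine("{0} / {1}", list.Count, list.Capacity); // e.g. 100000 / 100000

// Removing from the end only decrements the internal count;
// the backing array keeps its size.
list.RemoveRange(50000, 50000);
Console.WriteLine("{0} / {1}", list.Count, list.Capacity); // 50000 / capacity unchanged

list.TrimExcess(); // explicit shrink
Console.WriteLine("{0} / {1}", list.Count, list.Capacity); // capacity trimmed to ~50000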

Randomly removing from a list means that it will have to copy all the array entries that come after the index so that they slide down one spot, which is inherently pretty slow, particularly as the size of the list gets bigger. If you have a List with 1 million entries, and you remove something at index 500,000, it has to copy the second half of the array down a spot.
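
To make the cost concrete, here is a simplified sketch of what a List<T>-style RemoveAt has to do internally (illustrative only, not the actual BCL source; _items and _size stand in for the list's private array and count fields):

public void RemoveAt(int index)
{
    if ((uint)index >= (uint)_size)
        throw new ArgumentOutOfRangeException("index");

    _size--;
    if (index < _size)
    {
        // Shift every element after 'index' one slot to the left:
        // this copy is what makes RemoveAt O(n - index).
        Array.Copy(_items, index + 1, _items, index, _size - index);
    }
    _items[_size] = default(T); // clear the vacated last slot
}

If ordering doesn't matter to you, a common workaround is to swap the element you picked with the last element and then remove from the end, which makes each random removal O(1).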

  • From MSDN docs on the remove method: This method is an O(n) operation, where n is (Count - index). – Mike Marynowski Nov 24 '12 at 18:28
  • That's a good observation, but do you have a reference? I can't believe an `IList` would have to copy the array after an index when an item is removed from that index. That's so unkind! :) – nawfal Nov 24 '12 at 18:29
  • `Remove` might be, but I am doing `RemoveAt`. That sure has to be O(1)! – nawfal Nov 24 '12 at 18:30
  • It's just a dynamically sized array. Think about it - how would you remove an item at a random index of an array without copying the items that come after it down a spot? – Mike Marynowski Nov 24 '12 at 18:31
  • Once you hit the max capacity of the underlying array it creates a new array double the size and copies the elements over. If you go below a certain threshold it will create an array half the size and copy the elements over. It tracks how large it is with a length property, so that unused slots appear like they aren't there. – Mike Marynowski Nov 24 '12 at 18:33
  • OK, I got it. But your comment `From MSDN docs on the removeat method: This method is an O(n) operation, where n is (Count - index)` is not correct, I believe. Can you post the link where it says so? – nawfal Nov 24 '12 at 18:35
  • I have edited your answer to incorporate a few points from your comments. Accepted :) – nawfal Nov 24 '12 at 19:18