27

I have a very large list of integers (about 2 billion elements) and a list with indices (couple thousand elements) at which I need to remove elements from the first list. My current approach is to loop over all indices in the second list, passing each to the RemoveAt() method of the first list:

indices.Sort();
indices.Reverse();
for (i = 0; i < indices.Count; i++)
{
    largeList.RemoveAt(indices[i]);
}

However, it takes about 2 minutes to finish. I really need to perform this operation much faster. Is there any way of optimizing this?

I have a Intel i9X CPU with 10 cores, so maybe some way of parallel processing?

Lance U. Matthews
  • 15,725
  • 6
  • 48
  • 68
N K
  • 445
  • 4
  • 5
  • 3
    So... this *2 billion items* list - is it in memory or on disk? What is the intended result of this operation - a new list? a file on disk? – CoolBots Aug 19 '20 at 21:40
  • Parallel Linq can help you : https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/introduction-to-plinq – vernou Aug 19 '20 at 21:41
  • 1
    I've added code that i'm currently using. List is in the memory and as a result I also need a list in memory for further use in code. – N K Aug 19 '20 at 21:44
  • Why `indices.Reverse()`? – CoolBots Aug 19 '20 at 21:45
  • 4
    That's a large list... by which I mean: you're getting close to he vector limit even with GC-large-objects enabled. What are the characteristics of this data? Unique? Sorted? Does order matter? Could it be sensibly partitioned? If you can partition: you can parallelize. Otherwise, maybe some alternative storage format...? You can't "simply" parallelize, because of synchronization problems – Marc Gravell Aug 19 '20 at 21:45
  • 4
    @Vernou I don't think Parallel LINQ will help - `List` is not thread safe. – NetMage Aug 19 '20 at 21:46
  • 1
    @CoolBots Because when you remove the item at an indice, you will shift down all following items, so you want to remove from the end so indices won't reference the wrong items. – NetMage Aug 19 '20 at 21:47
  • At a minimum I would suggest reversing the `for` instead of using `List.Reverse()`. Does the final order of the large list matter? – NetMage Aug 19 '20 at 21:47
  • 1
    In particular, I'm wondering if you could just use a 250MiB bit mask to cover the 2 billion range; then you're just clearing bits by position, much quicker than shuffling - and much smaller data size too. Alternatively, there are some sparse vector formats, although frankly 2 billion integers is almost fully dense! – Marc Gravell Aug 19 '20 at 21:50
  • Items in largeList are not unique (that's why I'm using second list with indices) and not sorted. Also, the order of them has to be preserved. – N K Aug 19 '20 at 21:53
  • 2
    Maybe you could accumulate the effects of all those `RemoveAt()` calls in a single pass by iterating through each index of `largeList` starting at `indices[0]`? Everything from `indices[0] + 1` through `indices[1] - 1` gets shifted back one position using the indexer, everything from `indices[1] + 1` (pre-removal of `indices[0]`) through `indices[2] - 1` gets shifted back two positions, and so on. Then at the end call `RemoveRange()` to trim off the end of the list the same number of elements that were "removed". You'll have to double-check the math on that, but that's the general idea. – Lance U. Matthews Aug 19 '20 at 22:08
  • Related: [Remove list elements at given indices](https://stackoverflow.com/questions/9908564/remove-list-elements-at-given-indices) – Theodor Zoulias Aug 19 '20 at 23:30
  • 2
    The question is..... why do you keep 2 billions element in memory? That's about ~8G in memory for a list of int. How about using a database for dealing with data? – MLeblanc Aug 20 '20 at 00:08
  • 1
    What additional focus does this question require? A specific definition of "much faster"? If the votes are keying on the "if your question has many valid answers..." part of the close description, then couldn't _any_ [tag:performance] question fail that criteria? How should the author know how many ways there are to make their code faster? Don't ignore the "(but no way to determine which - if any - are correct)` that follows, though; there _is_ a way to evaluate the correctness — or, at least, merit — of an answer here: benchmarks, which is exactly what I did in my answer! – Lance U. Matthews Aug 24 '20 at 19:29
  • Another possibility is overwriting the deletion with an illegal value. If the requirements show that negative values are illegal, just changing the sign of an entry to 'remove' it may make overall operation faster, depending on the ratio of lookups to inserts/deletions. An occasional sweep, using one of the other suggested removal mechanisms, may be needed to trim the list size. – Martin James Oct 03 '20 at 06:09
  • 4
    This question has been under meta effect in early October 2020: https://meta.stackoverflow.com/questions/401743/why-did-this-performance-question-warrant-not-only-closure-but-deletion-too – Cœur Oct 04 '20 at 03:01
  • 3
    I would consider iterating over 2 billion items in 2 minutes *extremely fast*. What do you consider to be acceptable speed here? By what objective metric can a person consider an answer to acceptably solve your problem? Voting to close as 'needs details' until that is clarified, at least. – TylerH Oct 05 '20 at 17:53

6 Answers6

18

Each time RemoveAt() is called it has to move every element that comes after the specified index up by one element. If there are thousands of elements to remove in a very large list, that will result in many, many (needless) moves.

My thinking was that if you can calculate the start index and length of each move, you can process the list in a single pass with no overlapping moves. This is done by comparing adjacent elements in the list of indices to remove. While this does mean building a third List<> of move operations to perform, my hope is having the most efficient, minimal move strategy planned out in advance will pay off in the end (or perhaps there's a way to do that without allocating any objects that presently isn't occurring to me).

You can see my implementation, RemoveUsingMoveStrategy(), in the benchmark code below. Running the launcher program below in test mode, we can see it — as well as the other answers — produces the correct result when given indices 0, 1, 5, 10, 11, 15, 15 (repeated), 18, and 19 to remove from a 20-element List<int>...

PS> dotnet run --configuration release --framework netcoreapp3.1 -- test
[snip]
RemoveUsingMoveStrategy():
        Elements: { 2, 3, 4, 6, 7, 8, 9, 12, 13, 14, 16, 17 }
           Count: 12
        Sequence: Equal to control list
        Duration: 3.95 milliseconds
        Returned: Input list
[snip]

Benchmark notes

  • It was my intention to benchmark against a billion-element List<int>, but RemoveUsingRemoveAt() — inspired by the code in the question — is so inefficient that it was taking too long, so I only went up to 10 million elements.

  • I still hope to run a billion-element benchmark that excludes RemoveUsingRemoveAt(). For that reason, I introduced a not-quite-as-naive implementation called RemoveUsingListCopy() to serve as the baseline for comparison across all list sizes. As the name implies, it doesn't modify the input list but creates a new one with the removals applied.

  • Since not all implementations require the list of indices to remove to be unique and/or sorted, I figured the fairest way to benchmark would be to include that overhead in the benchmark time where the method requires it.

  • The only other changes I made to code from other answers was to make use of the benchmark's lists (DataList and RemovalList) and to change from var to explicit types for reader clarity.

  • RemovalListLocation indicates from where in DataList indices were removed.

    • For Beginning, Middle, and End it is a contiguous RemovalListSize-sized block removed from that location.
    • For Random it is RemovalListSize random, valid, unsorted, not-guaranteed-to-be-unique indices generated from a constant seed.

    To keep the results short(er), I opted to only benchmark the Middle — thinking it'd be a good middle-of-the-road compromise — and Random values.

Benchmark results

  • RemoveUsingRemoveAt() is terrible. Don't do that.

  • For the smaller list, RemoveUsingListCopy() is consistently the fastest, at the expense of doubling memory usage.

  • For the larger list, Vernou's answer is as least 5x as fast as the other answers, at the expense of relying on implementation details.

    This just goes to show that you shouldn't necessarily always favor List<> over an array — unless you need its additional capabilities — because it isn't always superior. It insulates the underlying array, preventing you (without reflection) from using more performant access methods like Array.Copy() and unsafe code on it.

  • Theodor Zoulias's answer does well with the smaller list, but I think having to "touch" all 10 million indices — each with a call to HashSet<>.Contains() and increment of an index variable — really hurt it with the larger list.

  • The remaining implementations (including mine) have roughly the same performance: not the best, but still pretty good.

Benchmark data

Running the launcher program defined later in this answer in benchmark mode, I get these results from BenchmarkDotNet...

PS> dotnet run --configuration release --framework netcoreapp3.1 -- benchmark
[snip]
// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.450 (2004/?/20H1)
Intel Core i7 CPU 860 2.80GHz (Nehalem), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.401
  [Host]    : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
  MediumRun : .NET Framework 4.8 (4.8.4200.0), X64 RyuJIT

Job=MediumRun  InvocationCount=1  IterationCount=15  
LaunchCount=2  UnrollFactor=1  WarmupCount=10  

|                        Method |       Runtime | DataListSize | RemovalListSize | RemovalListLocation |           Mean |         Error |        StdDev |         Median | Ratio | RatioSD |
|------------------------------ |-------------- |------------- |---------------- |-------------------- |---------------:|--------------:|--------------:|---------------:|------:|--------:|
|           RemoveUsingListCopy |      .NET 4.8 |        10000 |            2000 |              Random |       379.9 μs |      15.93 μs |      23.36 μs |       380.4 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt |      .NET 4.8 |        10000 |            2000 |              Random |     1,463.7 μs |      93.79 μs |     137.47 μs |     1,434.7 μs |  3.87 |    0.41 |
|       RemoveUsingMoveStrategy |      .NET 4.8 |        10000 |            2000 |              Random |       635.9 μs |      19.77 μs |      29.60 μs |       624.0 μs |  1.67 |    0.14 |
| Answer63496225_TheodorZoulias |      .NET 4.8 |        10000 |            2000 |              Random |       372.1 μs |      11.52 μs |      16.15 μs |       373.8 μs |  0.99 |    0.07 |
|       Answer63496768_CoolBots |      .NET 4.8 |        10000 |            2000 |              Random |       594.5 μs |      25.13 μs |      36.03 μs |       593.1 μs |  1.57 |    0.12 |
|        Answer63495657_NetMage |      .NET 4.8 |        10000 |            2000 |              Random |       618.8 μs |      17.53 μs |      23.99 μs |       622.4 μs |  1.65 |    0.13 |
|         Answer63496256_Vernou |      .NET 4.8 |        10000 |            2000 |              Random |       645.6 μs |      27.28 μs |      39.99 μs |       632.2 μs |  1.71 |    0.16 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy | .NET Core 3.1 |        10000 |            2000 |              Random |       391.5 μs |      10.39 μs |      15.55 μs |       391.6 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt | .NET Core 3.1 |        10000 |            2000 |              Random |     1,402.2 μs |      44.20 μs |      64.80 μs |     1,407.6 μs |  3.59 |    0.21 |
|       RemoveUsingMoveStrategy | .NET Core 3.1 |        10000 |            2000 |              Random |       557.9 μs |      19.73 μs |      27.00 μs |       557.2 μs |  1.43 |    0.10 |
| Answer63496225_TheodorZoulias | .NET Core 3.1 |        10000 |            2000 |              Random |       424.3 μs |      20.90 μs |      29.30 μs |       424.2 μs |  1.09 |    0.09 |
|       Answer63496768_CoolBots | .NET Core 3.1 |        10000 |            2000 |              Random |       535.0 μs |      19.37 μs |      27.16 μs |       537.1 μs |  1.37 |    0.08 |
|        Answer63495657_NetMage | .NET Core 3.1 |        10000 |            2000 |              Random |       557.7 μs |      18.73 μs |      25.63 μs |       550.0 μs |  1.43 |    0.09 |
|         Answer63496256_Vernou | .NET Core 3.1 |        10000 |            2000 |              Random |       554.2 μs |      13.82 μs |      18.45 μs |       554.0 μs |  1.42 |    0.07 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy |      .NET 4.8 |        10000 |            2000 |              Middle |       221.6 μs |       7.25 μs |      10.63 μs |       222.5 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt |      .NET 4.8 |        10000 |            2000 |              Middle |     1,195.3 μs |      20.01 μs |      28.69 μs |     1,187.7 μs |  5.42 |    0.30 |
|       RemoveUsingMoveStrategy |      .NET 4.8 |        10000 |            2000 |              Middle |       405.0 μs |      13.65 μs |      19.14 μs |       410.7 μs |  1.83 |    0.10 |
| Answer63496225_TheodorZoulias |      .NET 4.8 |        10000 |            2000 |              Middle |       206.3 μs |       8.62 μs |      12.09 μs |       204.9 μs |  0.94 |    0.08 |
|       Answer63496768_CoolBots |      .NET 4.8 |        10000 |            2000 |              Middle |       427.5 μs |      15.56 μs |      22.81 μs |       435.4 μs |  1.93 |    0.13 |
|        Answer63495657_NetMage |      .NET 4.8 |        10000 |            2000 |              Middle |       405.4 μs |      13.80 μs |      19.35 μs |       403.8 μs |  1.84 |    0.11 |
|         Answer63496256_Vernou |      .NET 4.8 |        10000 |            2000 |              Middle |       413.9 μs |      15.26 μs |      20.89 μs |       419.8 μs |  1.87 |    0.12 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy | .NET Core 3.1 |        10000 |            2000 |              Middle |       235.2 μs |      10.73 μs |      15.73 μs |       236.2 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt | .NET Core 3.1 |        10000 |            2000 |              Middle |     1,345.6 μs |      32.07 μs |      43.90 μs |     1,352.7 μs |  5.77 |    0.41 |
|       RemoveUsingMoveStrategy | .NET Core 3.1 |        10000 |            2000 |              Middle |       324.0 μs |       4.92 μs |       7.05 μs |       326.6 μs |  1.39 |    0.09 |
| Answer63496225_TheodorZoulias | .NET Core 3.1 |        10000 |            2000 |              Middle |       262.9 μs |       6.18 μs |       9.06 μs |       265.4 μs |  1.12 |    0.08 |
|       Answer63496768_CoolBots | .NET Core 3.1 |        10000 |            2000 |              Middle |       333.6 μs |      10.14 μs |      13.87 μs |       340.9 μs |  1.43 |    0.11 |
|        Answer63495657_NetMage | .NET Core 3.1 |        10000 |            2000 |              Middle |       313.5 μs |       9.05 μs |      12.69 μs |       310.5 μs |  1.34 |    0.11 |
|         Answer63496256_Vernou | .NET Core 3.1 |        10000 |            2000 |              Middle |       332.3 μs |       6.70 μs |       8.95 μs |       331.9 μs |  1.43 |    0.09 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy |      .NET 4.8 |     10000000 |            2000 |              Random |   253,977.1 μs |   2,721.70 μs |   3,989.43 μs |   253,809.0 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt |      .NET 4.8 |     10000000 |            2000 |              Random | 5,191,083.4 μs |  13,200.66 μs |  18,931.99 μs | 5,187,162.3 μs | 20.43 |    0.34 |
|       RemoveUsingMoveStrategy |      .NET 4.8 |     10000000 |            2000 |              Random |    65,365.4 μs |     422.41 μs |     592.16 μs |    65,307.3 μs |  0.26 |    0.00 |
| Answer63496225_TheodorZoulias |      .NET 4.8 |     10000000 |            2000 |              Random |   240,584.4 μs |   3,687.89 μs |   5,048.03 μs |   244,336.1 μs |  0.95 |    0.02 |
|       Answer63496768_CoolBots |      .NET 4.8 |     10000000 |            2000 |              Random |    54,168.4 μs |   1,001.37 μs |   1,436.14 μs |    53,390.3 μs |  0.21 |    0.01 |
|        Answer63495657_NetMage |      .NET 4.8 |     10000000 |            2000 |              Random |    72,501.4 μs |     452.46 μs |     634.29 μs |    72,161.2 μs |  0.29 |    0.00 |
|         Answer63496256_Vernou |      .NET 4.8 |     10000000 |            2000 |              Random |     5,814.0 μs |      89.71 μs |     128.67 μs |     5,825.3 μs |  0.02 |    0.00 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy | .NET Core 3.1 |     10000000 |            2000 |              Random |   239,784.0 μs |   2,721.35 μs |   3,902.88 μs |   241,125.5 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt | .NET Core 3.1 |     10000000 |            2000 |              Random | 5,538,337.5 μs | 353,505.30 μs | 495,565.06 μs | 5,208,226.1 μs | 23.12 |    2.15 |
|       RemoveUsingMoveStrategy | .NET Core 3.1 |     10000000 |            2000 |              Random |    33,071.8 μs |     103.80 μs |     138.57 μs |    33,030.5 μs |  0.14 |    0.00 |
| Answer63496225_TheodorZoulias | .NET Core 3.1 |     10000000 |            2000 |              Random |   240,825.5 μs |     851.49 μs |   1,248.11 μs |   240,520.9 μs |  1.00 |    0.02 |
|       Answer63496768_CoolBots | .NET Core 3.1 |     10000000 |            2000 |              Random |    26,265.0 μs |      90.76 μs |     124.23 μs |    26,253.0 μs |  0.11 |    0.00 |
|        Answer63495657_NetMage | .NET Core 3.1 |     10000000 |            2000 |              Random |    48,670.6 μs |     581.51 μs |     833.99 μs |    48,303.0 μs |  0.20 |    0.00 |
|         Answer63496256_Vernou | .NET Core 3.1 |     10000000 |            2000 |              Random |     5,905.5 μs |      96.27 μs |     131.78 μs |     5,915.1 μs |  0.02 |    0.00 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy |      .NET 4.8 |     10000000 |            2000 |              Middle |   153,776.2 μs |   2,454.90 μs |   3,674.38 μs |   152,872.0 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt |      .NET 4.8 |     10000000 |            2000 |              Middle | 5,245,952.0 μs |  13,845.58 μs |  20,294.67 μs | 5,252,922.4 μs | 34.10 |    0.81 |
|       RemoveUsingMoveStrategy |      .NET 4.8 |     10000000 |            2000 |              Middle |    33,233.6 μs |     110.33 μs |     158.24 μs |    33,217.3 μs |  0.22 |    0.01 |
| Answer63496225_TheodorZoulias |      .NET 4.8 |     10000000 |            2000 |              Middle |   128,949.8 μs |     560.72 μs |     804.17 μs |   128,724.9 μs |  0.84 |    0.02 |
|       Answer63496768_CoolBots |      .NET 4.8 |     10000000 |            2000 |              Middle |    48,965.1 μs |      70.75 μs |      94.45 μs |    48,957.3 μs |  0.32 |    0.01 |
|        Answer63495657_NetMage |      .NET 4.8 |     10000000 |            2000 |              Middle |    32,641.5 μs |      66.85 μs |      91.51 μs |    32,610.0 μs |  0.21 |    0.01 |
|         Answer63496256_Vernou |      .NET 4.8 |     10000000 |            2000 |              Middle |     2,982.2 μs |      29.47 μs |      41.31 μs |     2,961.9 μs |  0.02 |    0.00 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy | .NET Core 3.1 |     10000000 |            2000 |              Middle |   144,208.7 μs |   2,035.88 μs |   2,984.16 μs |   142,693.2 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt | .NET Core 3.1 |     10000000 |            2000 |              Middle | 5,235,957.7 μs |  13,674.19 μs |  20,043.46 μs | 5,241,536.1 μs | 36.32 |    0.78 |
|       RemoveUsingMoveStrategy | .NET Core 3.1 |     10000000 |            2000 |              Middle |    16,547.3 μs |      72.72 μs |     101.95 μs |    16,520.7 μs |  0.11 |    0.00 |
| Answer63496225_TheodorZoulias | .NET Core 3.1 |     10000000 |            2000 |              Middle |   137,218.2 μs |     716.45 μs |     980.69 μs |   137,027.0 μs |  0.95 |    0.02 |
|       Answer63496768_CoolBots | .NET Core 3.1 |     10000000 |            2000 |              Middle |    23,728.5 μs |      79.84 μs |     111.93 μs |    23,689.9 μs |  0.16 |    0.00 |
|        Answer63495657_NetMage | .NET Core 3.1 |     10000000 |            2000 |              Middle |    17,298.3 μs |     216.46 μs |     310.44 μs |    17,165.5 μs |  0.12 |    0.00 |
|         Answer63496256_Vernou | .NET Core 3.1 |     10000000 |            2000 |              Middle |     2,999.7 μs |      85.78 μs |     123.03 μs |     2,957.1 μs |  0.02 |    0.00 |
[snip]

Benchmark code

To see the various implementations, scroll a third of the way down and look for the methods decorated with [Benchmark()].

This requires the BenchmarkDotNet package.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using BenchmarkDotNet.Attributes;

namespace SO63495264
{
    public class Benchmarks
    {
        public enum RemovalLocation
        {
            Random,
            Beginning,
            Middle,
            End
        }

        [Params(10_000, 10_000_000)]
        public int DataListSize
        {
            get; set;
        }

        [Params(2_000)]
        public int RemovalListSize
        {
            get; set;
        }

        //[ParamsAllValues()]
        [Params(RemovalLocation.Random, RemovalLocation.Middle)]
        public RemovalLocation RemovalListLocation
        {
            get; set;
        }

        public List<int> DataList
        {
            get;
            set;
        }

        public IReadOnlyList<int> RemovalList
        {
            get;
            set;
        }

        [GlobalSetup()]
        public void GlobalSetup()
        {
            IEnumerable<int> removalIndices;

            if (RemovalListLocation == RemovalLocation.Random)
            {
                // Pass a constant seed so the same indices get generated for every run
                Random random = new Random(12345);

                removalIndices = Enumerable.Range(0, RemovalListSize)
                    .Select(i => random.Next(DataListSize));
            }
            else
            {
                int removalSegmentOffset = RemovalListLocation switch {
                    RemovalLocation.Beginning => 0,
                    RemovalLocation.Middle => (DataListSize - RemovalListSize) / 2,
                    RemovalLocation.End => DataListSize - RemovalListSize,
                    _ => throw new Exception($"Unexpected {nameof(RemovalLocation)} enumeration value {RemovalListLocation}.")
                };

                removalIndices = Enumerable.Range(removalSegmentOffset, RemovalListSize);
            }

            // For efficiency, create a single List<int> to be reused by all iterations for a given set of parameters
            DataList = new List<int>(DataListSize);
            RemovalList = removalIndices.ToList().AsReadOnly();
        }

        [IterationSetup()]
        public void IterationSetup()
        {
            // Each iteration could modify DataList, so repopulate its elements for the
            // next iteration.  DataList is either new or has been Clear()ed by IterationCleanup().
            for (int i = 0; i < DataListSize; i++)
                DataList.Add(i);
        }

        [IterationCleanup()]
        public void IterationCleanup()
        {
            DataList.Clear();
        }

        [GlobalCleanup()]
        public void GlobalCleanup()
        {
            // Force collection of the List<> for the current set of parameters
            int generation = GC.GetGeneration(DataList);

            DataList = null;
            GC.Collect(generation, GCCollectionMode.Forced, true);
        }

        [Benchmark(Baseline = true)]
        public List<int> RemoveUsingListCopy()
        {
            HashSet<int> removalSet = RemovalList.ToHashSet();
            List<int> newList = new List<int>(DataList.Count - removalSet.Count);

            for (int index = 0; index < DataList.Count; index++)
                if (!removalSet.Contains(index))
                    newList.Add(DataList[index]);

            return newList;
        }

        // Based on Stack Overflow question 63495264
        // https://stackoverflow.com/q/63495264/150605
        [Benchmark()]
        public List<int> RemoveUsingRemoveAt()
        {
            // The collection of indices to remove must be in decreasing order
            // with no duplicates; all are assumed to be in-range for DataList
            foreach (int index in RemovalList.Distinct().OrderByDescending(i => i))
                DataList.RemoveAt(index);

            return DataList;
        }

        // From Stack Overflow answer 63498191
        // https://stackoverflow.com/a/63498191/150605
        [Benchmark()]
        public List<int> RemoveUsingMoveStrategy()
        {
            // The collection of indices to remove must be in increasing order
            // with no duplicates; all are assumed to be in-range for DataList
            int[] coreIndices = RemovalList.Distinct().OrderBy(i => i).ToArray();
            List<(int startIndex, int count, int distance)> moveOperations
                = new List<(int, int, int)>();

            int candidateIndex;
            for (int i = 0; i < coreIndices.Length; i = candidateIndex)
            {
                int currentIndex = coreIndices[i];
                int elementCount = 1;

                // Merge removal of consecutive indices into one operation
                while ((candidateIndex = i + elementCount) < coreIndices.Length
                    && coreIndices[candidateIndex] == currentIndex + elementCount
                )
                {
                    elementCount++;
                }

                int nextIndex = candidateIndex < coreIndices.Length
                    ? coreIndices[candidateIndex]
                    : DataList.Count;
                int moveCount = nextIndex - currentIndex - elementCount;

                // Removing one or more elements from the end of the
                // list will result in a no-op, 0-length move; omit it
                if (moveCount > 0)
                {
                    moveOperations.Add(
                        (
                            startIndex: currentIndex + elementCount,
                            count: moveCount,
                            distance: i + elementCount
                        )
                    );
                }
            }

            // Perform the element moves
            foreach ((int startIndex, int count, int distance) in moveOperations)
            {
                for (int offset = 0; offset < count; offset++)
                {
                    int sourceIndex = startIndex + offset;
                    int targetIndex = sourceIndex - distance;

                    DataList[targetIndex] = DataList[sourceIndex];
                }
            }

            // "Trim" the number of removed elements from the end of the list
            DataList.RemoveRange(DataList.Count - coreIndices.Length, coreIndices.Length);

            return DataList;
        }

        // Adapted from Stack Overflow answer 63496225 revision 4
        // https://stackoverflow.com/revisions/63496225/4
        [Benchmark()]
        public List<int> Answer63496225_TheodorZoulias()
        {
            HashSet<int> indicesSet = new HashSet<int>(RemovalList);
            int index = 0;

            DataList.RemoveAll(_ => indicesSet.Contains(index++));

            return DataList;
        }


        // Adapted from Stack Overflow answer 63496768 revision 2
        // https://stackoverflow.com/revisions/63496768/2
        [Benchmark()]
        public List<int> Answer63496768_CoolBots()
        {
            List<int> sortedIndicies = RemovalList.Distinct().OrderBy(i => i).ToList();

            int sourceStartIndex = 0;
            int destStartIndex = 0;
            int spanLength = 0;

            int skipCount = 0;

            // Copy items up to last index to be skipped
            foreach (int skipIndex in sortedIndicies)
            {
                spanLength = skipIndex - sourceStartIndex;
                destStartIndex = sourceStartIndex - skipCount;

                for (int i = sourceStartIndex; i < sourceStartIndex + spanLength; i++)
                {
                    DataList[destStartIndex] = DataList[i];
                    destStartIndex++;
                }

                sourceStartIndex = skipIndex + 1;
                skipCount++;
            }

            // Copy remaining items (between last index to be skipped and end of list)
            spanLength = DataList.Count - sourceStartIndex;
            destStartIndex = sourceStartIndex - skipCount;

            for (int i = sourceStartIndex; i < sourceStartIndex + spanLength; i++)
            {
                DataList[destStartIndex] = DataList[i];
                destStartIndex++;
            }

            DataList.RemoveRange(destStartIndex, sortedIndicies.Count);

            return DataList;
        }

        // Adapted from Stack Overflow answer 63495657 revision 6
        // https://stackoverflow.com/revisions/63495657/6
        [Benchmark()]
        public List<int> Answer63495657_NetMage()
        {
            List<int> removeAtList = RemovalList.Distinct().OrderBy(i => i).ToList();

            int srcCount = DataList.Count;
            int ralCount = removeAtList.Count;
            int removeAtIndice = 1;
            int freeIndex = removeAtList[0];
            int current = freeIndex + 1;
            while (current < srcCount)
            {
                while (removeAtIndice < ralCount && current == removeAtList[removeAtIndice])
                {
                    ++current;
                    ++removeAtIndice;
                }

                if (current < srcCount)
                    DataList[freeIndex++] = DataList[current++];
            }

            DataList.RemoveRange(freeIndex, srcCount - freeIndex);

            return DataList;
        }

        // Adapted from Stack Overflow answer 63496256 revision 3
        // https://stackoverflow.com/revisions/63496256/3
        [Benchmark()]
        public List<int> Answer63496256_Vernou()
        {
            List<int> indices = RemovalList.Distinct().OrderBy(i => i).ToList();

            //Get the internal array
            int[] largeArray = (int[]) typeof(List<int>)
                .GetField("_items", BindingFlags.Instance | BindingFlags.NonPublic)
                .GetValue(DataList);

            int current = 0;
            int copyFrom = 0;

            for (int i = 0; i < indices.Count; i++)
            {
                int copyTo = indices[i];
                if (copyTo < copyFrom)
                {
                    //In case the indice is duplicate,
                    //The item is already passed
                    continue;
                }
                int copyLength = copyTo - copyFrom;
                Array.Copy(largeArray, copyFrom, largeArray, current, copyLength);
                current += copyLength;
                copyFrom = copyTo + 1;
            }
            //Resize the internal array
            DataList.RemoveRange(DataList.Count - indices.Count, indices.Count);

            return DataList;
        }
    }
}

Launcher code

RunBenchmark() defines the type of benchmark job to run and for which runtime(s).

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Reflection;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;

namespace SO63495264
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args?.Length == 0)
            {
                string assemblyFilePath = Assembly.GetExecutingAssembly().Location;
                string assemblyFileName = System.IO.Path.GetFileName(assemblyFilePath);

                Console.WriteLine($"{assemblyFileName} {{ benchmark | test }}");
            }
            else if (string.Equals(args[0], "benchmark", StringComparison.OrdinalIgnoreCase))
                RunBenchmark();
            else if (string.Equals(args[0], "test", StringComparison.OrdinalIgnoreCase))
                RunTest();
            else
                Console.WriteLine($"Unexpected parameter \"{args[0]}\"");
        }

        static void RunBenchmark()
        {
            Job baseJob = Job.MediumRun;
            IConfig config = DefaultConfig.Instance;

            foreach (Runtime runtime in new Runtime[] { CoreRuntime.Core31, ClrRuntime.Net48 })
            {
                config = config.AddJob(
                    baseJob.WithRuntime(runtime)
                );
            }

            BenchmarkDotNet.Running.BenchmarkRunner.Run<Benchmarks>(config);
        }

        static void RunTest()
        {
            const int ListSize = 20;
            const int MaxDisplayElements = 20;

            IEnumerable<int> data = Enumerable.Range(0, ListSize);
            IReadOnlyList<int> indices = new List<int>(
                new int[] {
                    0,                               1, // First two indices
                    ListSize / 4,
                    ListSize / 2,     ListSize / 2 + 1, // Consecutive indices
                    ListSize / 4 * 3, ListSize / 4 * 3, // Duplicate indices
                    ListSize - 2,     ListSize - 1      // Last two indices
                }
            ).AsReadOnly();

            // Discover and invoke the benchmark methods the same way BenchmarkDotNet would
            Benchmarks benchmarks = new Benchmarks() {
                RemovalList = indices
            };
            IEnumerable<MethodInfo> benchmarkMethods = benchmarks.GetType()
                .GetMethods(BindingFlags.Instance | BindingFlags.Public)
                .Where(
                    method => method.CustomAttributes.Any(
                        attribute => attribute.AttributeType == typeof(BenchmarkAttribute)
                    )
                );

            // Call a known-good method to get the correct results for comparison
            benchmarks.DataList = data.ToList();
            List<int> controlList = benchmarks.RemoveUsingListCopy();

            foreach (MethodInfo benchmarkMethod in benchmarkMethods)
            {
                List<int> inputList = data.ToList();

                benchmarks.DataList = inputList;
                Stopwatch watch = Stopwatch.StartNew();
                List<int> outputList = (List<int>) benchmarkMethod.Invoke(benchmarks, Array.Empty<object>());
                watch.Stop();

                Console.WriteLine($"{benchmarkMethod.Name}():");
                Console.WriteLine($"\tElements: {{ {string.Join(", ", outputList.Take(MaxDisplayElements))}{(outputList.Count > MaxDisplayElements ? ", ..." : "") } }}");
                Console.WriteLine($"\t   Count: {outputList.Count:N0}");
                Console.WriteLine($"\tSequence: {(outputList.SequenceEqual(controlList) ? "E" : "*Not* e")}qual to control list");
                Console.WriteLine($"\tDuration: {watch.Elapsed.TotalMilliseconds:N2} milliseconds");
                Console.WriteLine($"\tReturned: {(object.ReferenceEquals(outputList, inputList) ? "Input list" : "New list")}");
            }
        }
    }
}

In order to benchmark on .NET Framework (4.8) I had to add the following properties to my .NET Core .csproj project file:

<TargetFrameworks>netcoreapp3.1;net48</TargetFrameworks>
<LangVersion>8.0</LangVersion>

41 characters to spare!

Lance U. Matthews
  • 15,725
  • 6
  • 48
  • 68
17

Since the order of the source list is significant, you can move each item down the list, skipping indices to be removed, and then remove the end of the list.

UPDATE: Took the .Net Core source code for RemoveAll and modified it for a indice list instead of a predicate.

UPDATE 2: Optimized to not repeat tests if possible.

UPDATE 3: Simplified as optimization having extra code proved slower in benchmarks.

Having src as the large list, and removeAtList as the indices to remove in some random order, you can do:

removeAtList.Sort();

var srcCount = src.Count;
var ralCount = removeAtList.Count;
var removeAtIndice = 1;
var freeIndex = removeAtList[0];
var current = freeIndex+1;
while (current < srcCount) {
    while (removeAtIndice < ralCount && current == removeAtList[removeAtIndice]) {
        ++current;
        ++removeAtIndice;
    }

    if (current < srcCount)
        src[freeIndex++] = src[current++];
}

src.RemoveRange(freeIndex, srcCount-freeIndex);

For a one billion element list of random integers, and a 1000 - 3000 element list of random indices, I get 1.1 ms per remove with this algorithm. Using RemoveAt, I get over 232.77 ms per remove, so this is about 200 times faster.

NetMage
  • 26,163
  • 3
  • 34
  • 55
  • So, are you saying your total running time for `1_000_000_000` values and, say, 2000 indicies to remove would be 1.14ms x 2000 = 2.28s? I'm getting ~26 seconds with your solution... which is still much faster than anything I came up with so far, lol! – CoolBots Aug 19 '20 at 23:46
  • @CoolBots Yes - it is taking about 12 seconds to create the list, then about 3.2 seconds to remove 2000 indices - my per millisecond time is verying from 1.2 to 3 ms using .Net Core 3.1 in LINQPad 6. – NetMage Aug 20 '20 at 00:07
  • @CoolBots I updated my answer by modifying the code from .Net Core 3.1 `List.RemoveAll` for a skip indices list and optimizing. (PS LINQPad 6 x64.) – NetMage Aug 20 '20 at 00:14
  • It's faster than it was, but I can't get to the times you're claiming. I get about 12 seconds runtime for your version. I just came up with a solution that runs in under 5 seconds, will post. – CoolBots Aug 20 '20 at 00:26
  • @CoolBots Nice. That could also be because I am running on a i7-4790 with 32GB of RAM under Windows 10 x64 – NetMage Aug 20 '20 at 00:27
  • Interesting, your spec is better than mine, especially RAM - can you try my version on your machine? – CoolBots Aug 20 '20 at 00:37
6

One way to allow this to be parallelized would be to break the list into multiple fragments; perhaps initially (arbitrarily) separate slabs of 1 million elements. As long as each slab maintains its own count, you can split the work by index into removals from different slabs (based purely on the counts), and then do the actual removal work concurrently. If you leave some spare capacity in each, you can also add elements into the middle more cheaply, as you are usually only touching one slab. Random access will be a little slower, as you may need to look at multiple slab counts to determine the correct slab, but if the slab counts are maintained in a contiguous vector (rather than against each slab), you should have excellent memory cache hit while doing it.

Marc Gravell
  • 1,026,079
  • 266
  • 2,566
  • 2,900
  • Or to parallelize with the same flat linear data structure: If you have a sorted unique list of indices to remove, you can calculate the output index for a given input index: starting at `indices[i]`, there will have been `i` removals before this. If `indices[]` is small, you can scan through it to distribute block copying work across threads, and have all threads copy into a single array. – Peter Cordes Oct 05 '20 at 08:03
6

This answer is based on other answers here - mainly, I am shifting elements up within the list, as suggested by @Vernou (in their answer) and @BACON (in comments). This one is finally performant (unlike my first few approaches), and is faster than other solutions posted so far, at least in my tests - I tried OP's setup of 2_000_000_000 entries and 2_000 indicies - runtime is under 10 seconds on my laptop (i7-8550U @ 1.8GHz, 16GB RAM):

static void FilterOutIndicies(List<int> values, List<int> sortedIndicies)
{
    int sourceStartIndex = 0;
    int destStartIndex = 0;
    int spanLength = 0;

    int skipCount = 0;

    // Copy items up to last index to be skipped
    foreach (var skipIndex in sortedIndicies)
    {
        spanLength = skipIndex - sourceStartIndex;
        destStartIndex = sourceStartIndex - skipCount;

        for (int i = sourceStartIndex; i < sourceStartIndex + spanLength; i++)
        {
            values[destStartIndex] = values[i];
            destStartIndex++;
        }

        sourceStartIndex = skipIndex + 1;
        skipCount++;
    }

    // Copy remaining items (between last index to be skipped and end of list)
    spanLength = values.Count - sourceStartIndex;
    destStartIndex = sourceStartIndex - skipCount;

    for (int i = sourceStartIndex; i < sourceStartIndex + spanLength; i++)
    {
        values[destStartIndex] = values[i];
        destStartIndex++;
    }

    values.RemoveRange(destStartIndex, sortedIndicies.Count);
}
CoolBots
  • 4,770
  • 2
  • 16
  • 30
  • 1
    With a fixed `Random`, I get 1.22 ms for mine and 0.86 ms for yours, very nice! – NetMage Aug 20 '20 at 00:49
  • A quick check says my CPU should be about 23% faster than yours. – NetMage Aug 20 '20 at 00:55
  • @NetMage thanks for checking my version on your PC. Yeah, I think it's the faster CPU and double the RAM that's really helping, due to the array size. Maybe time for an upgrade for me, lol! – CoolBots Aug 20 '20 at 03:05
  • On my computer, NetMage solution take 22 seconds, CoolBots solution take 15 seconds. My solution take 2 seconds, but I use reflection. All solutions are equivalent in memory usage. – vernou Aug 20 '20 at 08:48
  • 1
    The semi-last “Copy remaining items” step is obsolete. You just need to use the right range for the `RemoveRange` call. – Holger Oct 06 '20 at 08:03
4

The method List.RemoveAt copy all next items from the removed item. In your case, this copy 2,000 * 2,000,000,000 times each items (not really, but the true is near).

A solution is manually copy item between the removed item and the next removed item :

static void Main(string[] args)
{
    var largeList = Enumerable.Range(0, 2_000_000_000).ToList();

    var indices = new List<int>();
    var rand = new Random();
    for (var i = 0; i < 20000; i++)
    {
        indices.Add(rand.Next(0, largeList.Count - 1));
    }
    indices.Sort();

    var watch = new Stopwatch();
    watch.Start();

    // You can convert the list to array with ToArray,
    // but this duplicate the memory use.
    // Or get the internal array by reflection,
    // but reflection on external library isn't recommended
    var largeArray = (int[])typeof(List<int>)
        .GetField("_items", BindingFlags.Instance | BindingFlags.NonPublic)
        .GetValue(largeList);

    var current = 0;
    var copyFrom = 0;

    for (var i = 0; i < indices.Count; i++)
    {
        var copyTo = indices[i];
        if (copyTo < copyFrom)
        {
            //In case the indice is duplicate,
            //The item is already passed
            continue;
        }
        var copyLength = copyTo - copyFrom;
        Array.Copy(largeArray, copyFrom, largeArray, current, copyLength);
        current += copyLength;
        copyFrom = copyTo + 1;
    }
    //Resize the internal array
    largeList.RemoveRange(largeList.Count - indices.Count, indices.Count);

    watch.Stop();
    Console.WriteLine(watch.Elapsed);
    Console.WriteLine(largeList.Count);
}
vernou
  • 6,818
  • 5
  • 30
  • 58
  • It looks like correct implementation if OP starts with an array. Unfortunately as is it copies array couple extra times and hence require 3x memory which in case of 2 billion items array is non-trivial. Converting `Array.Copy` to regular `for` would be the solution. If performance of this code is ok than it would be better to just copy items into new list and skip indexes in the "remove" array - which will have just 2x memory consumption. – Alexei Levenkov Aug 19 '20 at 23:30
  • `Array.Copy` is really more performant, because is implemented by JIT. More information in this question : https://stackoverflow.com/questions/6558266/how-is-array-copy-implemented-in-c – vernou Aug 20 '20 at 07:09
  • 1
    @Alexei-Levenkov`List` is encapsulated `Array`. I edited my answer to get the internal array. – vernou Aug 20 '20 at 07:38
  • 1
    @Vernou that's a really interesting approach. Mixed feelings about it, lol, but definitely interesting! – CoolBots Aug 20 '20 at 14:45
  • 1
    @CoolBots That is what I figured would be the only way to speed yours up, but I definitely wouldn't use it in production :) – NetMage Aug 20 '20 at 18:57
  • Yes, don't use this in prod. In prod, it better to use directly `Array` or convert the list. – vernou Aug 20 '20 at 20:16
  • 3
    I have updated the benchmark in [my answer](https://stackoverflow.com/a/63498191/150605) to include your code. For the larger (10-million element) list I benchmarked against, the other answers generally took 10-20% of the time of the baseline implementation; yours tooks 5-10% of the time _of those other answers_ (i.e. 2% of the baseline). – Lance U. Matthews Aug 23 '20 at 06:06
  • 1
    https://meta.stackoverflow.com/questions/401782/should-answers-that-are-just-for-fun-or-encourage-poor-or-dangerous-practices – Hans Passant Oct 04 '20 at 21:55
  • Is there no other way to get current .NET compilers / JITs to do an efficient `memcpy` over a block of elements? In-place copy is clearly very good for performance, and needs to be done with SIMD vectors to keep up with memory on modern CPUs. I don't know C#, but using reflection to get at the internals of the container seems like an unfortunate requirement. If so, you should highlight that danger in the answer text. BTW, a C++ implementation would be able to use `std::copy` or even `memmove` between parts of a `std::vector` and have it compile efficiently. – Peter Cordes Oct 05 '20 at 08:13
  • I edited the answer to indicate the danger of the reflection on external library. – vernou Oct 05 '20 at 09:46
  • With .Net Core 5.0 it looks like `List.AsSpan` is part of `CollectionMarshal`, which should make this possible without Reflection. – NetMage Oct 05 '20 at 18:53
-1

When you have multiple items to remove from a List, and replacing the List with a new List is not an option, the most efficient way is to use the RemoveAll method instead of the RemoveAt. The RemoveAll rearranges the internal state of the List only once, instead of doing it once per each item removed.

The RemoveAll accepts a Predicate<T> that is invoked once for each item in the list (the large list). Unfortunately this delegate doesn't receive the index of the currently tested item. You could however depend on knowing how the RemoveAll is implemented. The source code reveals that the items are tested sequentially in ascending order. So based on this knowledge you could remove the selected indices from the list, very efficiently, with this three-liner:

HashSet<int> indicesSet = new(indices);
int index = 0;
largeList.RemoveAll(_ => indicesSet.Contains(index++));

This behavior is not documented explicitly though. In theory Microsoft could change the behavior of the RemoveAll method in a future .NET release, breaking the above code. Personally I consider this scenario to be almost improbable, but if you want to be on the safe side you could use a custom RemoveAll implementation that has fixed behavior, like the one found in this answer, which also features a delegate with index. You could use it like this:

HashSet<int> indicesSet = new(indices);
largeList.RemoveAll((_, index) => indicesSet.Contains(index));
Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104