0

I have an in memory list with a lot of objects (let's say 150000). Each object has a string property that I want to search/filter on, something like this:

var searchTerm = "something";
var result = listOfObjects.Where(o => o.Prop.Contains(searchTerm)).ToList();

This is obviusly very slow. Is there any way to speed it up? I've tried processing in parallel without any benefits. Is there any way to involve hashsets? Or perhaps sort it and do a binary search?

Johan
  • 35,120
  • 54
  • 178
  • 293
  • Check this answer here. It might help. https://stackoverflow.com/questions/1009107/what-net-collection-provides-the-fastest-search – Igor Dimchevski Jun 20 '17 at 16:01
  • Can you provide some timings? How much does it take now? How much does it take with parallel (like `AsParallel`)? How much time do you need it to take? – Evk Jun 20 '17 at 16:02
  • Any solution depends on your particular usage pattern. How often is the source list changed versus how often is it queried? How long are the typical source strings? Are you searching for whole words, or exact substrings? What is acceptably quick? – Tim Rogers Jun 20 '17 at 16:03

2 Answers2

0

Here is what I tried, granted my data is a bit different since I just generated random strings to test filtering on. But here is my example code.

class Program
{
    static void Main(string[] args)
    {
        Start:

        List<Test> TestList = new List<Test>();
        int ObjectsToCreate = 1000000;
        Console.WriteLine($"Creating {ObjectsToCreate} Objects!");
        for (int x = 1; x <= ObjectsToCreate; x++)
        {
            TestList.Add(new Test() { Name = RandomString(100) });
        }
        Console.WriteLine($"Created {TestList.Count} objects.");
        string StringToSearchFor = "A";
        Console.WriteLine($"Benchmarking Now");
        System.Diagnostics.Stopwatch Watch = System.Diagnostics.Stopwatch.StartNew();

        var TestCollection = TestList.Where(Item => Item.Name.Contains(StringToSearchFor));
        Watch.Stop();
        Console.WriteLine($"Elapsed Time With Where Into VAR: {Watch.ElapsedMilliseconds}ms");
        Console.WriteLine($"Elapsed Time With Where Into VAR: {Watch.ElapsedTicks} ticks");

        Watch = System.Diagnostics.Stopwatch.StartNew();
        IEnumerable<Test> TestCollection_ = TestList.Where(Item => Item.Name.Contains(StringToSearchFor));
        Watch.Stop();
        Console.WriteLine($"Elapsed Time With Where Into IEnumerable<Test>: {Watch.ElapsedMilliseconds}ms");
        Console.WriteLine($"Elapsed Time With Where Into IEnumerable<Test>: {Watch.ElapsedTicks} ticks");

        Watch = System.Diagnostics.Stopwatch.StartNew();
        List<Test> TestCollection2 = TestList.Where(Item => Item.Name.Contains(StringToSearchFor)).ToList();
        Watch.Stop();
        Console.WriteLine($"Elapsed Time With Where Into List<Test>: {Watch.ElapsedMilliseconds}ms");
        Console.WriteLine($"Elapsed Time With Where Into List<Test>: {Watch.ElapsedTicks} ticks");

        Watch = System.Diagnostics.Stopwatch.StartNew();
        List<Test> TestCollection3 = TestList.AsParallel().Where(Item => Item.Name.Contains(StringToSearchFor)).ToList();
        Watch.Stop();
        Console.WriteLine($"Elapsed Time With AsParallel First Where Into List<Test>: {Watch.ElapsedMilliseconds}ms");
        Console.WriteLine($"Elapsed Time With AsParallel First Where Into List<Test>: {Watch.ElapsedTicks} ticks");

        Watch = System.Diagnostics.Stopwatch.StartNew();
        List<Test> TestCollection4 = TestList.Where(Item => Item.Name.Contains(StringToSearchFor)).AsParallel().ToList();
        Watch.Stop();
        Console.WriteLine($"Elapsed Time With AsParallel Last Where Into List<Test>: {Watch.ElapsedMilliseconds}ms");
        Console.WriteLine($"Elapsed Time With AsParallel Last Where Into List<Test>: {Watch.ElapsedTicks} ticks");

        Console.ReadLine();
        goto Start;
    }

    public class Test
    {
        public string Name { get; set; }
    }

    private static Random random = new Random();
    public static string RandomString(int length)
    {
        const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        return new string(Enumerable.Repeat(chars, length)
          .Select(s => s[random.Next(s.Length)]).ToArray());
    }
}

And here is the output I get from running it.

Creating 1000000 Objects!
Created 1000000 objects.
Benchmarking Now
Elapsed Time With Where Into VAR: 0ms
Elapsed Time With Where Into VAR: 192 ticks
Elapsed Time With Where Into IEnumerable<Test>: 0ms
Elapsed Time With Where Into IEnumerable<Test>: 4 ticks
Elapsed Time With Where Into List<Test>: 273ms
Elapsed Time With Where Into List<Test>: 934287 ticks
Elapsed Time With AsParallel First Where Into List<Test>: 164ms
Elapsed Time With AsParallel First Where Into List<Test>: 564069 ticks
Elapsed Time With AsParallel Last Where Into List<Test>: 192ms
Elapsed Time With AsParallel Last Where Into List<Test>: 658852 ticks

If I run this same test several times the results of putting that data into VAR goes down to about 7-8 ticks on my machine but exporting to IEnumerable goes down to about 2-3. And this is with 1 million objects. So I'm a bit confused as to what you would define as "very slow". Unless I'm completely misunderstanding something.

Edit: My example with VAR and IEnumerable are not as valid as I originally thought, see the comments on my answer below.

ECourant
  • 108
  • 1
  • 14
  • You miss that `TestList.Where(...)` without materializing (like `ToList()`) does not actually searches anything, hence 0ms. – Evk Jun 20 '17 at 16:12
  • @Evk That explains quite a bit, so the is the data then actually retrieved when the object that I'm assigning that data to from the `Where()` is referenced? – ECourant Jun 20 '17 at 16:15
  • 1
    `Where` is actually executed when you enumerate the result, such as calling `ToList()` or `First()` or just enumerating it in foreach loop. – Evk Jun 20 '17 at 16:16
  • So `List TestCollection3 = TestList.AsParallel().Where(Item => Item.Name.Contains(StringToSearchFor)).ToList();` would actually be the best representation of filtering the data into a proper object the fastest (atleast in my example)? Glad to learn something today. – ECourant Jun 20 '17 at 16:19
0

There are some things I can think about.

  • At first start with the prediacte you want to test on. Contains is pretty expensive, compared to StartsWith and EndsWith. Therefore, use a performance optimal predicate.
  • Parallelize the filter process. Hint Parallel Namespace
  • Apply a function (hash) to all objects and insert them into a Dictionary. (To enable O(1) access time) Add some other properties (salt) to the hash if the "key" does occur more than once.
  • Consider using a database

Furthermore, it strongly depends on your object structure. Does the object provide any additional information we could use for the comparison? If yes:

  • Chunk the data set into groups (depending on property values) and start searching (parallel) in the group with the highest probability. (Heuristic function)
  • Use more "complex" structures, like trees, to minimize the data amount

Depending on your program's behaviour, avoid loading all data sets. Load (based on a heuristic value) only a specific amount of data sets (lets say 10.000) and use a displacement strategy if the value isn't preset, to fetch new data.

Bin4ry
  • 652
  • 9
  • 34