10

I need to be able to search over a collection of approx 2 million items in C#. Search should be possible over multiple fields. Simple string-matching is good enough.

Using an external dependency like a database is not an option, but using an in-memory database would be OK.

The main goal is to do this memory-efficient.

The type in the collection is quite simple and has no long strings:

public class Item
{
    public string Name { get; set; } // Around 50 chars
    public string Category { get; set; } // Around 20 chars
    public bool IsActive { get; set; }
    public DateTimeOffset CreatedAt { get; set; }
    public IReadOnlyList<string> Tags { get; set; } // 2-3 items
}

Focus and requirements

Clarification of focus and requirements:

  • No external dependencies (like a database)
  • Memory-efficient (below 2 GB for 2 million items)
  • Searchable items in collection (must be performant)

Today's non-optimal solution

Using a simple List<T> over above type, either as a class or a struct, still requires about 2 GB of memory.

Is there a better way?

Seb Nilsson
  • 26,200
  • 30
  • 103
  • 130
  • 1
    How large is each `Item`? – Robert Harvey Jun 11 '20 at 13:58
  • related: [I have a list with over a million objects in it, trying to find the fastest way to search through it](https://stackoverflow.com/questions/20462716/i-have-a-list-with-over-a-million-objects-in-it-trying-to-find-the-fastest-way) – Drag and Drop Jun 11 '20 at 14:04
  • 7
    Please read [What is the X Y problem](https://meta.stackexchange.com/a/66378) and edit your post accordingly. – Robert Harvey Jun 11 '20 at 14:15
  • Use a dictionary which uses a binary hash so lookup is log2(N) instead of N/2. – jdweng Jun 11 '20 at 14:24
  • 3
    The 2 millions objects will dominate the memory requirements of the program, rather than the list containing them. Can you load it by chunks instead of all at once? – Alejandro Jun 11 '20 at 14:31
  • Are you looking for the best way to do what? A quick search? Compact storage? – Alexander Petrov Jun 11 '20 at 15:21
  • I would suggest Hot/Cold data splitting. This will give you the maximum search speed due to the efficient use of the processor cache. – Alexander Petrov Jun 11 '20 at 15:26
  • How many `Tags` has each `Item` on average? Also how many different `Tags` exist in total? – Theodor Zoulias Jun 11 '20 at 16:49
  • Hello all. Thanks for all the great follow up questions. I've tried to clarify the question as much as possible, accordingly. – Seb Nilsson Jun 11 '20 at 21:51
  • Inserting 2 million `Item`s in `List` did not consume 2GB but instead 260ish MB. – JohanP Jun 15 '20 at 07:08
  • @JohanP It would be great if that was so. Even in an isolated test, just adding 2 million fake items (classes/struct, not primitives) took the memory to 2GB for me. What could be going on? – Seb Nilsson Jun 15 '20 at 07:39
  • 1
    Memory should be around 500M (or your numbers are not the real ones, like you put long strings in tags for example). The real question is search. Optimizing search depends on what/when/how do you want to search. Of course you could build one or more index, and that costs memory. Note that you could use a database such as SQLite which can completely reside in memory. It has index features, and also full text and Rtree features. Plus it's now shipped with Windows 10 and Azure (winsqlite3.dll in \windows\system32) – Simon Mourier Jun 15 '20 at 09:09
  • @SimonMourier Searching is easy with simple (dumb) LINQ-queries :) But SQLite is a good idea. I’ll try that and see what happens. – Seb Nilsson Jun 15 '20 at 10:03
  • Its worth noting that the order you list your members in will/should affect the memory layout. Personally, i think if you need such fine grained control over your memory layout, i'd considering writing at least this part of your program in a different language. I love C#, but it prefers ease of use over (for lack of a better term) programmer choice. – Taekahn Jun 15 '20 at 16:58
  • 4
    This question is lacking important information. Is this a one time search, where is this data stored, define Simple string-matching, is this case sensitive, if you store the data randonly accessable and back by a file - would an indexing structure be ok. – TheGeneral Jun 18 '20 at 04:56
  • @TheGeneral Stored in memory. Multiple searches. Everything else is flexible, to find the best solution possible. – Seb Nilsson Jun 18 '20 at 06:08
  • In C# the initial capacity of a List is 0(zero) but as you add items to the list, the capacity of the internal array will double as more items are added, so 4, 8, 16, 32, 64, etc. This is an implementation detail, and may change with each framework update. To minimize the memory footprint you can use an Array of tags instead of a List. – Alexander Pope Jun 18 '20 at 17:49
  • 2
    Also, why is using an external database not possible? Searching large data sets is the very reason why they were created. I assume you don't care about fast searches because with a list you are now getting O(n) or worse depending on how the query is defined. – Alexander Pope Jun 18 '20 at 17:54
  • 1
    How, often are the 2 million records changing? or are they static? – TheGeneral Jun 19 '20 at 06:29

6 Answers6

6

The most significant memory hog in your class is the use of a read-only list. Get rid of it and you will reduce memory footprint by some 60% (tested with three tags):

public class Item
{
    public string Name { get; set; }
    public string Category { get; set; }
    public bool IsActive { get; set; }
    public DateTimeOffset CreatedAt { get; set; }
    public string Tags { get; set; } // Semi-colon separated
}

Also, consider using DateTime instead of DateTimeOffset. That will further reduce memory footprint with around 10%.

l33t
  • 18,692
  • 16
  • 103
  • 180
4

There are many things you can do in order to reduce the memory footprint of your data, but probably the easiest thing to do with the greatest impact would be to intern all strings. Or at least these that you expect to be repeated a lot.

// Rough example (no checks for null values)
public class Item
{
    private string _name;
    public string Name
    {
        get { return _name; }
        set { _name = String.Intern(value); }
    }

    private string _category;
    public string Category
    {
        get { return _category; }
        set { _category = String.Intern(value); }
    }

    public bool IsActive { get; set; }
    public DateTimeOffset CreatedAt { get; set; }

    private IReadOnlyList<string> _tags;
    public IReadOnlyList<string> Tags
    {
        get { return _tags; }
        set { _tags = Array.AsReadOnly(value.Select(s => String.Intern(s)).ToArray()); }
    }
}

Another thing you could do, more difficult and with smaller impact, would be to assign the same IReadOnlyList<string> object to items with identical tags (assuming that many items with identical tags exist in your data).


Update: Also don't forget to call TrimExcess to the list after you fill it with items, in order to get rid of the unused capacity.

This method can be used to minimize a collection's memory overhead if no new elements will be added to the collection.

Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104
3

With 2 GB (i.e. 2 billion bytes) for 2 million items, we have 1000 bytes per item, which should be more than enough to do this in polynomial time.

If I understand your requirements correctly, you have 2 million instances of a complex type, and you want to match complete strings / string prefixes / string infixes in any of their fields. Is that correct? I'm going to assume the hardest case, searching infixes, i.e. any part of any string.

Since you have not provided a requirement that new items be added over time, I am going to assume this is not required.

You will need to consider how you want to compare. Are there cultural requirements? Or is ordinal (i.e. byte-by-byte) comparison acceptable?

With that out of the way, let's get into an answer.

Browsers do efficient in-memory text search for web pages. They use data structures like Suffix Trees for this. A suffix tree is created once, in linear time time linear in the total word count, and then allows searches in logarithmic time time linear in the length of the word. Although web pages are generally smaller than 2 GB, linear creation and logarithmic searching scale very well.

  • Find or implement a Suffix Tree.
  • The suffix tree allows you to find substrings (with time complexity O(log N) O(m), where m is the word length) and get back the original objects they occur in.
  • Construct the suffix tree once, with the strings of each object pointing back to that object.
  • Suffix trees compact data nicely if there are many common substrings, which tends to be the case for natural language.
  • If a suffix tree turns out to be too large (unlikely), you can have an even more compact representation with a Suffix Array. They are harder to implement, however.

Edit: On memory usage

As the data has more common prefixes (e.g. natural language), a suffix tree's memory usage approaches the memory required to store simply the strings themselves.

For example, the words fire and firm will be stored as a parent node fir with two leaf nodes, e and m, thus forming the words. Should the word fish be introduced, the node fir will be split: a parent node fi, with child nodes sh and r, and the r having child nodes e and m. This is how a suffix tree forms a compressed, efficiently searchable representation of many strings.

With no common prefixes, there would simply be each of the strings. Clearly, based on the alphabet, there can only be so many unique prefixes. For example, if we only allow characters a through z, then we can only have 26 unique first letters. A 27th would overlap with one of the existing words' first letter and thus get compacted. In practice, this can save lots of memory.

The only overhead comes from storing separate substrings and the nodes that represent and connect them.

Timo
  • 7,992
  • 4
  • 49
  • 67
  • will suffix tree handle search of all desired fields of complex object? eg Name Tags fields should has separate trees? – Vladimir Shmidt Jul 05 '20 at 10:46
  • @VladimirShmidt That is up to you. You should use a single suffix tree. You search for [partial] words in that tree. Any field whose words you store in the suffix tree is practically included. The same goes for the resulting "value" associated with those words: You decide what value to insert for a particular key. It's going to be something that links back to the original source object that you got the word from, such as its ID. – Timo Jul 05 '20 at 14:20
1

You can do theses dots, then you will see if there is trouble:


Do you need to mutate properties, or just search object in the collection?

If you just want research, and if your properties repeat themselves a lot, you can have one property used by many objects.

In this way, the value is stored only one time, and the object only store the reference.

You can do this only if you dont want to mutate the property.

As exemple, if two objects have the same category:

public class Category
{
    public string Value { get; }

    public Category(string category)
    {
        Value = category;
    }
}

public class Item
{
    public string Name { get; set; }
    public Category Category { get; set; }
    public bool IsActive { get; set; }
    public DateTimeOffset CreatedAt { get; set; }
    public IReadOnlyList<string> Tags { get; set; }
}


class Program
{
    public void Init()
    {
        Category category = new Category("categoryX");

        var obj1 = new Item
        {
            Category = category
        };

        var obj2 = new Item
        {
            Category = category
        };
    }
}
Thibaut
  • 342
  • 1
  • 7
1

I would not expect any major memory issues with 2M objects if you are running 64-bits. There is a max size limit of lists of 2Gb, but a reference is only 8 bytes, so the list should be well under this limit. The total memory usage will depend mostly on how large the strings are. There will also be some object overhead, but this is difficult to avoid if you need to store multiple strings.

Also, how do you measure memory? The .Net runtime might over allocate memory, so the actual memory usage of your object might be significantly lower than the memory reported by windows. Use a memory profiler to get an exact count.

If strings are duplicated between many objects there might be a major win if you can deduplicate them, making use of the same instance.

using a struct instead of a class could avoid some overhead, so I made some tests:

  • list of objects using LINQ - 46ms
  • list of objects using for loop - 16ms
  • list of structs using for loop - 250ms
  • list of readonly structs with ref-return using for loop: 180ms

The exact times will depend on what query you are doing, these numbers are mostly for comparison.

Conclusion is that a regular List of objects with a regular for loop is probably the fastest. Also, iterating over all objects is quite fast, so in most cases it should not cause a major performance issue.

If you need better performance you will need to create some kind of index so you can avoid iterating over all items. Exact strategies for this is difficult to know without knowing what kinds of queries you are doing.

One option could be to use some variant of in memory database, this could provide most of the indexing functionality. SQLite would be one example

JonasH
  • 28,608
  • 2
  • 10
  • 23
0

If the categories could be defined as an Enum, you can map it to bits, that would help in reducing the size pretty much. From 20bytes to say 2bytes(short int), this could approximately save around 36M bytes for 2M objects.

Adithya
  • 19
  • 1