With 2 GB (i.e. 2 billion bytes) for 2 million items, we have 1000 bytes per item, which should be more than enough to do this in polynomial time.
If I understand your requirements correctly, you have 2 million instances of a complex type, and you want to match complete strings / string prefixes / string infixes in any of their fields. Is that correct? I'm going to assume the hardest case, searching infixes, i.e. any part of any string.
Since you have not provided a requirement that new items be added over time, I am going to assume this is not required.
You will need to consider how you want to compare. Are there cultural requirements? Or is ordinal (i.e. byte-by-byte) comparison acceptable?
With that out of the way, let's get into an answer.
Browsers do efficient in-memory text search for web pages. They use data structures like Suffix Trees for this. A suffix tree is created once, in linear time time linear in the total word count, and then allows searches in logarithmic time time linear in the length of the word. Although web pages are generally smaller than 2 GB, linear creation and logarithmic searching scale very well.
- Find or implement a Suffix Tree.
- The suffix tree allows you to find substrings (with time complexity
O(log N) O(m), where m is the word length) and get back the original objects they occur in.
- Construct the suffix tree once, with the strings of each object pointing back to that object.
- Suffix trees compact data nicely if there are many common substrings, which tends to be the case for natural language.
- If a suffix tree turns out to be too large (unlikely), you can have an even more compact representation with a Suffix Array. They are harder to implement, however.
Edit: On memory usage
As the data has more common prefixes (e.g. natural language), a suffix tree's memory usage approaches the memory required to store simply the strings themselves.
For example, the words fire
and firm
will be stored as a parent node fir
with two leaf nodes, e
and m
, thus forming the words. Should the word fish
be introduced, the node fir
will be split: a parent node fi
, with child nodes sh
and r
, and the r
having child nodes e
and m
. This is how a suffix tree forms a compressed, efficiently searchable representation of many strings.
With no common prefixes, there would simply be each of the strings. Clearly, based on the alphabet, there can only be so many unique prefixes. For example, if we only allow characters a
through z
, then we can only have 26 unique first letters. A 27th would overlap with one of the existing words' first letter and thus get compacted. In practice, this can save lots of memory.
The only overhead comes from storing separate substrings and the nodes that represent and connect them.