I am trying to figure out how to filter a very large set of documents based on keyword matching.
I have more than 20 million entries in my SQL database, each with an ID and several text fields, and I want to get all IDs whose text matches a set of keywords. This includes more complex expressions like:
(term1 NEAR term2 NEAR term3) AND NOT "A phrase" AND @fieldXYZ "wildcards%aswell*"
The results do not need to be scored, sorted or ranked in any way.
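To make the requirement concrete, here is a toy Python sketch of the semantics I am after: every matching ID comes back, and nothing is scored or sorted. It only covers AND over terms plus phrase exclusion (no NEAR, field, or wildcard operators), and the names `build_index` and `filter_ids` are made up for illustration, not taken from any library.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercase token to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index

def filter_ids(index, docs, all_of=(), none_of_phrases=()):
    """Return every ID that contains all terms and none of the phrases."""
    if all_of:
        # Intersect the postings sets; no score is computed anywhere.
        postings = [index.get(term.lower(), set()) for term in all_of]
        candidates = set.intersection(*postings)
    else:
        candidates = set(docs)
    # Phrase exclusion is a plain substring check on the stored text.
    return {
        doc_id
        for doc_id in candidates
        if not any(p.lower() in docs[doc_id].lower() for p in none_of_phrases)
    }

# Hypothetical sample data standing in for the real table.
docs = {
    1: "term1 sits close to term2 and term3",
    2: "term1 term2 term3 but also A phrase here",
    3: "only term1 appears",
}
index = build_index(docs)
matching = filter_ids(
    index, docs,
    all_of=["term1", "term2", "term3"],
    none_of_phrases=["A phrase"],
)
print(sorted(matching))
```

This obviously would not scale to 20 million rows as written; it is only meant to show that the result is a plain set of IDs rather than a ranked top-N list.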
From what I understand, the strength of Lucene/Solr, Sphinx and Elasticsearch is returning the top-ranked documents very fast, but they are not really designed to return all matching documents.
I know that this is possible with a custom Collector in Lucene (see "What's the most efficient way to retrieve all matching documents from a query in Lucene, unsorted?") and possibly with cursors/scrolling in Solr/Elasticsearch. Is there any other technology that is specifically optimized for this problem?