
Given a predefined set of phrases, I'd like to perform a search based on user's query. For example, consider the following set of phrases:

index      phrase
-----------------------------------------
0          Stack Overflow
1          Math Overflow
2          Super User
3          Webmasters
4          Electrical Engineering
5          Programming Jokes
6          Programming Puzzles
7          Geographic Information Systems 

The expected behaviour is:

query         result
------------------------------------------------------------------------
s             Stack Overflow, Super User, Geographic Information Systems
web           Webmasters
over          Stack Overflow, Math Overflow
super u       Super User
user s        Super User
e e           Electrical Engineering
p             Programming Jokes, Programming Puzzles
p p           Programming Puzzles

To implement this behaviour I used a trie. Every node in the trie has an array of indices (empty initially).

To insert a phrase into the trie, I first break it into words. For example, Programming Puzzles has index = 6. Therefore, I add 6 to all the following nodes:

p
pr
pro
prog
progr
progra
program
programm
programmi
programmin
programming
pu
puz
puzz
puzzl
puzzle
puzzles

The problem is, when I search for the query prog p, I first get a list of indices for prog which is [5, 6]. Then, I get a list of indices for p which is [5, 6] as well. Finally, I calculate the intersection between the two, and return the result [5, 6], which is obviously wrong (should be [6]).
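
For reference, here is a minimal sketch in JavaScript of the approach described above (not my actual code; names are illustrative), including the intersection step that produces the wrong result:

// Minimal sketch of the approach described above. Each trie node stores the
// set of phrase indices that reach it; a query intersects the sets of its words.
var Node = function () {
    this.indices = {};   // phrase index -> true
    this.children = {};
};

function insertWord(root, word, index) {
    var node = root;
    for (var i = 0; i < word.length; i++) {
        var ch = word.charAt(i);
        if (!node.children.hasOwnProperty(ch)) {
            node.children[ch] = new Node();
        }
        node = node.children[ch];
        node.indices[index] = true;   // every prefix of the word records the phrase index
    }
}

function lookup(root, prefix) {
    var node = root;
    for (var i = 0; i < prefix.length && node; i++) {
        node = node.children[prefix.charAt(i)];
    }
    return node ? node.indices : {};
}

function search(root, query) {
    var sets = query.trim().toLowerCase().split(/\s+/).map(function (word) {
        return lookup(root, word);
    });
    // Plain intersection of the per-word index sets -- this is the step that
    // returns [5, 6] instead of [6] for the query "prog p".
    return Object.keys(sets.reduce(function (acc, set) {
        return Object.keys(acc).filter(function (i) {
            return set.hasOwnProperty(i);
        }).reduce(function (o, i) { o[i] = true; return o; }, {});
    }));
}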

How would you fix this?

Misha Moroshko
  • so it's your library and you would like to make changes to it so it handles this case correctly, right? – Miroshko May 08 '15 at 12:34
  • So what should be the result for `p prog`? I'm assuming zero matches, right? If so, I'd say the way forward would be to attach a word-index on each node-label. – aioobe May 08 '15 at 12:38
  • @aioobe For `p prog` we should get `Programming Puzzles`. In general, the result for `word1 word2 ... wordN` and for any permutation like `wordN word2 word1 ...`, should be the same. – Misha Moroshko May 08 '15 at 12:41
  • @MishaMoroshko, then it seems like a really hard problem to solve. If all permutations should give the same result (search term position is irrelevant) but words may not be "reused" as they are in your example, then it seems to me that you could reduce the [Set Cover Problem](http://en.wikipedia.org/wiki/Set_cover_problem) to your problem, which means that your problem is NP-hard and can only be solved using brute force. (Maybe I'm wrong or misunderstood something, but that's my immediate feeling.) – aioobe May 08 '15 at 12:49
  • @aioobe The main difference from set cover is that you have a candidate for the covering set; you don't need to "invent" the covering set, or give a minimal one. Verifying whether a given set is indeed a set cover can be done efficiently (definition of NP), and I am not sure it is verification of set cover in the first place, but that's debatable – amit May 11 '15 at 09:50
  • are you open to using jquery? – Dave Alperovich May 11 '15 at 16:19
  • How many phrases and words would you expect to store, and how many words could be expected in one phrase? And would they all be stored in the browser? – גלעד ברקן May 17 '15 at 02:21
  • @גלעדברקן ~20000 phrases, ~3-7 words in each phrase. It will be run in node.js on the server. – Misha Moroshko May 17 '15 at 09:58

4 Answers


You can solve it as a Graph Matching Problem in a Bipartite Graph.

For each (document, query) pair, define the graph:

G = (V, E), where
V = { t1 | t1 is a term in the query } U { t2 | t2 is a term in the document }
E = { (t1, t2) | t1 is a match for t2 }

Intuitively: you have a vertex for each term in the query, a vertex for each term in the document, and an edge between a document term and a query term, only if the query term matches the document term. You have already solved this part with your trie.

This graph is bipartite: there are only edges between the "query vertices" and the "document vertices" (and never between two vertices of the same type).

Now, solve the maximum matching problem on this bipartite graph and obtain a maximum matching {(t1_1, t2_1), ..., (t1_k, t2_k)}.

Your algorithm should return a document d for a query q with m terms if (and only if) all m terms are satisfied, i.e. the maximum matching has k = m.

In your example, for query="prog p" and document="Programming Jokes", you get the bipartite graph with the matching below (or with "p" matched to "Programming" instead; it doesn't matter which):

[figure: bipartite graph for query "prog p" and document "Programming Jokes"; only one of "prog"/"p" can be matched to "Programming"]

And for the same query and document="Programming Puzzles", you get the bipartite graph with the matching:

[figure: bipartite graph for query "prog p" and document "Programming Puzzles"; "prog" is matched to "Programming" and "p" to "Puzzles"]

As you can see, in the first example there is no matching that covers all the query terms, so you "reject" the document. In the second example all terms are matched, so you return it.

For performance, you can run the suggested algorithm only on the subset of phrases that already passed your initial filter (the intersection of documents that have a match for every term).
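
A minimal sketch of this check in JavaScript (not part of the answer's original text; allTermsMatched and matches are illustrative names, and the trie lookup is replaced here by a plain prefix test):

// Decide whether every query term can be matched to a *distinct* document word,
// using a simple augmenting-path maximum matching on the bipartite graph.
// matches() stands in for the trie lookup: a query term matches a document
// word if it is a prefix of that word (case-insensitive).
function matches(queryTerm, docWord) {
    return docWord.toLowerCase().indexOf(queryTerm.toLowerCase()) === 0;
}

function allTermsMatched(queryTerms, docWords) {
    var matchedBy = docWords.map(function () { return -1; }); // doc word -> query term index

    function tryAssign(t, visited) {
        for (var w = 0; w < docWords.length; w++) {
            if (visited[w] || !matches(queryTerms[t], docWords[w])) continue;
            visited[w] = true;
            // Take the word if it is free, or if its current owner can be moved elsewhere.
            if (matchedBy[w] === -1 || tryAssign(matchedBy[w], visited)) {
                matchedBy[w] = t;
                return true;
            }
        }
        return false;
    }

    for (var t = 0; t < queryTerms.length; t++) {
        if (!tryAssign(t, docWords.map(function () { return false; }))) {
            return false;   // some query term cannot be covered, so the matching has k < m
        }
    }
    return true;
}

// allTermsMatched(['prog', 'p'], ['Programming', 'Jokes'])   -> false (reject)
// allTermsMatched(['prog', 'p'], ['Programming', 'Puzzles']) -> true  (return)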

amit
  • This is a perfect solution; the two problems are mutually reducible. Note: you can also pre-filter when the number of search terms exceeds the number of phrase words. – Kittsil May 11 '15 at 19:44
  • Downvoter: Why do you think the answer is not helpful? Mind elaborating so I can learn? – amit May 13 '15 at 06:18
  • I didn't downvote you, but I can explain why your solution is clearly not the best. This algorithm requires storing in the trie not only references to all matched phrases, but also references to all matched words, and moreover per matched phrase, so the memory complexity is much worse now. Bipartite matching is also not free: with traditional implementations its worst-case complexity is at least O(n*m), where n is the number of query words and m is the number of phrase words. And you do it for each matched phrase, so in the worst case you need to multiply this time complexity by the number of phrases. – dened May 13 '15 at 11:59

Key Observation

We can use the fact that two words in a query can match the same word in a phrase only if one query word is a prefix of the other (or they are equal). So if we process the query words in descending lexicographic order (prefixes come after their "superwords"), we can safely remove a phrase word at the first match. Doing so leaves no possibility of matching the same phrase word twice. It is safe because a prefix matches a superset of the phrase words that its "superwords" can match, and a pair of query words, where neither is a prefix of the other, always matches disjoint sets of phrase words.

We don't have to remove words from the phrases or the trie "physically"; we can do it "virtually".

Implementation of the Algorithm

var PhraseSearch = function () {   
    var Trie = function () {
        this.phraseWordCount = {};
        this.children = {};
    };

    Trie.prototype.addPhraseWord = function (phrase, word) {
        if (word !== '') {
            var first = word.charAt(0);

            if (!this.children.hasOwnProperty(first)) {
                this.children[first] = new Trie();
            }
            var rest = word.substring(1);
            this.children[first].addPhraseWord(phrase, rest);
        }
        if (!this.phraseWordCount.hasOwnProperty(phrase)) {
            this.phraseWordCount[phrase] = 0;
        }
        this.phraseWordCount[phrase]++;
    };

    Trie.prototype.getPhraseWordCount = function (prefix) {
        if (prefix !== '') {
            var first = prefix.charAt(0);

            if (this.children.hasOwnProperty(first)) {
                var rest = prefix.substring(1);
                return this.children[first].getPhraseWordCount(rest);
            } else {
                return {};
            }
        } else {
            return this.phraseWordCount;
        }
    }

    this.trie = new Trie();
}

PhraseSearch.prototype.addPhrase = function (phrase) {
    var words = phrase.trim().toLowerCase().split(/\s+/);
    words.forEach(function (word) {
        this.trie.addPhraseWord(phrase, word);
    }, this);
}

PhraseSearch.prototype.search = function (query) {
    var answer = {};
    var phraseWordCount = this.trie.getPhraseWordCount('');
    for (var phrase in phraseWordCount) {
        if (phraseWordCount.hasOwnProperty(phrase)) {
            answer[phrase] = true;
        }
    }

    var prefixes = query.trim().toLowerCase().split(/\s+/);

    prefixes.sort();
    prefixes.reverse();

    var prevPrefix = '';
    var superprefixCount = 0;

    prefixes.forEach(function (prefix) {
        if (prevPrefix.indexOf(prefix) !== 0) {
            superprefixCount = 0;
        }
        phraseWordCount = this.trie.getPhraseWordCount(prefix);

        function phraseMatchedWordCount(phrase) {
            return phraseWordCount.hasOwnProperty(phrase) ? phraseWordCount[phrase] - superprefixCount : 0;
        }

        for (var phrase in answer) {
            if (answer.hasOwnProperty(phrase) && phraseMatchedWordCount(phrase) < 1) {
                delete answer[phrase];
            }
        }

        prevPrefix = prefix;
        superprefixCount++;
    }, this);

    return Object.keys(answer);
}

function test() {
    var phraseSearch = new PhraseSearch();

    var phrases = [
        'Stack Overflow',
        'Math Overflow',
        'Super User',
        'Webmasters',
        'Electrical Engineering',
        'Programming Jokes',
        'Programming Puzzles',
        'Geographic Information Systems'
    ];

    phrases.forEach(phraseSearch.addPhrase, phraseSearch);

    var queries = {
        's':       'Stack Overflow, Super User, Geographic Information Systems',
        'web':     'Webmasters',
        'over':    'Stack Overflow, Math Overflow',
        'super u': 'Super User',
        'user s':  'Super User',
        'e e':     'Electrical Engineering',
        'p':       'Programming Jokes, Programming Puzzles',
        'p p':     'Programming Puzzles'
    };

    for(var query in queries) {
        if (queries.hasOwnProperty(query)) {
            var expected = queries[query];
            var actual = phraseSearch.search(query).join(', ');

            console.log('query: ' + query);
            console.log('expected: ' + expected);
            console.log('actual: ' + actual);
        }
    }
}

One can test this code here: http://ideone.com/RJgj6p

Possible Optimizations

  • Storing the phrase word count in each trie node is not very memory efficient. But by implementing a compressed trie it is possible to reduce the worst-case memory complexity to O(n m), where n is the number of different words in all the phrases and m is the total number of phrases.

  • For simplicity, I initialize answer by adding all the phrases. A more time-efficient approach is to initialize answer with only the phrases matched by the query word that matches the fewest phrases, then intersect with the phrases of the query word matching the second-fewest, and so on (see the sketch below).
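
For illustration, a small sketch of this second idea (ignoring the superprefix bookkeeping for brevity; the function name is made up): look up the candidate sets for all query words first, then intersect them starting from the smallest set.

// Intersect the per-word candidate sets starting from the smallest one, so the
// working set never grows beyond the rarest query word's matches.
function intersectSmallestFirst(candidateSets) {
    candidateSets.sort(function (a, b) {
        return Object.keys(a).length - Object.keys(b).length;
    });
    return candidateSets.reduce(function (acc, set) {
        var result = {};
        for (var phrase in acc) {
            if (acc.hasOwnProperty(phrase) && set.hasOwnProperty(phrase)) {
                result[phrase] = true;
            }
        }
        return result;
    });
}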

Relevant Differences from the Implementation Referenced in the Question

  1. In each trie node I store not only the phrase references (ids) matched by the subtrie, but also the number of matched words in these phrases. So the result of a match is not only the set of matched phrase references, but also the number of matched words in each of those phrases.
  2. I process query words in descending lexicographic order.
  3. I subtract the number of superprefixes (query words of which the current query word is a prefix) from the current match results (using the variable superprefixCount), and a phrase is considered matched by the current query word only when the resulting number of matched words in it is greater than zero. As in the original implementation, the final result is the intersection of the matched phrases.

As one can see, the changes are minimal and the asymptotic complexities (both time and memory) are unchanged.

dened

If the set of phrases is fixed and does not contain long phrases, you could create not one trie but n tries, where n is the maximum number of words in one phrase.

In the i-th trie, store the i-th word of each phrase. Call it the trie with label i.

To process a query with m words, consider the following algorithm:

  1. For each phrase we will store the lowest trie label at which a word from this phrase was found. Denote it d[j], where j is the phrase index. Initially d[j] = -1 for every phrase j.
  2. Search for the first query word in each of the n tries.
  3. For each phrase j, find a trie label that is greater than d[j] and at which a word from this phrase was found. If there are several such labels, pick the smallest one. Denote this label c[j].
  4. If there is no such label, the phrase cannot be matched. You can mark this case with d[j] = n + 1.
  5. If there is a c[j] with c[j] > d[j], then assign d[j] = c[j].
  6. Repeat for every remaining query word.
  7. Every phrase with -1 < d[j] < n is matched.

This is not very efficient. To improve performance, store only the usable values of the d array: after the first query word, keep only the phrases matched by it, and instead of assigning d[j] = n + 1, delete index j. Then process only the phrase indices still stored. A sketch of the algorithm follows below.
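
A rough sketch of the algorithm as described (not the answer author's code; for brevity a plain prefix test on word positions stands in for the n tries, with the word position playing the role of the trie label):

// Positions 0..n-1 in each phrase play the role of the trie labels.
function matchPhrases(phrases, query) {
    var phraseWords = phrases.map(function (p) { return p.toLowerCase().split(/\s+/); });
    var queryWords = query.trim().toLowerCase().split(/\s+/);

    var d = phraseWords.map(function () { return -1; });       // step 1: d[j] = -1
    var alive = phraseWords.map(function () { return true; });

    queryWords.forEach(function (queryWord) {                  // steps 2 and 6
        phraseWords.forEach(function (words, j) {
            if (!alive[j]) return;
            var c = -1;                                        // step 3: smallest label > d[j]
            for (var i = d[j] + 1; i < words.length; i++) {
                if (words[i].indexOf(queryWord) === 0) { c = i; break; }
            }
            if (c === -1) alive[j] = false;                    // step 4: phrase cannot match
            else d[j] = c;                                      // step 5
        });
    });

    return phrases.filter(function (p, j) {                    // step 7
        return alive[j] && d[j] > -1;
    });
}

// As noted in the comments below, word order matters with this scheme:
// matchPhrases(phrases, 'p p') returns 'Programming Puzzles', but
// matchPhrases(phrases, 'o s') does not return 'Stack Overflow'.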

Fefer_Ivan
  • I am not following how you are planning to do the "matching" between a query term and a document term. Specifically, what exactly you are trying to do in (4) is unclear. You broke everything else down into details, but that seems to be the core of the approach, and it's blurry. – amit May 12 '15 at 07:51
  • @Fefer_Ivan your algorithm doesn't work for permuted queries, does it? For example, `o s` doesn't match `Stack Overflow`. Probably you need to sort both phrase and query words first? But it still won't be an optimal solution. – dened May 15 '15 at 07:38

After some thought I came up with an idea similar to dened's: in addition to the index of a matched phrase, each prefix also records how many words it is a prefix of in that phrase. During the query that number can then be reduced by the number of the prefix's superfixes (longer query words of which it is a prefix) among the other query words, and the returned results include only phrases with at least as many matched words as there are query words.

We can implement an additional small tweak to avoid large cross-checks by adding (for the English language) at most approximately (26 choose 2) + (26 choose 3), and possibly an additional (26 choose 4), special elements to the trie that refer to ordered first-letter combinations. When a phrase is inserted, the special elements in the trie referring to its 2- and 3-letter first-letter combinations receive its index. Match results from longer query words can then be cross-checked against these. For example, if our query is "Geo i", the match results for "Geo" would be cross-checked against the special trie element "g-i", which hopefully has significantly fewer match results than "i".

Also, depending on the specific circumstances, large cross-checks could at times be more efficiently handled in parallel (for example, via a bitset &).
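
A minimal sketch of the first-letter pre-filter (pairs only; buildPairIndex and pairCandidates are illustrative names, not from the answer's code):

// Map every ordered pair of first letters occurring in a phrase (e.g. "g-i")
// to the set of phrase indices containing words starting with those letters.
function buildPairIndex(phrases) {
    var pairIndex = {};
    phrases.forEach(function (phrase, i) {
        var initials = phrase.toLowerCase().split(/\s+/).map(function (w) {
            return w.charAt(0);
        });
        initials.sort();
        for (var a = 0; a < initials.length; a++) {
            for (var b = a + 1; b < initials.length; b++) {
                var key = initials[a] + '-' + initials[b];
                (pairIndex[key] = pairIndex[key] || {})[i] = true;
            }
        }
    });
    return pairIndex;
}

// For a two-word query like "Geo i", cross-check the trie matches for the
// longer word against the (usually much smaller) set stored under "g-i"
// instead of against all matches for the single letter "i".
function pairCandidates(pairIndex, queryWords) {
    var key = queryWords.map(function (w) {
        return w.charAt(0).toLowerCase();
    }).sort().join('-');
    return pairIndex[key] || {};
}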

גלעד ברקן
  • +1. Isn't it more intuitive and simpler to just initialize the result (`answer` in my solution) with the precalculated answer for the combination of 2/3/4 distinct initial letters of the query terms? But either way, this tweak won't help when all the query terms start with the same letter, will it? – dened May 18 '15 at 05:42
  • Or more generally, it won't help if the letter combination match too many phrases. – dened May 18 '15 at 07:31
  • But I just noticed that you probably already indicated this in your answer by using the word 'hopefully'. :) – dened May 18 '15 at 07:47
  • @dened I used the term "approximately" 26 choose 2 to include the combinations with repeated first letters. Yes, the first-letter combination is probably more likely to include fewer references than single letters, but it is not guaranteed to include fewer than longer prefixes. What do you mean by "initialize the result...with the precalculated answer?" – גלעד ברקן May 18 '15 at 14:22
  • For example, in my solution, `answer` initially contains all the phrases, and while processing query words I remove unmatched phrases. But we can start with the phrases matched by a first-letter combination instead. And it is possible to make precalculations for all the combinations beforehand. – dened May 18 '15 at 15:30