Word-wise super string search for given string

Question

For any input string, we need to find super string by word match in any order. i.e. all words in the input string has to occur in any order in output string. e.g. given data set: "string search" "java string search" "manual c string search equals" "java search code" "c java code search" ...

input: "java search" output: 1) "java string search" 2) "java search code" 3) "c java code search"

input: "search c" output: 1) "manual c string search equals" 2) "c java code search"

This can be done in a very trivial way with word by word matching. Here mainly I am looking for an efficient algo.

Input: A few billion records in given data set (mostly 1 to 10 words length string). I need to find super string for millions of strings. Note: words are extended dictionary ones.

regex comparison for one input string with all the data_set (which is billions in no.) is considerably high. Now I need to repeat the operation for another million (if not billion) input strings! — user2226441, Mar 30 '13 at 06:02

score 1 · Accepted Answer · answered Mar 30 '13 at 06:09

Preprocess your input (if possible), and index the words that appear in the dataset. Generate a mapping from each word to a set of possible output strings. For example, with the dataset

0 string search
1 java string search
2 manual c string search equals
3 java search code
4 c java code search

we get

c {2,4}
code {3,4}
equals {2}
java {1,3,4}
...

Then, searching for the matches for a given input is as simple as intersecting the sets corresponding to the input word:

input: "java c"
output: {1,3,4} intersect {2,4} = {4}

If you store the sets just as sorted lists, intersection can be done in linear time (linear in the total length of the input sets) by scanning across the lists in parallel.

score 0 · Answer 2 · edited May 23 '17 at 12:13

You basically need to find the intersection of two sets of words, input_words and data_words. If the intersection equals input_words, you have a match.

Here are efficient algorithms for set intersection: Efficient list intersection algorithm

An algorithm that comes to my mind and completes in O(n*m) [n = size input, m = size data] is.

Python:

match = True
for word in input.split():
  if word in data_words.split(): # linear search comparing word to each word
    continue
  else:
    match = False
    break

A search on sorted list would be quicker and hash lookups would be even more. Those are detailed in the link above.

Word-wise super string search for given string

2 Answers2

Linked