find multiple needles in haystack - String Searching

Question

Question: How to find the existence of strings within a body of content from a document with sub-linear performance and where the string to be found must be done so in order or their associated id not alphabetical order.

Preferably we would solve this in PHP and or JAVA

Could a trie or Knuth-Pratt-Morris or boyer-moore implementation or other similar algo help find these matches in sub-linear time and if so can you show me how.

Some more details

The list length could be millions of rows. Each string could contain characters (a-z0-9) and white space ie "stack overflow", "stackoverflow" Each String has a unique identifier (id) which is an integer. {"s":"stackoverflow", "#":"920001"} The strings matched or found should be found in order of their unique identifier. Also worth noting. The string list does not change frequently. The content does.

*Example

An array of strings (920001 unique strings) and 2 document examples. Check for the existence strings from our list with in the content. continue to find matches until 3 strings are found or until the list is exhausted. when a string is found in the content out the string in a new array matches[]

as you can see the string "stackoverflow" is long way down the list, at the end, but in example 2 we would only match strings and one of them is stackoverflow which would take quite a few seconds to match using a simple loop and match of the string array.

for the purpose of this please treat the list below as if it has 920001 rows and that the strings in rows between 12 and 920000 do not contain any matches.

** example list

"strings":[
    {"s":"Disney World", "#":"1"}, 
    {"s":"Universal Studios", "#":"2"}, 
    {"s":"Disneyland", "id":"3"}, 
    {"s":"Slide", "id":"4"}, 
    {"s":"Disneyland", "id":"5"}, 
    {"s":"Plane", "id":"6"}, 
    {"s":"Walt Disney World", "#":"7"}, 
    {"s":"Florida", "#":"8"}, 
    {"s":"Puerto Rico", "#":"9"}, 
    {"s":"Dominican Republic", "id":"10"}, 
    {"s":"Las Vegas", "#":"11"},
    {"s":"Mexico", "#":"12"}
    ....
    ....
    {"s":"United States", "#":"920000"}
    {"s":"stackoverflow", "#":"920001"}
]

** examples of content

content = "Bordered on the west by the Gulf of Mexico and on the east by the Atlantic Ocean, Florida has the longest coastline in the contiguous United States and its geography is dominated by water and the threat of frequent hurricanes. Whether you’re a native or just visiting stackoverflow"

content ="tourist attractions and amusement parks. Slide to the seaside hot spots and abundant nightlife, what you need to stay on top of all of the new developments in the Panhandle State today stackoverflow"

That is the problem as I see it.

Any reason the list can't be sorted alphabetically after updating and then when you run a query sort the results into identifier order? — Jaydee, Jul 10 '14 at 14:00
I have tried looping the array looking for matches ie something along the lines of function contains($input, array $referers) { foreach($referers as $referer) { if (stripos($input,$referer) !== false) { return true; } } return false; } if ( contains($referer, $valid_referers) ) { // contains } — Orcra, Jul 10 '14 at 14:00
@Jaydee my thoughts behind not sorting alphabetically after updating and then when you run a query sort the results into identifier order was that using that method you would have to run through the entire list every time. where as if it was traveresed in ID order you only search until you find the number of matches you require and thus save time. that was my theory anyway. happy to be proved an idiot — Orcra, Jul 10 '14 at 14:28
If the strings are sorted alphabetically, you can binary search the list. Sorting the few "needles" afterwards would be very quick. Its like searchiong a telephone directory for ten names and phone numbers and then sorting the results into phone number order. — Jaydee, Jul 10 '14 at 15:09

score 3 · Answer 1 · edited May 23 '17 at 12:24

3

Build a suffix tree of your contents (merge all the suffix trees for each content) and then search your strings in this suffix tree.

If you use Ukkonen's algorithm, it is linear (=O(n+m) where n is the size of your contents and m the size of your strings).

You can't achieve sublinear performance since you need to read everything at least once if it matches.

edited May 23 '17 at 12:24

Community

1
1

answered Jul 10 '14 at 14:02

Thomash

6,339
1
30
50

would building a suffix tree be efficient and quick when there a billions of unique documents? – Orcra Jul 10 '14 at 14:30

find multiple needles in haystack - String Searching

1 Answers1