Question: How can I check which strings from a list exist within the content of a document with sub-linear performance, where the strings must be found in order of their associated id, not in alphabetical order?
Preferably we would solve this in PHP and/or Java.
Could a trie, or a Knuth-Morris-Pratt, Boyer-Moore or other similar algorithm, help find these matches in sub-linear time? If so, can you show me how?
Some more details
The list could be millions of rows long.
Each string can contain the characters a-z and 0-9 plus white space, e.g. "stack overflow", "stackoverflow".
Each string has a unique identifier (id), which is an integer: {"s":"stackoverflow", "#":"920001"}
The matched strings should be found in order of their unique identifier.
Also worth noting: the string list does not change frequently. The content does.
** Example
Given an array of strings (920001 unique strings) and 2 example documents, check for the existence of strings from our list within the content. Continue to find matches until 3 strings are found or until the list is exhausted. When a string is found in the content, put the string in a new array matches[].
As you can see, the string "stackoverflow" is a long way down the list, at the very end. In example 2 we would match only a few strings, one of which is stackoverflow, and reaching it with a simple loop over the string array would take quite a few seconds.
For the purpose of this question, please treat the list below as if it has 920001 rows, and assume that the strings in rows between 12 and 920000 do not contain any matches.
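For reference, the simple loop described above could be sketched in Java roughly as follows. The names `NaiveMatcher`, `Entry` and `firstMatches` are made up for illustration, and the sketch assumes the list is already sorted by id and that matching is a case-sensitive substring test.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveMatcher {
    /** One row of the list: the string and its integer id (hypothetical type). */
    static final class Entry {
        final int id;
        final String s;
        Entry(int id, String s) { this.id = id; this.s = s; }
    }

    /**
     * Walks the list in id order and collects strings that occur in the content,
     * stopping after `limit` matches or when the list is exhausted.
     * Cost grows with the number of rows, which is the problem being asked about.
     */
    static List<String> firstMatches(List<Entry> entriesSortedById, String content, int limit) {
        List<String> matches = new ArrayList<>();
        for (Entry e : entriesSortedById) {
            if (matches.size() >= limit) {
                break;
            }
            if (content.contains(e.s)) { // assumption: case-sensitive substring test
                matches.add(e.s);
            }
        }
        return matches;
    }
}
```

Calling `NaiveMatcher.firstMatches(list, content, 3)` on the second example content has to scan every row before it reaches "stackoverflow", because fewer than 3 strings match.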
** example list
"strings":[
{"s":"Disney World", "#":"1"},
{"s":"Universal Studios", "#":"2"},
{"s":"Disneyland", "#":"3"},
{"s":"Slide", "#":"4"},
{"s":"Disneyland", "#":"5"},
{"s":"Plane", "#":"6"},
{"s":"Walt Disney World", "#":"7"},
{"s":"Florida", "#":"8"},
{"s":"Puerto Rico", "#":"9"},
{"s":"Dominican Republic", "#":"10"},
{"s":"Las Vegas", "#":"11"},
{"s":"Mexico", "#":"12"}
....
....
{"s":"United States", "#":"920000"},
{"s":"stackoverflow", "#":"920001"}
]
** examples of content
content = "Bordered on the west by the Gulf of Mexico and on the east by the Atlantic Ocean, Florida has the longest coastline in the contiguous United States and its geography is dominated by water and the threat of frequent hurricanes. Whether you’re a native or just visiting stackoverflow"
content = "tourist attractions and amusement parks. Slide to the seaside hot spots and abundant nightlife, what you need to stay on top of all of the new developments in the Panhandle State today stackoverflow"
That is the problem as I see it.
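To make the trie idea concrete, here is a minimal Java sketch of one possible direction, not a definitive solution. It assumes the whole list fits in memory, that matching is a case-sensitive substring test, and that duplicate strings are reported once under their smallest id; the class and method names (`TrieMatcher`, `add`, `firstMatches`) are invented for illustration. The trie is built once from the list, then each content is scanned from every start position, so the work per document depends on the content length and the longest string rather than on the number of rows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TrieMatcher {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        int id = -1; // smallest id of a list string ending at this node, or -1 if none
        String s;    // the list string that ends here
    }

    private final Node root = new Node();

    /** Adds one row to the trie; done once up front since the list rarely changes. */
    void add(int id, String s) {
        Node n = root;
        for (int i = 0; i < s.length(); i++) {
            n = n.next.computeIfAbsent(s.charAt(i), c -> new Node());
        }
        if (n.id == -1 || id < n.id) { // assumption: duplicates keep the smallest id
            n.id = id;
            n.s = s;
        }
    }

    /** Returns up to `limit` matched strings, ordered by id rather than by position. */
    List<String> firstMatches(String content, int limit) {
        TreeMap<Integer, String> found = new TreeMap<>(); // keeps matches sorted by id
        for (int start = 0; start < content.length(); start++) {
            Node n = root;
            // Follow the trie for as long as the content keeps matching some list string.
            for (int i = start; i < content.length(); i++) {
                n = n.next.get(content.charAt(i));
                if (n == null) {
                    break;
                }
                if (n.id != -1) {
                    found.put(n.id, n.s);
                }
            }
        }
        List<String> matches = new ArrayList<>();
        for (String s : found.values()) {
            if (matches.size() == limit) {
                break;
            }
            matches.add(s);
        }
        return matches;
    }
}
```

Because every match in the content is collected before the lowest ids are picked, this sketch cannot stop early after 3 matches the way the simple loop does. An Aho-Corasick automaton would remove the restart at each position, but it is not shown here.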