Due to the reasons outlined in this question I am building my own client side search engine rather than using the ydn-full-text
library which is based on fullproof
. What it boils down to is that fullproof
spawns "too freaking many records" in the order of 300.000 records whilst (after stemming) there are only about 7700 unique words. So my 'theory' is that fullproof is based on traditional assumptions which only apply to the server side:
- Huge indices are fine
- Processor power is expensive
- (and the assumption of dealing with longer records which is just applicable to my case as my records are on average 24 words only1)
Whereas on the client side:
- Huge indices take ages to populate
- Processing power is still limited, but relatively cheaper than on the server side
Based on these assumptions I started of with an elementary inverted index (giving just 7700 records as IndexedDB
is a document/nosql database). This inverted index has been stemmed using the Lancaster stemmer (most aggressive one of the two or three popular ones) and during a search I would retrieve the index for each of the words, assign a score based on overlap of the different indices and on similarity of typed word vs original (Jaro-Winkler distance).
Problem of this approach:
- Combination of "popular_word + popular_word" is extremely expensive
So, finally getting to my question: How can I alleviate the above problem with a minimal growth of the index? I do understand that my approach will be CPU intensive, but as a traditional full text search index seems unusably big this seems to be the only reasonable road to go down on. (Pointing me to good resources or works is also appreciated)
1 This is a more or less artificial splitting of unstructured texts into small segments, however this artificial splitting is standardized in the relevant field so has been used here as well. I have not studied the effect on the index size of keeping these 'snippets' together and throwing huge chunks of texts at fullproof
. I assume that this would not make a huge difference, but if I am mistaken then please do point this out.