
Due to the reasons outlined in this question I am building my own client-side search engine rather than using the ydn-full-text library, which is based on fullproof. What it boils down to is that fullproof spawns "too freaking many records", on the order of 300,000 records, whilst (after stemming) there are only about 7,700 unique words. So my 'theory' is that fullproof is based on traditional assumptions which only apply on the server side:

  • Huge indices are fine
  • Processor power is expensive
  • (and the assumption of dealing with longer records, which is just not applicable to my case, as my records are on average only 24 words1)

Whereas on the client side:

  • Huge indices take ages to populate
  • Processing power is still limited, but relatively cheaper than on the server side

Based on these assumptions I started off with an elementary inverted index (giving just 7,700 records, as IndexedDB is a document/NoSQL database). This inverted index has been stemmed using the Lancaster stemmer (the most aggressive of the two or three popular ones), and during a search I retrieve the index entry for each of the words, assign a score based on the overlap of the different indices, and on the similarity of the typed word vs. the original (Jaro-Winkler distance).
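For illustration, the typed-word-vs-original scoring step could be sketched roughly as follows; `jaroWinkler` is a hypothetical helper name, and this is just a minimal, unoptimized implementation of the standard Jaro-Winkler similarity, not the exact code used here:

```javascript
// Minimal sketch: Jaro-Winkler similarity (1 = identical, 0 = no match),
// usable to score a typed word against a stemmed index term.
function jaroWinkler(s1, s2) {
  if (s1 === s2) return 1;
  const len1 = s1.length, len2 = s2.length;
  if (!len1 || !len2) return 0;
  // Characters count as matching when within this window of each other
  const window = Math.max(Math.floor(Math.max(len1, len2) / 2) - 1, 0);
  const match1 = new Array(len1).fill(false);
  const match2 = new Array(len2).fill(false);
  let matches = 0;
  for (let i = 0; i < len1; i++) {
    const lo = Math.max(0, i - window);
    const hi = Math.min(i + window + 1, len2);
    for (let j = lo; j < hi; j++) {
      if (!match2[j] && s1[i] === s2[j]) {
        match1[i] = match2[j] = true;
        matches++;
        break;
      }
    }
  }
  if (!matches) return 0;
  // Count transpositions among the matched characters
  let k = 0, transpositions = 0;
  for (let i = 0; i < len1; i++) {
    if (!match1[i]) continue;
    while (!match2[k]) k++;
    if (s1[i] !== s2[k]) transpositions++;
    k++;
  }
  const jaro = (matches / len1 + matches / len2 +
                (matches - transpositions / 2) / matches) / 3;
  // Winkler boost: reward a shared prefix of up to 4 characters
  let prefix = 0;
  while (prefix < 4 && prefix < len1 && prefix < len2 &&
         s1[prefix] === s2[prefix]) prefix++;
  return jaro + prefix * 0.1 * (1 - jaro);
}
```

For example, `jaroWinkler("martha", "marhta")` gives roughly 0.961, the textbook value for that pair.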

Problem of this approach:

  • Combination of "popular_word + popular_word" is extremely expensive
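To make the cost concrete, here is an illustrative sketch (not the actual code in question) of intersecting two sorted posting lists of record IDs; `intersectPostings` is a hypothetical helper name. The work is proportional to the combined list lengths, which is why two popular words with long lists dominate query time, and why processing the rarest word's list first keeps the candidate set small:

```javascript
// Illustrative: merge-style intersection of two sorted posting lists.
// Cost is O(|a| + |b|), so popular_word + popular_word is the worst case.
function intersectPostings(a, b) {
  const result = [];
  let i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      result.push(a[i]);
      i++;
      j++;
    } else if (a[i] < b[j]) {
      i++;
    } else {
      j++;
    }
  }
  return result;
}
```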

So, finally getting to my question: how can I alleviate the above problem with minimal growth of the index? I do understand that my approach will be CPU-intensive, but as a traditional full-text search index seems unusably big, this seems to be the only reasonable road to go down. (Pointers to good resources or relevant work are also appreciated.)

1 This is a more or less artificial splitting of unstructured texts into small segments; however, this artificial splitting is standardized in the relevant field, so it has been used here as well. I have not studied the effect on index size of keeping these 'snippets' together and throwing huge chunks of text at fullproof. I assume this would not make a huge difference, but if I am mistaken then please do point this out.

David Mulder

1 Answer


This is a great question, thanks for bringing some quality to the IndexedDB tag.

While this answer isn't quite production ready, I wanted to let you know that if you launch Chrome with --enable-experimental-web-platform-features then there should be a couple features available that might help you achieve what you're looking to do.

  • IDBObjectStore.openKeyCursor() - value-free cursors, in case you can get away with the stem only
  • IDBCursor.continuePrimaryKey(key, primaryKey) - allows you to skip over items with the same key

I was informed of these by an IDB developer on the Chrome team, and while I've yet to experiment with them myself, this seems like the perfect use case.

My thought is that if you approach this problem with two different indexes on the same column, you might be able to get that join-like behavior you're looking for without bloating your stores with gratuitous indexes.

While consecutive writes are pretty terrible in IDB, reads are great. Good performance across 7700 entries should be quite tenable.

buley
  • `openKeyCursor` and `continuePrimaryKey` awesome! – Kyaw Tun Mar 12 '14 at 13:14
  • Is `continuePrimaryKey` something that can only be used in Chrome? Another way is just opening the `openKeyCursor` with a nextunique or prevunique direction – Kristof Degrave Mar 12 '14 at 15:02
  • I've not yet fully wrapped my head around it, but I think `continuePrimaryKey` allows you to use two different indexes at the same time: the index being queried and the "primary" index. – buley Mar 12 '14 at 16:17
  • 1
    usecase for `continuePrimaryKey` is discuss here https://www.w3.org/Bugs/Public/show_bug.cgi?id=20257 for openKeyCursor http://lists.w3.org/Archives/Public/public-webapps/2012OctDec/0465.html – Kyaw Tun Mar 12 '14 at 21:52