
I have a book of 300,000+ words. Each word has metadata: grammatical information, parsing details, and lemmas (root forms).

What is the best way to structure the data so that I can search for words or groups of words and get results fast? I would like to be able to put requirements on the metadata in a search as well.

I need to be able to search for exact phrases, or just for words that are near one another.

My question is about database design and querying method.

– jcuenod

2 Answers


I would highly recommend the Rabin–Karp algorithm in this case. Although Rabin–Karp is not as fast as some other string-searching algorithms, it excels at matching multiple patterns at once, and since you said you will be searching for multiple phrases and lemmas, it is the best fit. Both the average and best cases are O(n + m), where n is the combined length of the 300,000 words and m is the total length of the patterns you are searching for. In the worst case you hit O(mn) time.

As far as storing the data goes, you would use a large rolling-hash table or, more ideally, a Bloom filter.
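Here is a minimal word-level sketch of the idea in Python (the sample text and patterns are placeholders): every pattern is hashed once, a rolling hash slides over the text one word at a time, and a direct comparison on each hash hit rules out collisions.

```python
# Word-level Rabin-Karp sketch for multiple phrase patterns.
# Assumes the book is already tokenized into a list of words;
# the sample data at the bottom is purely illustrative.

BASE = 257
MOD = (1 << 61) - 1  # large Mersenne prime keeps collisions rare

def word_hashes(words):
    """Map each word to an integer so the rolling hash can work per word.
    Python's hash() is consistent within a single run, which is all we need."""
    return [hash(w) % MOD for w in words]

def _hash_seq(words):
    """Polynomial hash of a whole word sequence (same scheme as the roll)."""
    h = 0
    for x in word_hashes(list(words)):
        h = (h * BASE + x) % MOD
    return h

def find_phrases(text_words, patterns):
    """Return {pattern: [start indices]} for every phrase pattern (a tuple of words)."""
    hits = {p: [] for p in patterns}
    # Group patterns by length: Rabin-Karp slides one window per length.
    by_len = {}
    for p in patterns:
        by_len.setdefault(len(p), {}).setdefault(_hash_seq(p), []).append(p)
    text = word_hashes(text_words)
    for k, table in by_len.items():
        if k > len(text):
            continue
        pow_k = pow(BASE, k - 1, MOD)
        h = 0
        for i in range(k):               # hash of the first window
            h = (h * BASE + text[i]) % MOD
        for i in range(len(text) - k + 1):
            for p in table.get(h, []):
                # Verify on a hash hit to rule out collisions.
                if tuple(text_words[i:i + k]) == p:
                    hits[p].append(i)
            if i + k < len(text):        # roll the window one word to the right
                h = ((h - text[i] * pow_k) * BASE + text[i + k]) % MOD
    return hits

words = "in the beginning was the word and the word was with god".split()
print(find_phrases(words, [("the", "word"), ("was", "with", "god")]))
```

A Bloom filter built over the pattern hashes could stand in for the dictionary lookup as a cheaper first-pass membership test; its occasional false positives are absorbed by the same verification step.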

Here are some related questions, articles, and implementations in C and Ruby. Hope this helps.

– Swift

A starting point would be to set up Lucene + Solr and index the data you have.

Here is a sample tutorial: http://lucene.apache.org/solr/tutorial.html
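As a rough sketch of what that could look like (the core name `book` and all field names here are assumptions, not anything the tutorial prescribes), you might index one document per word carrying its position and metadata, plus one document per verse or paragraph whose text field serves phrase and proximity queries via Solr's `"..."~N` operator:

```python
# Sketch of indexing into Solr through its JSON update API.
# The core name "book" and every field name below are assumptions;
# they would need matching definitions (or dynamic fields) in the schema.
import requests

SOLR = "http://localhost:8983/solr/book"

docs = [
    # One document per word, keeping a running position alongside the metadata.
    {"id": "w-1", "position": 1, "surface": "In", "lemma": "in", "pos": "PREP"},
    {"id": "w-2", "position": 2, "surface": "the", "lemma": "the", "pos": "ART"},
    # ...and one document per verse/paragraph for phrase and proximity search.
    {"id": "v-1", "type": "verse", "text": "In the beginning was the word"},
]

# Index the documents and commit in one request.
requests.post(f"{SOLR}/update?commit=true", json=docs).raise_for_status()

# Exact phrase on the text field.
r = requests.get(f"{SOLR}/select", params={"q": 'text:"the beginning"'})
print(r.json()["response"]["numFound"])

# Proximity: both words within five positions of each other.
r = requests.get(f"{SOLR}/select", params={"q": 'text:"beginning word"~5'})
print(r.json()["response"]["numFound"])

# Metadata requirements become an ordinary filter query on the word documents.
r = requests.get(f"{SOLR}/select", params={"q": "lemma:the", "fq": "pos:ART"})
print(r.json()["response"]["numFound"])
```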

– rkg
  • Indexing is, naturally, low-hanging fruit. Would you put each word in its own row, though (with its corresponding metadata)? Or would you put the words of a paragraph together? And how would you then link them to the metadata? Those are the kinds of questions I'm thinking about. – jcuenod Jun 28 '11 at 22:34
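For what it's worth, here is a minimal sketch of the one-row-per-word design the comment asks about, using SQLite with invented column names: each word keeps its running position, so an exact phrase becomes a self-join on adjacent positions and proximity becomes a bound on the position difference.

```python
# Sketch of the one-row-per-word design the comment asks about,
# using SQLite; the table and column names are illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE words (
        pos     INTEGER PRIMARY KEY,  -- running position in the book
        surface TEXT NOT NULL,        -- word as it appears
        lemma   TEXT NOT NULL,        -- root form
        parsing TEXT                  -- grammatical metadata
    )""")
db.executemany(
    "INSERT INTO words VALUES (?, ?, ?, ?)",
    [(1, "In", "in", "PREP"),
     (2, "the", "the", "ART"),
     (3, "beginning", "beginning", "NOUN"),
     (4, "was", "be", "VERB"),
     (5, "the", "the", "ART"),
     (6, "word", "word", "NOUN")],
)

# Exact phrase: adjacent positions via a self-join.
phrase = db.execute("""
    SELECT a.pos FROM words a
    JOIN words b ON b.pos = a.pos + 1
    WHERE a.lemma = 'the' AND b.lemma = 'word'
""").fetchall()

# Proximity: two lemmas within 4 words of each other, with a metadata filter.
near = db.execute("""
    SELECT a.pos, b.pos FROM words a
    JOIN words b ON ABS(b.pos - a.pos) <= 4 AND b.pos != a.pos
    WHERE a.lemma = 'beginning' AND b.lemma = 'word' AND b.parsing = 'NOUN'
""").fetchall()

print(phrase, near)
```

An index on lemma (and on whichever metadata columns you filter by) keeps these self-joins fast even at 300,000 rows.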