Lookup strategy for database of N-grams

Question

I have a large database (~ 2 TB in raw uncompressed text) of N-grams. Example lines from a 4-gram look like:

cat in the cradle 2
cat in the hat 187
cat in the window 32

i.e. 4 strings text with a single (possibly large) integer (w1,w2,w3,w4,c). I've managed to put the data in a database with indexing on [w1,w2,w3]. Lookups where the first words match a given query and the last is wild:

SELECT * FROM db WHERE (w1="cat" AND w2="in" AND w3="the")

are quite quick. I am interested in both that query and one where the first word is wild:

SELECT * FROM db WHERE (w2="in" AND w3="the" AND w4="hat")

No matter how I seem to design an index or database, the query is slow or the database size balloons to something extreme. In addition, building an index takes days on my computer so experimentation has been slow. I am looking for suggestions on how to manage such a query. I don't think I have enough hard-drive space to build an index for both [w1,w2,w3] and [w2,w3,w4], so any answer should try to fit within these constraints.

score 4 · Accepted Answer · answered Nov 24 '11 at 03:08

You might consider splitting out the words into a separate table, e.g.

CREATE TABLE word
  ( id INT PRIMARY KEY
  , text VARCHAR(32) NOT NULL UNIQUE
  )

Only a single copy of each unique word's characters are stored, with potential savings in disk space (only "potential" depending upon the average word length). More importantly, there would now only be a single string-based index that could be used for all words, regardless of their position in the N-gram. The N-grams would reference the words by their primary key identifiers instead of by their text:

CREATE TABLE ngram
   ( id INT PRIMARY KEY
   , w1Id INT FOREIGN KEY REFERENCES word(id)
   , w2Id INT FOREIGN KEY REFERENCES word(id)
   , w3Id INT FOREIGN KEY REFERENCES word(id)
   , w4Id INT FOREIGN KEY REFERENCES word(id)
   , n INT NOT NULL
   )

All of the foreign key indices would be integer-based instead of string-based.

Queries would be expressed something like this:

SELECT w1.text, w2.text, w3.text, w4.text, ng.n
FROM ngram AS ng
INNER JOIN word AS w1 ON w1.id = ng.w1Id
INNER JOIN word AS w2 ON w2.id = ng.w2Id AND w2.text = 'in'
INNER JOIN word AS w3 ON w3.id = ng.w3Id AND w2.text = 'the'
INNER JOIN word AS w4 ON w4.id = ng.w4Id AND w2.text = 'hat'

+1. Its funny that we were solving similar problems with similar methods at about the same time, isn't it (http://stackoverflow.com/questions/8247005/efficiently-working-with-and-generating-large-text-files)? — Leonid Shifrin, Nov 27 '11 at 17:16
@Leonid I had to double-take to see if was the same poster. Perhaps the posters were double-taking if it were the same responder? ;) — WReach, Nov 27 '11 at 18:30

score 2 · Answer 2 · answered Nov 25 '11 at 11:52

From the MySQL manual:

If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to find rows. For example, if you have a three-column index on (col1, col2, col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2, col3).

MySQL cannot use an index if the columns do not form a leftmost prefix of the index. Suppose that you have the SELECT statements shown here:

So you can try to make an index with all four column (w1, w2, w3, w4) and then change your second query in something like this:

SELECT * FROM db WHERE (w1 IS NOT NULL AND w2="in" AND w3="the" AND w4="hat")

This should use the index, but of course it works only if you do not have n-grams with w1 set to NULL. (note that an empty string like '' is not Null)

Anyway I suggest to try with the EXPLAIN command to check it out.

score 1 · Answer 3 · answered Nov 23 '11 at 20:16

If you can't predict the access pattern, or if you have to accommodate a number of arbitrary access patterns, single-column indexes are probably a better choice. Testing will tell; try testing on a subset of the data on a development computer.

If you build an index on the combination of four columns, {w1, w2, w3, w4}, then any query that omits column w1 from the WHERE clause will probably not use the index. The values "cat in the hat", "man in the hat", and "where in the hat" would all be widely separated in the composite index.

Your dbms, no matter which one, will give you some way to see what the query optimizer is doing.

score 0 · Answer 4 · answered Nov 23 '11 at 18:00

Create a composite index on (w2, w3). Use queries with WHERE clause that compares w2 and w3 in the index order and then use other non-indexed comparisons.

SELECT * FROM db WHERE (w2="in" AND w3="the" AND w1="cat") 
SELECT * FROM db WHERE (w2="in" AND w3="the" AND w4="hat")

Lookup strategy for database of N-grams

4 Answers4