I have a large database (~ 2 TB in raw uncompressed text) of N-grams. Example lines from a 4-gram look like:
cat in the cradle 2
cat in the hat 187
cat in the window 32
i.e. 4 strings text with a single (possibly large) integer (w1,w2,w3,w4,c)
. I've managed to put the data in a database with indexing on [w1,w2,w3]
. Lookups where the first words match a given query and the last is wild:
SELECT * FROM db WHERE (w1="cat" AND w2="in" AND w3="the")
are quite quick. I am interested in both that query and one where the first word is wild:
SELECT * FROM db WHERE (w2="in" AND w3="the" AND w4="hat")
No matter how I seem to design an index or database, the query is slow or the database size balloons to something extreme. In addition, building an index takes days on my computer so experimentation has been slow. I am looking for suggestions on how to manage such a query. I don't think I have enough hard-drive space to build an index for both [w1,w2,w3]
and [w2,w3,w4]
, so any answer should try to fit within these constraints.