
My task is to search for a string or a pattern in a list of documents that are very short (say 200 characters long). However, there are around 1 million such documents. What is the most efficient way to perform this search? I was thinking of tokenizing each document and putting the words in a hashtable, with the word as key and the document number as value, thereby creating a bag of words. I would then perform the word search and retrieve the list of documents that contain this word. From what I can see, this takes O(n) operations. Is there any other way, maybe without using hashtables?
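
Roughly what I have in mind, as a toy sketch (in reality the documents would be read from files; the data here is just for illustration):

from collections import defaultdict

# Toy stand-in for the real corpus of ~200-character documents.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "a million short documents like this one",
    "searching for a pattern in many documents",
]

index = defaultdict(set)            # word -> set of document numbers
for doc_id, text in enumerate(docs):
    for word in text.split():
        index[word].add(doc_id)

# Look up the documents that contain a given word.
print(sorted(index.get("documents", set())))   # -> [1, 2]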

Also, is there a Python library or third-party package that can perform efficient searches?

Rkz

4 Answers


Since you are looking for a library, have you taken a look at PyLucene?

http://lucene.apache.org/pylucene/features.html

While Lucene usually implements ranked retrieval (matching based on a relative score) rather than exact matching, it can be used for exact phrase searching. Here's a link showing how to use Lucene to search for an exact phrase. It's written in Java, but it gives the idea:

Exact Phrase search using Lucene?

Your question asked specifically about efficiency. Efficiency in what way? I assume you mean the fastest look-up time for the user. If that is what you're after, there is no faster approach than indexing all words in the documents, provided you are willing to pay the up-front cost of indexing the whole corpus. This is usually the logical choice, since indexing is a one-time event while user searches are a frequent occurrence. It does, however, come with considerably larger memory usage.

If you are instead talking about efficiency in terms of memory usage, you would loop over all documents and run a regex search on each one. You would also use that method to avoid the up-front indexing time, though, again, that is unlikely to be the limiting factor given a large corpus and the fact that the concern is usually satisfying a user who will make multiple queries.

The only other thing I would point out is that, since you mentioned you are searching for patterns and not just words, indexing only the words won't help when querying for patterns (unless the pattern happens to be one of the words in the document!).

If you aren't going to use Lucene and instead want to implement this on your own, take a look at indexing using inverted indices. Here is an excellent explanation of how to create inverted indices if you are looking to do phrasal queries:

http://www.searchenginepeople.com/blog/how-search-really-works-the-index-2.html
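
To give a flavour of the idea (this is not Lucene, just a toy sketch with made-up data), a positional inverted index is what makes phrase queries possible:

from collections import defaultdict

# Toy corpus; the real one would be your million short documents.
docs = [
    "to be or not to be",
    "not to worry",
    "be here now",
]

# Positional inverted index: word -> {doc_id: [positions]}
index = defaultdict(lambda: defaultdict(list))
for doc_id, text in enumerate(docs):
    for pos, word in enumerate(text.split()):
        index[word][doc_id].append(pos)

def phrase_search(phrase):
    """Return doc ids containing the exact phrase, using the positional index."""
    words = phrase.split()
    if not words or words[0] not in index:
        return set()
    # Start with candidate (doc, position) pairs for the first word...
    candidates = {(d, p) for d, ps in index[words[0]].items() for p in ps}
    # ...and keep only those where each following word appears at the next position.
    for offset, word in enumerate(words[1:], start=1):
        postings = index.get(word, {})
        candidates = {(d, p) for d, p in candidates
                      if p + offset in postings.get(d, [])}
    return {d for d, _ in candidates}

print(sorted(phrase_search("not to")))   # -> [0, 1]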

Community
  • thanks a lot for the explanation. I was talking about efficiency in terms of look-up time and not memory. I will try PyLucene and see how it works. – Rkz Oct 28 '12 at 23:45

Most search engines work on the principle of inverted indexes. Basically, for each token (word, trigram, etc.) you store a sorted list of the documents that contain that token. When matching a query, you merge-join the lists of all the required tokens to produce the list of candidate documents. If an index match doesn't guarantee a query match, the query expression must be re-tested on each candidate document.
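
For illustration, here is a toy sketch of the merge-join step on two already-sorted posting lists (the lists and ids are made up):

def intersect(postings_a, postings_b):
    """Merge-join two sorted lists of document ids."""
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

# Documents containing both tokens = intersection of their posting lists.
print(intersect([2, 5, 9, 14, 21], [1, 2, 9, 21, 30]))   # -> [2, 9, 21]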

There are many solutions for storing the inverted index, and some of them (Lucene, Sphinx, PostgreSQL FTS) already support evaluating expressions against the inverted index.

The magic of search engines happens mostly in preprocessing and tokenizing the documents and in generating search queries from user requests. Preprocessing tricks include canonicalizing words by stemming them and storing multiple different representations per word. For query construction, you may want to do things like synonym replacement. Regular expressions are a bit trickier, but there is a great talk about implementing index support for regular-expression searches in PostgreSQL.
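
As a deliberately naive illustration of the canonicalization step, something along these lines (a real system would use a proper stemmer such as Porter's):

import re

def canonicalize(token):
    """Very naive canonicalization: lowercase and strip a few common suffixes."""
    token = token.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def tokenize(text):
    """Split text into canonicalized word tokens."""
    return [canonicalize(t) for t in re.findall(r"\w+", text)]

print(tokenize("Searching the searched searches"))   # -> ['search', 'the', 'search', 'search']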

Ants Aasma

While your idea of using hashtables to build a bag of words sounds fun, I think that by the time you open each file, read it into memory, tokenize it, build a hashtable, put each token into the hashtable, hash the search term, and then index into your hashtable to find the document ids of every doc that contains the word, you've spent far more time than you would by just using a regular expression and searching each file:

import re
import os
import sys

# Usage (hypothetical): python search.py <searchterm> <directory>
searchterm = sys.argv[1]
searchexp = re.compile("(%s)" % searchterm, re.M)

# Scan every file in the directory and print the names of those that match.
for filename in os.listdir(sys.argv[2]):
    with open(os.path.join(sys.argv[2], filename), 'r') as f:
        contents = f.read()
    if searchexp.search(contents):
        print(filename)

Is that too slow?

joe
  • thanks. I know about the regex, but I wanted to get further input to evaluate my thought process. I will try both and see which one performs better. – Rkz Oct 28 '12 at 05:03

I don't think there is a better solution to this problem than the one described here by Russ Cox, which he developed for the sadly-decommissioned Google Code Search engine.
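
To give a rough feel for the trigram-index idea (this is only a toy sketch for a literal substring query; Cox's article shows how to derive trigram queries from full regular expressions):

from collections import defaultdict
import re

docs = [
    "google code search indexed regexes with trigrams",
    "a trigram is a three character substring",
    "nothing to see here",
]

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

# Trigram index: trigram -> set of doc ids containing it.
index = defaultdict(set)
for doc_id, text in enumerate(docs):
    for tri in trigrams(text):
        index[tri].add(doc_id)

def search(literal):
    """Use the trigram index to find candidate docs, then confirm with a regex."""
    candidate_sets = [index.get(t, set()) for t in trigrams(literal)]
    candidates = set.intersection(*candidate_sets) if candidate_sets else set(range(len(docs)))
    return [d for d in sorted(candidates) if re.search(re.escape(literal), docs[d])]

print(search("trigram"))   # -> [0, 1]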

rici