
I have a text file with some content. I need to search this content frequently. I have the following two options; which one is best (in terms of faster execution)?

METHOD 1:

def search_list(search_string):
    if search_string in li:
        print "found at line ", li.index(search_string) + 1

if __name__ == "__main__":
    f = open("input.txt", "r")
    li = []
    for i in f.readlines():
        li.append(i.rstrip("\n"))
    search_list("appendix")

METHOD 2:

def search_dict(search_string):
    if d.has_key(search_string):
        print "found at line ", d[search_string]

if __name__ == "__main__":
    f = open("input.txt", "r")
    d = {}
    for i, j in zip(range(1, len(f.readlines())), f.readlines()):
        d[j.rstrip("\n")] = i
    search_dict("appendix")
Vivek S

5 Answers


If you search really frequently, then the second method will be faster (you've built something like an index).

Just adapt it a little bit:

def search_dict(d, search_string):
    line = d.get(search_string)
    if line:
        print "found at line {}".format(line)
    else:
        print "string not found"

d = {}
with open("input.txt", "r") as f:
    for i, word in enumerate(f, 1):  # iterate the file object directly; no need for readlines()
        d[word.rstrip()] = i
search_dict(d, "appendix")
eumiro

For frequent searching, a dictionary is definitely better (provided you have enough memory to store the line numbers as well), since the keys are hashed and looked up in O(1) time. However, your implementation won't work: the first f.readlines() will exhaust the file object, so the second f.readlines() won't read anything.

What you're looking for is enumerate:

with open('data') as f:
    d = dict((j.rstrip('\n'), i) for i, j in enumerate(f, 1))

It should also be pointed out that in both cases, the function which does the searching will be faster if you use try/except, provided the string you're looking for is typically found. (In the first case it might be faster anyway: in is an O(n) operation and so is .index for a list, so checking membership first does the linear scan twice.)

e.g.:

def search_dict(d, search_string):
    try:
        print "found at line {0}".format(d[search_string])
    except KeyError:
        print "string not found"

or for the list:

def search_list(li, search_string):
    try:
        print "found at line {0}".format(li.index(search_string) + 1)
    except ValueError:
        print "string not found"
mgilson

I'm posting this after reading the answers of eumiro and mgilson.

If you compare your two methods on the command line, I think you'll find that the first one is faster. The other answers say the second method is faster, but they are based on the premise that you'll do several searches on the file after you've built your index. If you run the scripts as-is from the command line, you will not.

The building of the index is slower than just searching for the string directly, but once you've built the index, searches can be done very quickly, making up for the time spent building it. That extra time is wasted if you only search once, because when the program exits, the index is discarded and has to be rebuilt on the next run. You need to keep the index in memory between queries for this to pay off.

There are several ways of doing this; one is making a daemon hold the index and using a front-end script to query it. Searching for something like python daemon client communication on Google will give you pointers on implementing this. The sketch below shows the core idea.
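
A minimal sketch of that core idea, with no real daemon machinery: one long-running process builds the index once, then answers any number of queries read from stdin (the file name and the stdin "protocol" are assumptions; a real daemon would listen on a socket instead):

import sys

# build the index once
d = {}
with open("input.txt") as f:
    for i, line in enumerate(f, 1):
        d[line.rstrip("\n")] = i

# serve repeated queries from the same process, so the index is reused
for query in iter(sys.stdin.readline, ""):
    line = d.get(query.rstrip("\n"))
    if line:
        print "found at line {0}".format(line)
    else:
        print "string not found"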

Lauritz V. Thaulow
  • Interesting point. I saw "frequent searching" in the title and assumed the index would be kept in memory in between ... I still hold that the OP should definitely use `try/except` (especially with the second method). – mgilson Sep 17 '12 at 12:36
  • For a 1-time search on a given term, it might be faster to use: `def is_term(key, idx, value): if value.rstrip('\n') == key: print idx; return True; else: return False` and then use it as: `any(is_term('appendix', i, j) for i, j in enumerate(f, 1))` – mgilson Sep 17 '12 at 12:40

The first one is O(n); the second one is O(1), but it requires looking up an exact key. I'd pick the second one.
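
To make the O(n) vs. O(1) difference concrete, here is a rough timeit sketch on synthetic data (the 100,000-line list and the search string are assumptions; absolute numbers will vary by machine):

import timeit

setup = """
lines = ['line %d' % i for i in xrange(100000)]
d = dict((line, i + 1) for i, line in enumerate(lines))
"""

# membership test scans the whole list: O(n)
print timeit.timeit("'line 99999' in lines", setup=setup, number=1000)
# dictionary lookup hashes the key: O(1)
print timeit.timeit("'line 99999' in d", setup=setup, number=1000)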

Neither one will work if you're doing ad hoc searches in the document. For that you'll need to parse and index it using something like Lucene.

duffymo

Another option to throw in is using the FTS (full-text search) support provided by SQLite3... (untested, and assuming you're looking for whole words, not substrings of words or other such things)

import sqlite3

# create db and table
db = sqlite3.connect(':memory:')  # replace with a filename for an on-disk db
db.execute('create virtual table somedata using fts4(line)')

look_for = 'somestring'

# insert the data
with open('yourfile.txt') as fin:
    for lineno, line in enumerate(fin, 1):
        # You could put in a check while loading, I guess...
        if look_for in line:
            print lineno  # or whatever....
        # put row into FTS table
        db.execute('insert into somedata (line) values (?)', (line,))
    # or, instead of the loop above, possibly more efficient:
    # db.executemany('insert into somedata (line) values (?)',
    #                ((line,) for line in fin))
db.commit()

matches = db.execute('select rowid from somedata where line match ?', (look_for,))
print '{0} is on lines: {1}'.format(look_for, ', '.join(str(match[0]) for match in matches))

If you only want the first matching line, add limit 1 to the end of the query.

You could also look at using mmap to map the file, then use the .find method to get the earliest offset of the string. Then, assuming the result isn't -1 (which would mean not found), say it's 123456, mapped_file[:123456].count('\n') + 1 gives the line number; see the sketch below.
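
A sketch of that mmap approach (untested; the file name and search string are placeholders, and it assumes the file is non-empty):

import mmap

with open("yourfile.txt", "rb") as f:
    mapped_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    offset = mapped_file.find("somestring")
    if offset == -1:
        print "string not found"
    else:
        # count the newlines before the hit to turn a byte offset into a line number
        print "found at line", mapped_file[:offset].count("\n") + 1
    mapped_file.close()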

Jon Clements