
I have a very large file (80 GB) containing one sentence per line. I want to check whether a user-given string matches a full sentence (line) in this file, ignoring spaces, hyphens, and case.

Right now the file is plain text and I am using grep, but it's taking a lot of time. What would be a better solution?

Example of contents of text file:

 appalachian
 rocky mountains
 andes
 sierra nevada
 long mountain ranges of the world

Example of search query:

rocky (no match)
sierra nevada (match found)
  • What are you searching for? Is it "words", or "letters" or "phrases"? – aghast Apr 26 '17 at 16:36
  • Do you care about the ordering of sentences within the file? – Pedro Castilho Apr 26 '17 at 16:37
  • Next: are you doing this one time? Hundreds of times in a loop? In response to a web request? – aghast Apr 26 '17 at 16:37
  • Quick thought: you could split the text in several blocks and run as many threads, so they can search simultaneously. – Right leg Apr 26 '17 at 16:38
  • Generally, `set` and `dict` data structures are most effective for membership tests like the one you propose. – blacksite Apr 26 '17 at 16:39
  • Although `set` won't work if you care about repeated sentences. You need to provide more info about what exactly you want to do in order to get an appropriate response. – Pedro Castilho Apr 26 '17 at 16:40
  • @AustinHastings Entire sentence, and a large number of times in a loop. –  Apr 26 '17 at 16:40
  • @PedroCastilho There are no repeated sentences. And I don't care about the ordering. –  Apr 26 '17 at 16:41

2 Answers


Based on your comment that you're searching for entire sentences:

Build an index of prefixes.

Sort the file. Then process it once to compute the prefix length needed to reduce a search to, say, 1000 sentences. That is, how many leading characters do you need to narrow any lookup down to about 1000 candidate sentences?

For example: "The" is probably a common starting word in English. But "The quick" is probably enough to get close to anything like "The quick brown fox ... etc.", because "q" is low-frequency.

One way to do this would be to put all prefixes up to a certain length (say, 40) into a `collections.Counter`. Find the maximum count at each length, and pick your length so that the max is <= 1000. There may be other ways. ;-)
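A minimal sketch of that counting pass; `choose_prefix_length` and its parameters are illustrative names, and for an 80 GB file you would likely sample lines, or test one candidate length per pass, rather than hold 40 counters in memory at once:

from collections import Counter

def choose_prefix_length(path, max_len=40, bucket_cap=1000):
    # One Counter per candidate length; counters[k-1] counts how many
    # sentences share each k-character prefix.
    counters = [Counter() for _ in range(max_len)]
    with open(path, encoding="utf-8") as f:
        for line in f:
            sentence = line.rstrip("\n")
            for k in range(1, max_len + 1):
                counters[k - 1][sentence[:k]] += 1
    # The shortest length whose largest bucket is small enough wins.
    for k, counter in enumerate(counters, start=1):
        if counter and counter.most_common(1)[0][1] <= bucket_cap:
            return k
    return max_len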

Now, process the file a second time. Build a separate index file consisting of the prefix length (in the file header) followed by prefix/offset pairs: all sentences that start with prefix K begin at offset V. Because the data file is sorted, the index will also be sorted.
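A sketch of that second pass, under the same assumptions (sorted file, illustrative names). Reading in binary keeps the offsets valid for a later seek(), and sentences are assumed not to contain tabs:

def build_index(data_path, index_path, prefix_len):
    # Record, for each distinct prefix, the byte offset of the first
    # sentence carrying it.
    with open(data_path, "rb") as data, open(index_path, "w", encoding="utf-8") as idx:
        idx.write(f"{prefix_len}\n")  # header: the prefix length
        last_prefix = None
        offset = 0
        for line in data:
            prefix = line.decode("utf-8").rstrip("\n")[:prefix_len]
            if prefix != last_prefix:
                idx.write(f"{prefix}\t{offset}\n")
                last_prefix = prefix
            offset += len(line)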

Your program can read the index into memory, open the data file, and start serving searches. For each search, take the query's prefix, look it up in the index, seek to the file offset, and scan for a match.
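The lookup side could look like this sketch. It assumes the query is already normalised the same way as the file and is at least prefix_len characters long (shorter queries would need a range scan, omitted here):

def load_index(index_path):
    with open(index_path, encoding="utf-8") as idx:
        prefix_len = int(idx.readline())
        index = {}
        for entry in idx:
            prefix, offset = entry.rstrip("\n").split("\t")
            index[prefix] = int(offset)
    return prefix_len, index

def search(data_path, index, prefix_len, query):
    offset = index.get(query[:prefix_len])
    if offset is None:
        return False
    with open(data_path, "rb") as data:
        data.seek(offset)
        for line in data:
            sentence = line.decode("utf-8").rstrip("\n")
            if not sentence.startswith(query[:prefix_len]):
                break  # sorted file: we have left this prefix's block
            if sentence == query:
                return True
    return False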

aghast

You can build a shardable database by mapping each sentence to a hash; at query time you hash the query and seek into your data at the candidate locations.

from collections import defaultdict
from io import StringIO

DATA = """appalachian
rocky mountains
andes
sierra nevada
long mountain ranges of the world"""


def normalize(sentence):
    # Ignore case, surrounding whitespace, spaces and hyphens,
    # as the question requires.
    return sentence.lower().strip().replace(" ", "").replace("-", "")


def create_db(inf):
    # Map each normalised sentence's hash to (offset, length) pairs
    # so we can later seek straight to the candidates.
    db = defaultdict(list)
    offset = 0
    for line in inf:
        l = len(line)
        db[hash(normalize(line))].append((offset, l))
        offset += l
    return db


def main():
    db = create_db(StringIO(DATA))
    # Save this db (e.g. as a pickle) and, in a different script, load it
    # to retrieve. Note: Python 3 salts hash() per process, so a persisted
    # db needs a stable digest instead (see the sketch below).
    for needle in ["rocky", "sierra nevada"]:
        key = hash(normalize(needle))
        for offset, length in db.get(key, []):
            print("possibly found at", offset, length)


if __name__ == '__main__':
    main()

This demonstrates the idea: you build a database (stored as a pickle, for example) of all normalised search keys mapped to the locations where they are found. Then you can quickly retrieve the offset and length, seek to that position in the real file, and make a proper ==-based comparison.
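The retrieval side could look like the following sketch. `verify` and `stable_hash` are illustrative names, it reuses `normalize()` from above, and it assumes offsets and lengths were recorded in bytes (true for the ASCII demo data). Because the built-in hash() is salted per process in Python 3, a db that is pickled and reloaded elsewhere needs a stable digest, and create_db would then key on the same digest:

import hashlib

def stable_hash(sentence):
    # hashlib digests, unlike hash(), are identical across processes,
    # so they survive pickling the db and loading it in another script.
    return hashlib.sha1(normalize(sentence).encode("utf-8")).hexdigest()

def verify(path, db, needle):
    # Seek to each candidate location and confirm with a real equality
    # check; hash collisions make this final comparison necessary.
    key = stable_hash(needle)
    with open(path, "rb") as f:
        for offset, length in db.get(key, []):
            f.seek(offset)
            candidate = f.read(length).decode("utf-8")
            if normalize(candidate) == normalize(needle):
                return True
    return False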

deets