Make list lookup faster?

Question

I have a file that every line is containing names of persons and a file containing text of speeches. The file with the names is very big(250k lines) ordered alphabetically, the speeches file has around 1k lines. What I want to do is a lookup for the names in my text file and do replacements for every occurring name from my names file. This is my code EDIT: The with function that opens the list is executed only one time.

members_list = []
with open(path, 'r') as l:
    for line in l.readlines():
        members_list.append(line.strip('\n'))

for member in self.members_list:
    if member in self.body:
        self.body = self.body.replace(member, '<member>' + member + '</member>')

This code takes about 2.2 seconds to run, but because I have many speech files (4.5k) the total time is around 3 hours. Is it possible to make this faster? Are generators the way to go?

each time you do `replace(member, '' + member + '')` an entirely new string is created. It's probably better to accumulate into a list and join at the end. — juanpa.arrivillaga, Dec 17 '17 at 19:16
Lose the readlines... Just use `for line in l:`. Also both loops should be combined — OneCricketeer, Dec 17 '17 at 19:29
The code that opens the list is run only one time, the time consuming function is the lookup. — kpapadop, Dec 17 '17 at 19:35
I'm assuming that names are not always one word. It's not as easy as just using a set to check each word as you stream through the speech. I was thinking store the names in a trie, and use backtracking as you stream through the speech, but I'm not sure if I have the expertise to write it up in a good answer. — Kenny Ostrom, Dec 17 '17 at 19:59
The problem with the OP's code is that it re-reads each speech in memory 250,000 times. You don't want to loop through names; you want to be able to stream the speech once and use linear time lookup of the names (which set would provide, if they are easy to tokenize, if for example you know how many words are in a name). — Kenny Ostrom, Dec 17 '17 at 20:02
@StefanPochmann it was more of an answer to the title, and I didn't read the code carefully. There is no list-lookup. But I do think looping over a tokenized `body` and using a set-based lookup would be much faster than running `.replace` in a loop — juanpa.arrivillaga, Dec 17 '17 at 20:21
Each name can be in the form of: name surname but there are some names that are in the form of name surname fathers_name — kpapadop, Dec 17 '17 at 20:23
Doesn't your code replace `Tiger Woods` with `Tiger Woods` if you have the name `Tiger Wood`? — Stefan Pochmann, Dec 17 '17 at 20:34

Kenny Ostrom · Accepted Answer · 2017-12-17T21:34:32.367

Currently, you re-read each speech in memory once for each of the 250,000 names when you check "if member in self.body".

You need to parse the speech body once, finding whole words, spaces, and punctuation. Then you need to see if you have found a name, using a linear time lookup of the known member names, or at worst log time.

The problem is you have to find member names which have various word lengths. So here is a quick (and not very good) implementation I wrote up to handle checking the last three words.

# This is where you load members from a file. 
# set gives us linear time lookup
members = set()
for line in ['First Person', 'Pele', 'Some Famous Writer']:
    members.add(line)

# sample text
text = 'When Some Famous Writer was talking to First Person about Pele blah blah blah blah'
from collections import deque

# pretend we are actually parsing, but I'm just splitting. So lazy.
# This is why I'm not handling punctuation and spaces well, but not relevant to the current topic
wordlist = text.split()

# buffer the last three words
buffer = deque()

# TODO: loop while not done, but this sort of works to show the idea
for word in wordlist:
    name = None
    if len(buffer) and buffer[0] in members:
        name = buffer.popleft()

    if not name and len(buffer)>1:
        two_word_name = buffer[0] + ' ' + buffer[1]
        if two_word_name in members:
            name = two_word_name
            buffer.popleft()
            buffer.popleft()

    if not name and len(buffer)>2:
        three_word_name = buffer[0] + ' ' + buffer[1] + ' ' + buffer[2]
        if three_word_name in members:
            name = three_word_name
            buffer.popleft()
            buffer.popleft()
            buffer.popleft()

    if name:
        print ('<member>', name, '</member> ')

    if len(buffer) >2:
        print (buffer.popleft() + ' ')

    buffer.append(word)

# TODO handle the remaining words which are still in the buffer
print (buffer)

I am just trying to demonstrate the concept. This doesn't handle spaces or punctuation. This doesn't handle the end at all -- it needs to loop while not done. It creates a bunch of temporary strings as it parses. But it illustrates the basic concept of parsing once, and even though it is horribly slow at parsing through the speech text, it might beat searching the speech text 250,000 times.

The reason you want to parse the text and check for name in set is that you do this once. A set has amortized linear time lookup, so it is much faster to check if name in members.

If I get the chance, I might edit it later to be a class that generates tokens, and fix finding names at the end, but I didn't intend this to be your final code.

For anyone curious about what amortised means in this context, this SO link may be useful: https://stackoverflow.com/questions/15079327/amortized-complexity-in-laymans-terms — Matt Fletcher, Dec 17 '17 at 21:38
Thank you for your effort, I will test the code tomorrow and give feedback. — kpapadop, Dec 17 '17 at 21:38
I did a quick test and indeed your code looks faster. Execution time 0.9 seconds, prior to mine that took 2.2. very good! — kpapadop, Dec 17 '17 at 21:51

Make list lookup faster?

1 Answers1