-1

I have a file that every line is containing names of persons and a file containing text of speeches. The file with the names is very big(250k lines) ordered alphabetically, the speeches file has around 1k lines. What I want to do is a lookup for the names in my text file and do replacements for every occurring name from my names file. This is my code EDIT: The with function that opens the list is executed only one time.

members_list = []
with open(path, 'r') as l:
    for line in l.readlines():
        members_list.append(line.strip('\n'))

for member in self.members_list:
    if member in self.body:
        self.body = self.body.replace(member, '<member>' + member + '</member>')

This code takes about 2.2 seconds to run, but because I have many speech files (4.5k) the total time is around 3 hours. Is it possible to make this faster? Are generators the way to go?

kpapadop
  • 13
  • 4

1 Answers1

1

Currently, you re-read each speech in memory once for each of the 250,000 names when you check "if member in self.body".

You need to parse the speech body once, finding whole words, spaces, and punctuation. Then you need to see if you have found a name, using a linear time lookup of the known member names, or at worst log time.

The problem is you have to find member names which have various word lengths. So here is a quick (and not very good) implementation I wrote up to handle checking the last three words.

# This is where you load members from a file. 
# set gives us linear time lookup
members = set()
for line in ['First Person', 'Pele', 'Some Famous Writer']:
    members.add(line)

# sample text
text = 'When Some Famous Writer was talking to First Person about Pele blah blah blah blah'
from collections import deque

# pretend we are actually parsing, but I'm just splitting. So lazy.
# This is why I'm not handling punctuation and spaces well, but not relevant to the current topic
wordlist = text.split()

# buffer the last three words
buffer = deque()

# TODO: loop while not done, but this sort of works to show the idea
for word in wordlist:
    name = None
    if len(buffer) and buffer[0] in members:
        name = buffer.popleft()

    if not name and len(buffer)>1:
        two_word_name = buffer[0] + ' ' + buffer[1]
        if two_word_name in members:
            name = two_word_name
            buffer.popleft()
            buffer.popleft()

    if not name and len(buffer)>2:
        three_word_name = buffer[0] + ' ' + buffer[1] + ' ' + buffer[2]
        if three_word_name in members:
            name = three_word_name
            buffer.popleft()
            buffer.popleft()
            buffer.popleft()

    if name:
        print ('<member>', name, '</member> ')

    if len(buffer) >2:
        print (buffer.popleft() + ' ')

    buffer.append(word)

# TODO handle the remaining words which are still in the buffer
print (buffer)

I am just trying to demonstrate the concept. This doesn't handle spaces or punctuation. This doesn't handle the end at all -- it needs to loop while not done. It creates a bunch of temporary strings as it parses. But it illustrates the basic concept of parsing once, and even though it is horribly slow at parsing through the speech text, it might beat searching the speech text 250,000 times.

The reason you want to parse the text and check for name in set is that you do this once. A set has amortized linear time lookup, so it is much faster to check if name in members.

If I get the chance, I might edit it later to be a class that generates tokens, and fix finding names at the end, but I didn't intend this to be your final code.

Kenny Ostrom
  • 5,639
  • 2
  • 21
  • 30
  • For anyone curious about what amortised means in this context, this SO link may be useful: https://stackoverflow.com/questions/15079327/amortized-complexity-in-laymans-terms – Matt Fletcher Dec 17 '17 at 21:38
  • Thank you for your effort, I will test the code tomorrow and give feedback. – kpapadop Dec 17 '17 at 21:38
  • I did a quick test and indeed your code looks faster. Execution time 0.9 seconds, prior to mine that took 2.2. very good! – kpapadop Dec 17 '17 at 21:51