Optimize a find and match code in Python

Question

I have code that takes two input files: (1) a dictionary/lexicon and (2) a text file with one sentence per line.

The first part of my code reads the dictionary into tuples, producing something like:

('mthy3lkw', 'weakBelief', 'U')

('mthy3lkm', 'firmBelief', 'B')

('mthy3lh', 'notBelief', 'A')

The second part of the code searches each sentence in the text file for the words in position 0 of those tuples, then prints the sentence, the search word, and its type.

So given the sentence mthy3lkw ana mesh 3arif, the desired output is:

["mthy3lkw ana mesh 3arif", 'mthy3lkw', 'weakBelief', 'U'] given that the word mthy3lkw is found in the dictionary.

The second part of my code - the matching part - is TOO slow. How do I make it faster?

Here is my code

findings = [] 
for sentence in data:  # I open the sentences file with .readlines()
    for word in tuples:  # similar to the ones mentioned above
        p1 = re.compile('\\b%s\\b'%word[0])  # get the first word in every tuple
        if p1.findall(sentence) and word[1] == "firmBelief":
            findings.append([sentence, word[0], "firmBelief"])

print findings

score 1 · Answer 1 · edited May 23 '17 at 10:24

Convert your list of tuples into a trie, and use that for searching.

answered Oct 07 '12 at 03:20

Ignacio Vazquez-Abrams

Can you expand your answer further to help the OP understand a trie, and how it will help speed up the search? – the Tin Man Oct 07 '12 at 03:53
I'm just starting with Python so I don't know what trie is. – Sabba Oct 07 '12 at 04:26
Sabba: A [trie](http://en.wikipedia.org/wiki/Trie) is not a Python thing, it's a data structure thing. It allows for fast searching due to the sequestering of similar words within a tree-like structure. – Ignacio Vazquez-Abrams Oct 07 '12 at 06:56
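
To make the idea concrete, here is a minimal sketch of a trie built from nested dicts, using the three tuples from the question. The names build_trie and lookup are illustrative helpers, not part of any library, and storing the tuple under the key None as an end-of-word marker is just one possible convention. Note how the three words share the prefix mthy3l, so they share the same path in the tree until they diverge.

```python
import re

def build_trie(entries):
    """Build a character trie; each complete word stores its tuple under the key None."""
    root = {}
    for entry in entries:
        node = root
        for ch in entry[0]:
            # Reuse the child node for this character, or create it
            node = node.setdefault(ch, {})
        node[None] = entry  # end-of-word marker holding the full tuple
    return root

def lookup(trie, word):
    """Return the tuple stored for word, or None if word is not in the trie."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return None  # no dictionary word continues with this character
    return node.get(None)

entries = [
    ('mthy3lkw', 'weakBelief', 'U'),
    ('mthy3lkm', 'firmBelief', 'B'),
    ('mthy3lh', 'notBelief', 'A'),
]
trie = build_trie(entries)

findings = []
for sentence in ["mthy3lkw ana mesh 3arif"]:
    for word in re.findall(r'\w+', sentence):  # split the sentence into words
        entry = lookup(trie, word)
        if entry is not None:
            findings.append([sentence, entry[0], entry[1], entry[2]])

print(findings)
# [['mthy3lkw ana mesh 3arif', 'mthy3lkw', 'weakBelief', 'U']]
```

Each lookup walks at most len(word) nodes, regardless of how many entries the dictionary holds, so the cost per sentence no longer grows with the size of the lexicon.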

score 1 · Accepted Answer · answered Oct 08 '12 at 22:32

Build a dict that maps each search word to its tuple, so that finding the right entry is a single hash lookup. Then restructure your loops: instead of testing every dictionary entry against each sentence, go over each word in the sentence and look it up in the dict. This replaces one regex search per (sentence, entry) pair with one dict lookup per word:

import re

# Create a lookup structure: search word -> full tuple
word_dictionary = dict((entry[0], entry) for entry in tuples)

findings = []
word_re = re.compile(r'\b\S+\b')  # only need to create the regexp once
for sentence in data:
    for word in word_re.findall(sentence):  # check every word in the sentence
        if word in word_dictionary:  # a match was found
            entry = word_dictionary[word]
            findings.append([sentence, word, entry[1], entry[2]])
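
As a usage sketch, the same approach can be wrapped in a function and run on the sample data from the question. The function name find_matches and its wanted_type parameter are hypothetical, added here only to reproduce the "firmBelief" filter from the original loop, and the second sentence is made up for illustration. Stripping the trailing newline that .readlines() leaves on each line is also an assumption about the desired output.

```python
import re

WORD_RE = re.compile(r'\b\S+\b')  # compiled once, reused for every sentence

def find_matches(sentences, entries, wanted_type=None):
    """Return [sentence, word, type, code] for every dictionary word found.

    If wanted_type is given, keep only entries whose type equals it.
    """
    lookup = {entry[0]: entry for entry in entries}
    findings = []
    for sentence in sentences:
        sentence = sentence.rstrip('\n')  # drop the newline kept by readlines()
        for word in WORD_RE.findall(sentence):
            entry = lookup.get(word)
            if entry is None:
                continue  # word is not in the lexicon
            if wanted_type is None or entry[1] == wanted_type:
                findings.append([sentence, word, entry[1], entry[2]])
    return findings

tuples = [
    ('mthy3lkw', 'weakBelief', 'U'),
    ('mthy3lkm', 'firmBelief', 'B'),
    ('mthy3lh', 'notBelief', 'A'),
]
# The second sentence is invented sample data, not from the question
data = ["mthy3lkw ana mesh 3arif\n", "ana mthy3lkm\n"]

all_hits = find_matches(data, tuples)
firm_hits = find_matches(data, tuples, wanted_type="firmBelief")

print(all_hits)
# [['mthy3lkw ana mesh 3arif', 'mthy3lkw', 'weakBelief', 'U'],
#  ['ana mthy3lkm', 'mthy3lkm', 'firmBelief', 'B']]
print(firm_hits)
# [['ana mthy3lkm', 'mthy3lkm', 'firmBelief', 'B']]
```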
