
My code does the following:

  1. Take a large text file (e.g. a legal document that is 300 pages as a PDF).
  2. Find a certain keyword (e.g. "small").
  3. Return n words to the left and n words to the right of the keyword.

NOTE: In this context, a "word" is any string of non-space characters. "$cow123" would be a word, but "health care" would be two words.

Here is my problem: The code takes an extremely long time to run on the 300 pages, and that time tends to increase very quickly as n increases.

Here is my code:

import re

fileHandle = open('test_pdf.txt', mode='r')
document = fileHandle.read()

def search(searchText, doc, n):
    """Search for searchText and return the n words on either side of it,
    as two separate tuples."""
    surround = r"\s*(\S*)\s*"
    groups = re.search(r'{}{}{}'.format(surround * n, searchText, surround * n), doc).groups()
    return groups[:n], groups[n:]

Here is the nasty culprit:

print search("\$27.5 million", document, 10)

Here's how you can test this code: Copy the function definition from the code block above and run the following:

t = "The world is a small place, we $.205% try to take care of it."
print search("\$.205", t, 3)

I suspect that I have a nasty case of catastrophic backtracking, but I'm too new to regex to point my finger on the problem.

How do I speed up my code?

user1274740
  • I felt like this should be suited for http://codereview.stackexchange.com but well, nice question. – aIKid Nov 19 '13 at 21:58
  • (1) Are you wedded to regex? (2) Do you intend `'small,'` and `'small'` to be different words? (They are by your definition, but maybe you didn't mean for them to be.) – DSM Nov 19 '13 at 22:02
  • 1
    what about first finding the line with given word only ( -> simpler regex -> probably faster) and nowm when you know where is it located, you can cheaply retrieve surrounding words .) – yedpodtrzitko Nov 19 '13 at 22:03
  • DSM: 'small,' returns two entities ('small' and ','), which is what I need. yedpodtrzitko: I thought that was what my code is currently doing? – user1274740 Nov 19 '13 at 22:06
  • No, what your code does is try to find everything at the same time. How does your search perform if you search *only for the word* (no groups, no surroundings)? – Tomalak Nov 19 '13 at 22:14
  • Also, is it necessary to search for a word via regex? You seem to want a verbatim substring search, so `find()` used [like this](http://stackoverflow.com/a/4665027/18771) will likely outperform regular expressions. – Tomalak Nov 19 '13 at 22:21
  • Tomalak, I agree: I think the solution is to get the index of the first character of the word, get 100 or so characters around that index, and run my regex on it to get the pre- and post-words. – user1274740 Nov 19 '13 at 22:28
  • Last tip: If anything, you want `(?:^|\s)(\S+)(?:\s|$)`. The stars happily match zero occurrences and you have your catastrophic backtracking right there. (And no, not "100 or so characters" - you *know* how many whitespaces you want to go in each direction, after all) – Tomalak Nov 19 '13 at 22:32
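The point in that last comment can be checked directly; a small illustrative snippet (the test strings are made up, not from the question's document):

```python
import re

# The all-optional group happily matches nothing at all, even on empty input...
m = re.match(r'\s*(\S*)\s*', '')
print(repr(m.group(1)))  # ''

# ...while the suggested form has to consume an actual word.
m = re.search(r'(?:^|\s)(\S+)(?:\s|$)', 'take care')
print(m.group(1))  # 'take'
```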

4 Answers


How about using re.search (or even string.find, if you're only searching for fixed strings) to find the string first, without any surrounding capturing groups? Then use the position and length of the match (.start() and .end() on a match object, or the return value of find() plus the length of the search string). Take the substring before the match and run \s*(\S*)\s*\Z etc. on it, and take the substring after the match and run \A\s*(\S*)\s* etc. on it.
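A minimal sketch of that approach, assuming the keyword is a literal string (the function name and the `{0,n}` repetition are my own illustration, not from the answer):

```python
import re

def context(doc, keyword, n):
    """Locate a literal keyword with str.find, then pull up to n words
    from each side using small anchored regexes on the two substrings."""
    pos = doc.find(keyword)
    if pos == -1:
        return None
    before = doc[:pos]
    after = doc[pos + len(keyword):]
    # Anchored at the cut points, so no document-wide backtracking.
    pre = re.search(r'((?:\S+\s+){0,%d})\Z' % n, before).group(1).split()
    post = re.search(r'\A((?:\s+\S+){0,%d})' % n, after).group(1).split()
    return pre, post

t = "The world is a small place, we $.205% try to take care of it."
print(context(t, "small", 3))  # (['world', 'is', 'a'], ['place,', 'we', '$.205%'])
```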

Also, for help with your backtracking: you can use a pattern like \s+\S+\s+ instead of \s*\S*\s* (two chunks of whitespace have to be separated by a non-zero amount of non-whitespace, or else they wouldn't be two chunks), and you shouldn't butt two consecutive \s*s up against each other like you do. I think r'\S+'.join([r'\s+'] * n) would give the right pattern for capturing n previous words (but my Python is rusty, so check that).

hobbs

I see several problems here. The first, and probably worst, is that everything in your "surround" regex is not just optional, but independently optional. Given this string:

"Lorem ipsum tritani impedit civibus ei pri"

...when searchText = "tritani" and n = 1, this is what it has to go through before it finds the first match:

regex:      \s*    \S*    \s*    tritani

offset 0:   ''   'Lorem'   ' '   FAIL
            ''   'Lorem'   ''    FAIL
            ''   'Lore'    ''    FAIL
            ''   'Lor'     ''    FAIL
            ''   'Lo'      ''    FAIL
            ''   'L'       ''    FAIL
            ''   ''        ''    FAIL

...then it bumps ahead one position and starts over:

offset 1:   ''   'orem'   ' '    FAIL
            ''   'orem'   ''     FAIL
            ''   'ore'    ''     FAIL
            ''   'or'     ''     FAIL
            ''   'o'      ''     FAIL
            ''   ''       ''     FAIL

... and so on. According to RegexBuddy's debugger, it takes almost 150 steps to reach the offset where it can make the first match:

position 5: ' '  'ipsum'  ' '    'tritani'

And that's with just one word to skip over, and with n=1. If you set n=2 you end up with this:

\s*(\S*)\s*\s*(\S*)\s*tritani\s*(\S*)\s*\s*(\S*)\s*

I'm sure you can see where this is going. Note especially that when I change it to this:

(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)tritani(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)

...it finds the first match in a little over 20 steps. This is one of the most common regex anti-patterns: using * when you should be using +. In other words, if it's not optional, don't treat it as optional.

Finally, you may have noticed the \s*\s* in the auto-generated regex above: consecutive \s* tokens are redundant, and they give the engine even more positions to try during backtracking.
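A quick way to convince yourself that the tightened pattern captures the same words here (the sample string is illustrative; I added a word so that two words exist on each side of the match):

```python
import re

text = "Lorem ipsum dolor tritani impedit civibus ei pri"

# Everything independently optional: many choice points to backtrack through.
star = r'\s*(\S*)\s*\s*(\S*)\s*tritani\s*(\S*)\s*\s*(\S*)\s*'
# Required whitespace and words: far fewer choice points, same captures here.
plus = r'\s+(\S+)\s+(\S+)\s+tritani\s+(\S+)\s+(\S+)\s+'

print(re.search(star, text).groups())  # ('ipsum', 'dolor', 'impedit', 'civibus')
print(re.search(plus, text).groups())  # same groups, far fewer steps
```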

Alan Moore

You could try using mmap and appropriate regex flags, e.g. (untested):

import re
import mmap

with open('your file') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for match in re.finditer(your_re, mf, flags=re.DOTALL):
        print match.group() # do something with your match

This will only keep memory usage lower, though...

The alternative is to have a sliding window of words (simple example of just single word before and after)...:

import re
import mmap
from itertools import islice, tee, izip_longest

with open('testingdata.txt') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    words = (m.group() for m in re.finditer('\w+', mf, flags=re.DOTALL))
    grouped = [islice(el, idx, None) for idx, el in enumerate(tee(words, 3))]
    for group in izip_longest(*grouped, fillvalue=''):
        if group[1] == 'something': # check criteria for group
            print group
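Since the question's document already fits in memory, the tokenize-first idea can be sketched even more simply, keeping the asker's "word = any non-space run" definition (the function name is illustrative, not from the answer):

```python
import re

def windows(text, keyword, n):
    """One linear, non-backtracking scan: list every word, then slice
    n neighbours around each word that contains the keyword."""
    words = re.findall(r'\S+', text)
    return [(words[max(0, i - n):i], words[i + 1:i + 1 + n])
            for i, w in enumerate(words) if keyword in w]

t = "The world is a small place, we $.205% try to take care of it."
print(windows(t, "small", 3))  # [(['world', 'is', 'a'], ['place,', 'we', '$.205%'])]
```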
Jon Clements
  • That's unlikely to help. The data is already in memory, it's the `re.search` that's taking all the time. `mmap` could theoretically lower memory consumption and/or save on the initial `read`, but it won't make the regex engine go any faster. :) – hobbs Nov 19 '13 at 22:13
  • @user1274740 then the next thing is to try to see if you can tokenize the file and use the sliding window to locate the text you want to grab the surrounding text from instead of using an re... – Jon Clements Nov 19 '13 at 22:30

I think you are going about this completely backwards. (I'm a little confused as to what you are doing in the first place!)

I would recommend checking out the re_search function I developed in the textools module of my cloud toolbox.

With re_search, you could solve this problem with something like:

from cloudtb import textools
data_list = textools.re_search('my match', pdf_text_str)  # search for character objects
# you now have a list of strings and RegPart objects. Parse through them:
for i, regpart in enumerate(data_list):
    if isinstance(regpart, basestring):
        words = textools.re_search('\w+', regpart)
        # do stuff with words
    else:
        # I think you are ignoring these? Not totally sure
        pass

Here is a link on how to use and how it works: http://cloudformdesign.com/?p=183

In addition to this, your regular expressions would also be printed out in a more readable format.

You might also want to check out my tool Search The Sky or the similar tool Kiki to help you build and understand your regular expressions.

vitiral
  • This doesn't seem to address the questioner's specific issues at all, and seems awfully close to spam ("don't use `re`, use my module instead"). Further, your example has some very bad code, like `type(regpart) == str` (use `isinstance`!). – Blckknght Nov 20 '13 at 00:48
  • Thanks for the isinstance correction. You could of course use the re module to get the text in-between your found items -- but it is a pain (as far as I know). In my opinion, the re module itself is pretty crappy. – vitiral Nov 20 '13 at 00:59