I'm running into something odd. I'm prototyping a text crawler using the Open ANC as a corpus.

For some texts, the re module simply stops responding. If someone can confirm that this is a limit on the regex complexity the re module can handle, I'm fine with that.

The RegEx is preceding(?:[^A-Za-z0-9\n\r]*\w+[^A-Za-z0-9\n\r]*)+acquired

The text where the problem occurs is:

My claim is that Lincoln’s address expresses the same idea that was then current in Europe. Each people of common history and language constitutes a nation, and the natural form for the nation’s survival was in a state structure. The idea that Americans constituted an organic national unit explained, implicitly, why the eleven Southern states could not go their own way. As he assumed the presidency, Lincoln still spoke of the Union rather than a nation; but in the course of the debates in the decades immediately preceding, the notion of union had acquired the metaphysical qualities of nationhood. In his first inaugural address, Lincoln invoked the “bonds of affection,” and even before shots were fired on Fort Sumter in Charleston Harbor, he stressed the unbreakable ties of historical struggle:

Python code to reproduce the problem:

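import re

# Replace the placeholder with the paragraph quoted above.
txt = "post text here"
regex = r"preceding(?:[^A-Za-z0-9\n\r]*\w+[^A-Za-z0-9\n\r]*)+acquired"
# With the quoted paragraph as txt, this call appears to hang.
print(re.findall(regex, txt))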
gkhaos

1 Answer

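Your pattern is affected by catastrophic backtracking. Both separator classes inside the repeated group are optional, so a run of word characters can be divided between iterations of the group in exponentially many ways, and the engine tries all of them before it gives characters back.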
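To see the growth without hanging the interpreter, here is a minimal sketch that times the original pattern on short inputs that cannot match. The names SLOW and time_failed_search and the test strings are made up for illustration, and the timings will vary by machine; the point is only that each extra character roughly doubles the work.

```python
import re
import time

# The pattern from the question, unchanged.
SLOW = re.compile(r"preceding(?:[^A-Za-z0-9\n\r]*\w+[^A-Za-z0-9\n\r]*)+acquired")


def time_failed_search(n):
    """Search a string that can never match and return (result, seconds).

    The input is 'preceding ' followed by n word characters and no
    'acquired', so the engine has to try every way of splitting the run
    of 'a' characters between iterations of the group before failing.
    """
    text = "preceding " + "a" * n
    start = time.perf_counter()
    result = SLOW.search(text)
    return result, time.perf_counter() - start


# Keep n small: the number of splits grows as 2**(n - 1).
for n in (10, 14, 18):
    result, elapsed = time_failed_search(n)
    print(n, result, f"{elapsed:.4f}s")
```

In your paragraph the same thing happens with the words that follow "acquired": the greedy group first runs to the end of the text and then has to back out through all of those splits.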

Here is an alternative pattern that should work with your input:

regex = r"preceding[^A-Za-z0-9\n\r]+(?:\w+[^A-Za-z0-9\n\r]+)+?acquired"

This assumes that there must always be at least one non-word character separating the words (otherwise it would just match one long, unbroken word).
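As a quick check, here is the alternative pattern run against one sentence from your paragraph. The variable names are only for illustration. Note that \w also matches underscores and non-ASCII letters, which the separator class matches too, so this sketch assumes plain ASCII words like the ones in your text.

```python
import re

# Pattern from this answer: separators are mandatory (+) and the group is lazy (+?).
FIXED = re.compile(
    r"preceding[^A-Za-z0-9\n\r]+(?:\w+[^A-Za-z0-9\n\r]+)+?acquired"
)

# One sentence taken from the paragraph quoted in the question.
sentence = (
    "but in the course of the debates in the decades immediately preceding, "
    "the notion of union had acquired the metaphysical qualities of nationhood."
)

matches = FIXED.findall(sentence)
print(matches)  # ['preceding, the notion of union had acquired']
```

Because every word must be followed by at least one separator, there is only one way to divide the text into words, and the lazy quantifier stops at the first "acquired" instead of running to the end of the text first.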

(See also: How can I recognize an evil regex?)

ekhumoro