Extract all words between two phrases using regex

Question

I'm trying to extract all the words between two phrases using the following regex:

\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b(.*)\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings)\b

The documents I'm running this regex on are 10-K filings. The filings are too long to post here (see regex101 url below for example), but basically they are something like this:

ITEM 1. BUSINESS

lots of words

ITEM 2. PROPERTIES

lots of words

ITEM 3. LEGAL PROCEEDINGS

I want to extract all the words between ITEM 1 and ITEM 3. Note that the subtitles for each ITEM may be slightly different for each 10-K filing, hence I'm allowing for a few words between each word.

I keep getting a catastrophic backtracking error, and I cannot figure out why. For example, please see https://regex101.com/r/zgTiyb/1.

What am I doing wrong?

Your question is unclear to me. Can you ask it differently? There is no `ITEM 1. BUSINESS` in your regex link. Please be very precise about your input data and expected output. — mickmackusa, Jul 03 '18 at 03:42
Okay, I'll have a look in notepad. (Google Chrome Find can't find that) ...notepad couldn't either (tabbing) — mickmackusa, Jul 03 '18 at 03:46
And you want to stop matching at line 946? ... that's a good sized chunk of characters. — mickmackusa, Jul 03 '18 at 03:49
Yes, that is correct. This is just one example 10-K, most are similar with slight difference in how they title ITEM 1, ITEM 2, ITEM 3, etc. — user4951834, Jul 03 '18 at 03:50
So you want the whole match captured as a single string even though `Item 2` will be in it? Or is it better to divide the matches into `Item 1`'s text and `Item 2`'s text? What is the next process for you in the project? What are you doing with the string? — mickmackusa, Jul 03 '18 at 03:51
It can be divided or as one string, it doesn't matter. Divided may be better. — user4951834, Jul 03 '18 at 03:52
I will be running some textual analysis on the extracted string. — user4951834, Jul 03 '18 at 03:54
I'm still searching. Here's food for thought: https://stackoverflow.com/q/8268624/2943403 and https://stackoverflow.com/q/18296441/2943403 — mickmackusa, Jul 03 '18 at 04:05
Does the regex look fine to you? If so, then maybe that is the problem and I'll tackle this issue another way if regex won't work. — user4951834, Jul 03 '18 at 04:07
Relevant reading from Revo: [Catastrophic backtracking](https://stackoverflow.com/a/39833391/2943403) — mickmackusa, Jul 03 '18 at 06:22
Thanks, looking into it because the regex Sebastian shared doesn't actually run in PHP for some reason (it runs just fine in regex101)... — user4951834, Jul 03 '18 at 06:27
Another educational page https://stackoverflow.com/questions/27237579/simple-alphanumeric-regex-single-spacing-without-catastrophic-backtracking — mickmackusa, Jul 03 '18 at 07:47

score 3 · Answer 1 · answered Jul 03 '18 at 08:03

Catastrophic backtracking almost always has one main cause:

A possible match is found but can't finish.

Your pattern leaves too many positions for the regex engine to try, which hits the backtracking limit in PCRE. A quick workaround is to replace the only dot-star in the regex with a bounded quantifier, i.e.

.{0,200}

See live demo here
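To illustrate the trade-off of the bounded quantifier, here is a minimal sketch in Python. It assumes Python's `re` module rather than PCRE, and the function name `between_bounded`, the capture group, and the two sample strings are made up for the illustration. The heading patterns are copied from the question.

```python
import re

# Heading patterns copied from the question's regex.
HEAD = r"\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b"
TAIL = (r"\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal"
        r"\W+(?:\w+\W+){0,3}?proceedings)\b")


def between_bounded(text, limit=200):
    """Return the text between the two headings, or None if no match is found.

    `limit` caps how many characters the middle part may span, so the
    engine never scans further than that past the first heading.
    """
    bounded = "(.{0,%d})" % limit  # hypothetical capture group for the middle
    pattern = re.compile(HEAD + bounded + TAIL, re.IGNORECASE | re.DOTALL)
    match = pattern.search(text)
    return match.group(1) if match else None


# Made-up inputs: one short enough for the limit, one too long.
short_doc = ("ITEM 1. BUSINESS\n\nlots of words\n\nITEM 2. PROPERTIES\n\n"
             "lots of words\n\nITEM 3. LEGAL PROCEEDINGS\n")
long_doc = "ITEM 1. BUSINESS\n" + "word " * 100 + "\nITEM 3. LEGAL PROCEEDINGS\n"

print(between_bounded(short_doc) is not None)
print(between_bounded(long_doc) is None)
```

The second call shows the cost of this workaround: a section longer than the limit simply does not match, so the limit has to be chosen with the real document sizes in mind.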

But the better approach is re-constructing the regular expression:

\bitem\b.*?\b(?:1|one)\b(*COMMIT)\W+(?:\w+\W+){0,2}?business\b\h*\R+(?:(?!item\h+(?:3|three)\b)[\s\S])*+item\h+(?:3|three)\b\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings\b

See live demo here

Your own regex needs ~45K steps on the given input string to find those two matches. In contrast, this modified regex needs ~8K steps to accomplish the task. That's a huge improvement.

The latter doesn't need the s flag (and it shouldn't be enabled). I used the (*COMMIT) backtracking verb to cause an early failure if a possible match is found but is unlikely to finish.
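The core of the restructured pattern is the middle part, which consumes one character at a time as long as the closing heading does not start at that position. Below is a rough approximation in Python's `re` module, not the PCRE pattern itself. `re` has no `(*COMMIT)`, `\h` or `\R`, so the verb is dropped, `\h` becomes `[ \t]` and `\R+` becomes `[\r\n]+`. The possessive `*+` is replaced by a plain `*` because only newer Python versions support possessive quantifiers. The capture group, the name `TEMPERED` and the sample text are additions for the illustration.

```python
import re

# Approximation of the restructured regex; see the substitutions listed above.
TEMPERED = re.compile(
    # Opening heading: "item", then "1" or "one" on the same line, then "business".
    r"\bitem\b.*?\b(?:1|one)\b\W+(?:\w+\W+){0,2}?business\b[ \t]*[\r\n]+"
    # Middle: any character, as long as "item 3" / "item three" does not start here.
    r"((?:(?!item[ \t]+(?:3|three)\b)[\s\S])*)"
    # Closing heading.
    r"item[ \t]+(?:3|three)\b\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings\b",
    re.IGNORECASE,  # note: no DOTALL, matching the "no s flag" remark
)

# Made-up sample shaped like the outline in the question.
sample = ("ITEM 1. BUSINESS\n\nlots of words\n\nITEM 2. PROPERTIES\n\n"
          "lots of words\n\nITEM 3. LEGAL PROCEEDINGS\n")

match = TEMPERED.search(sample)
body = match.group(1) if match else None
print(repr(body))
```

Without `(*COMMIT)` and the possessive quantifier, this version may backtrack more than the PCRE original on inputs that fail to match, so it shows the structure only and makes no claim about the step counts.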

@Sebastian Proske's solution matches three sub-strings but I don't think the third match is an expected match. This huge third match is the only reason for your regex to break.

Please read this answer to have a better insight into this problem.

score 0 · Answer 2 · answered Jul 03 '18 at 04:24

This isn't really catastrophic backtracking, just a whole lot of text and a comparatively low backtracking limit in regex101. In this scenario .* isn't optimal: once it is reached, it matches the whole remainder of the text file and then backtracks character by character to match the parts after it, which means a lot of characters to process.

It seems you can stick to \w+\W+ at that place as well and use lazy matching instead of greedy to get your result, like

\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b\W+(?:\w+\W+)*?\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal\W+(?:\w+\W+){0,3}?proceedings)\b

Note that the PCRE engine optimizes (?:\w+\W+) to (?>\w++\W++), so it works in word/non-word chunks instead of single characters.
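A minimal sketch of the lazy word-chunk variant, again in Python's `re` module rather than PCRE. The capture group around the middle part, the names `LAZY_CHUNKS` and `between_lazy`, and the sample text are assumptions added for the illustration. Python's engine may not apply the same atomic-group optimization as PCRE, so the behaviour on very large inputs can differ.

```python
import re

# Heading patterns copied from the question's regex.
HEAD = r"\b(?:item\W+(?:\w+\W+){0,2}?(?:1|one)\W+(?:\w+\W+){0,3}?business)\b"
TAIL = (r"\b(?:item\W+(?:\w+\W+){0,2}?(?:3|three)\W+(?:\w+\W+){0,3}?legal"
        r"\W+(?:\w+\W+){0,3}?proceedings)\b")

# The middle is a lazy run of word/non-word chunks instead of a greedy .*
# so the engine extends the match forward one chunk at a time.
LAZY_CHUNKS = re.compile(HEAD + r"\W+((?:\w+\W+)*?)" + TAIL, re.IGNORECASE)


def between_lazy(text):
    """Return the text between the two headings, or None if no match is found."""
    match = LAZY_CHUNKS.search(text)
    return match.group(1) if match else None


# Made-up sample shaped like the outline in the question.
sample = ("ITEM 1. BUSINESS\n\nlots of words\n\nITEM 2. PROPERTIES\n\n"
          "lots of words\n\nITEM 3. LEGAL PROCEEDINGS\n")

print(repr(between_lazy(sample)))
```

Because the middle part is lazy, the match ends at the first heading that satisfies the closing pattern rather than at the last one in the file.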

FYI, while this code works in regex101, it crashes in PHP. — user4951834, Jul 03 '18 at 06:29
