How can I optimize this regex?

Question

I have this text:

before label bla bla bla aaaa<TAG1>bbbb bla bla bla bla abcd<TAG2>efgh after

and this regex:

label\W+(?:\w+\W+){1,60}?(?:.){0,}?(\<TAG1\>|\<TAG2\>)(?:.){0,}?\W+(?:\w+\W+){1,60}(?:.){0,}?(\<TAG2\>|\<TAG1\>)(?:.){0,}?

It does the job, it works as expected BUT does not really seem optimized.

This is a test: https://regex101.com/r/eS2kS6/1

Basically, I have to find a label; after N words I should get <TAG1> or <TAG2>, and after N words again I should get <TAG1> or <TAG2>.

NOTE:

It is very important that <TAG1> or <TAG2> can also be matched as a "substring" of a word. Sometimes it appears as aaaa<TAG1>bbbb, sometimes as <TAG1> on its own. As you can see in the example, it works in both cases.

All the 'n words' stuff seems superfluous since you're already matching 'any' before, between and after the tags. — pvg, Dec 08 '15 at 01:35

score 1 · Accepted Answer · edited May 23 '17 at 12:30

It often helps to visualize the regular expression:

Note that (?:.){0,}? is a roundabout way of saying .*? (the lazy form of .*). It's also easy to see now that there are two nearly identical blocks which could be merged, so let's fix that:

label\W+(?:(?:\w+\W+){1,60}?.*(\<TAG1\>|\<TAG2\>).*){2}

This is roughly equivalent, but shorter (the greedy .* lets the match run further, and the repeated group keeps only the last tag it matched). From here it becomes a question of what exactly you're trying to match. All those \ws and \Ws look a little odd to me, especially when used alongside .'s. I generally prefer to match \s rather than \W since I usually really do mean "some sort of whitespace", but you'll need to decide which you actually need.
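A quick way to compare the two patterns is to run both against the sample text. The following is a minimal sketch that assumes Python's re module (the question doesn't say which engine is in use); the variable names are only for illustration.

```python
import re

# Sample text from the question.
text = ("before label bla bla bla aaaa<TAG1>bbbb bla bla bla bla "
        "abcd<TAG2>efgh after")

# The pattern from the question, split over two lines for readability.
original = re.compile(
    r"label\W+(?:\w+\W+){1,60}?(?:.){0,}?(\<TAG1\>|\<TAG2\>)(?:.){0,}?"
    r"\W+(?:\w+\W+){1,60}(?:.){0,}?(\<TAG2\>|\<TAG1\>)(?:.){0,}?"
)
# The merged pattern with the block repeated twice.
merged = re.compile(r"label\W+(?:(?:\w+\W+){1,60}?.*(\<TAG1\>|\<TAG2\>).*){2}")

m_original = original.search(text)
m_merged = merged.search(text)

print(m_original.groups())  # ('<TAG1>', '<TAG2>')
print(m_merged.groups())    # ('<TAG2>',)
```

Both patterns find a match, but the merged one has a single capture group that is overwritten on each repetition, so only the last tag is available afterwards. If you need both tags, keep the two blocks written out separately.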

The "match-one-to-sixty-words-and-not-words-followed-by-anything" pattern you're using ((?:\w+\W+){1,60}?.*) is likely not what you want: it would match a$<TAG for instance, but not a<TAG. If you want to allow one or more words, try (?:\s*\w+)+. This matches zero or more whitespace characters followed by one or more word characters, one or more times. If you want to limit that to 60 words you can replace the final + with {1,60} (but it's not clear from your description where the 60 comes from; do you need it?).
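The difference between the two word patterns can be seen on two tiny inputs. This is a small sketch, again assuming Python's re module; the sample strings use a complete <TAG1> in place of the abbreviated <TAG above.

```python
import re

tag = r"(\<TAG1\>|\<TAG2\>)"

# The word/non-word pattern from the question, followed by a tag.
words_then_anything = re.compile(r"(?:\w+\W+){1,60}?.*" + tag)
# The suggested replacement: whitespace-separated words, followed by a tag.
words_only = re.compile(r"(?:\s*\w+)+" + tag)

results = {}
for sample in ("a$<TAG1>", "a<TAG1>"):
    results[sample] = (
        bool(words_then_anything.search(sample)),
        bool(words_only.search(sample)),
    )
    print(sample, results[sample])

# a$<TAG1> (True, False)
# a<TAG1> (False, True)
```

The first pattern needs at least one non-word character between the word and the tag, and in a<TAG1> the only candidate is the < of the tag itself, so the tag can no longer match.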

So here's where we are now:

label\s+(?:(?:\w+\s*)+(\<TAG1\>|\<TAG2\>)\w*){2}

This isn't quite identical to the previous pattern: it doesn't match after in your example string (it's not clear from your description whether it should or not). If you want to keep matching after the second tag, just add a .* to the end.
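To see where this pattern stops, here is a short sketch assuming Python's re module. The second sample string, with <TAG1> standing alone between spaces, is a made-up input for illustration and not taken from the question.

```python
import re

text = ("before label bla bla bla aaaa<TAG1>bbbb bla bla bla bla "
        "abcd<TAG2>efgh after")

simplified = re.compile(r"label\s+(?:(?:\w+\s*)+(\<TAG1\>|\<TAG2\>)\w*){2}")
# Same pattern with .* appended to keep matching after the second tag.
with_tail = re.compile(simplified.pattern + r".*")

m_short = simplified.search(text)
m_long = with_tail.search(text)

print(m_short.group(0))  # ends with "<TAG2>efgh"
print(m_long.group(0))   # ends with "<TAG2>efgh after"

# Hypothetical input where the first tag is followed directly by a space.
standalone = "label foo <TAG1> bar <TAG2>"
print(simplified.search(standalone))  # None
```

The last line suggests a limitation worth checking against your real data: when a tag is followed directly by whitespace, the next repetition has to start with a word character, so this pattern does not match. Allowing optional whitespace after the trailing \w* is one possible adjustment.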

All that said, it looks a lot like you're trying to parse a complex grammar (i.e. a non-regular language), and that is rife with peril. If you find yourself writing and rewriting a regular expression to try to make it capture the data you need, you may need to upgrade to a proper contextual parser.

In particular, neither your regular expression nor my tweaks enforce that N is the same each time. Your description makes it sound like you only want to match strings where there are N words preceding the first tag, and exactly N words in between it and the second tag. That sort of match might be possible with regular expressions, but it certainly wouldn't be clean. If that's a requirement, regular expressions likely aren't the right tool.
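If equal word counts really are a requirement, one option is to use regular expressions only to locate the tags and do the counting in ordinary code. The following is a rough sketch in Python; the function name equal_word_gaps is hypothetical, and it assumes that a "word" is any run of \w characters, so the aaaa and bbbb glued to a tag count as words.

```python
import re

TAG = re.compile(r"<TAG1>|<TAG2>")
WORD = re.compile(r"\w+")

def equal_word_gaps(text, label="label"):
    """Return (words_before_first_tag, words_between_tags, counts_equal),
    or None if the label or two tags cannot be found."""
    start = text.find(label)
    if start == -1:
        return None
    rest = text[start + len(label):]
    tags = list(TAG.finditer(rest))
    if len(tags) < 2:
        return None
    first, second = tags[0], tags[1]
    # Count words between the label and the first tag.
    before = len(WORD.findall(rest[:first.start()]))
    # Count words between the first and the second tag.
    between = len(WORD.findall(rest[first.end():second.start()]))
    return before, between, before == between

sample = ("before label bla bla bla aaaa<TAG1>bbbb bla bla bla bla "
          "abcd<TAG2>efgh after")
print(equal_word_gaps(sample))                         # (4, 6, False)
print(equal_word_gaps("label a b <TAG1> c d <TAG2>"))  # (2, 2, True)
```

Whether glued fragments such as aaaa should count as words is a decision only you can make; the point is that once the counting lives in code, that rule is a one-line change instead of a regex rewrite.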

I do not have to parse an HTML document. The <TAG1> and <TAG2> are just examples of words I need to check; they could be replaced with "dog" and "cat", for example. My goal is a sort of text mining: I have to find a specific pattern and extract the content. What would be the right tool in your opinion? — Dail, Dec 08 '15 at 09:20
I didn't say you were parsing an HTML document, I said you might be parsing a non-regular language (and HTML is a common example of such a language). A popular grammar generator and parser is [ANTLR](http://www.antlr.org/), but there are many others. — dimo414, Dec 08 '15 at 14:07
