I have this text:

before label bla bla bla aaaa<TAG1>bbbb bla bla bla bla abcd<TAG2>efgh after

and this regex:

label\W+(?:\w+\W+){1,60}?(?:.){0,}?(\<TAG1\>|\<TAG2\>)(?:.){0,}?\W+(?:\w+\W+){1,60}(?:.){0,}?(\<TAG2\>|\<TAG1\>)(?:.){0,}?

It does the job and works as expected, but it does not seem well optimized.

This is a test: https://regex101.com/r/eS2kS6/1

Basically, I have to find a label; after N words I should get <TAG1> or <TAG2>, and after N more words I should again get <TAG1> or <TAG2>.

NOTE:

It is very important that <TAG1> or <TAG2> can be matched as a "substring" of a word. Sometimes it appears as aaaa<TAG1>bbbb, sometimes as <TAG1> on its own. As you can see in the example, it works in both cases.

Dail
  • All the 'n words' stuff seems superfluous since you're already matching 'any' before, between and after the tags. – pvg Dec 08 '15 at 01:35

1 Answer

It often helps to visualize the regular expression:

[Image: visualization of the original regular expression]

Note that (?:.){0,}? is a roundabout way of saying .*? (a lazy .*). It's also easy to see now that there are two nearly identical blocks which could be merged, so let's fix that (using a plain greedy .* for brevity):

label\W+(?:(?:\w+\W+){1,60}?.*(\<TAG1\>|\<TAG2\>).*){2}

[Image: visualization of the merged regular expression]

This is nearly equivalent, but shorter; the differences are that the .* parts are greedy and the \W+ that opened the second block is gone. From here it becomes a question of what exactly you're trying to match. All those \w's and \W's look a little odd to me, especially when used alongside .'s. I generally prefer to match \s rather than \W since I usually really do mean "some sort of whitespace", but you'll need to decide which you actually need.
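To compare the merged pattern with the original, both can be run against the sample text from the question. This is a minimal sketch using Python's re module; the question doesn't say which regex flavor is in use, so Python is an assumption, and the variable names are only for illustration.

```python
import re

# Sample text from the question.
text = ("before label bla bla bla aaaa<TAG1>bbbb "
        "bla bla bla bla abcd<TAG2>efgh after")

# The original pattern from the question, split over two lines for readability.
original = re.compile(
    r"label\W+(?:\w+\W+){1,60}?(?:.){0,}?(\<TAG1\>|\<TAG2\>)(?:.){0,}?"
    r"\W+(?:\w+\W+){1,60}(?:.){0,}?(\<TAG2\>|\<TAG1\>)(?:.){0,}?"
)
# The merged pattern from this answer.
merged = re.compile(r"label\W+(?:(?:\w+\W+){1,60}?.*(\<TAG1\>|\<TAG2\>).*){2}")

m_orig = original.search(text)
m_merged = merged.search(text)

# Both patterns find the label followed by two tags.
print(m_orig is not None, m_merged is not None)

# The lazy trailing (?:.){0,}? stops right after the second tag,
# while the greedy .* runs to the end of the line.
print(repr(m_orig.group(0)[-6:]))
print(repr(m_merged.group(0)[-5:]))

# A group repeated with {2} keeps only its last capture.
print(m_orig.groups(), m_merged.groups())
```

Under these assumptions the merged pattern accepts the same line, but its match extends to the end of the line and group 1 exposes only the last tag, so code that reads both captures would need two explicit groups.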

The "match-one-to-sixty-words-and-not-words-followed-by-anything" pattern you're using ((?:\w+\W+){1,60}?.*) is likely not what you want: it would match a$<TAG for instance, but not a<TAG, because \W+ has to consume at least one non-word character before the tag. If you want to allow one or more words, try (?:\s*\w+)+. This matches zero or more whitespace characters followed by one or more word characters, one or more times. If you want to limit that to 60 you can replace the final + with {1,60} (but it's not clear from your description where the 60 comes from; do you need it?).
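The difference between the two word patterns is easy to check on a few short inputs. The sketch below again assumes Python's re module; TAGS, asker_words and simple_words are names made up for this example, and the full tag names stand in for the abbreviated <TAG above.

```python
import re

TAGS = r"(<TAG1>|<TAG2>)"

# The fragment from the question: words, each followed by non-word characters.
asker_words = re.compile(r"(?:\w+\W+){1,60}?.*" + TAGS)
# The suggested alternative: optional whitespace, then a word, repeated.
simple_words = re.compile(r"(?:\s*\w+)+" + TAGS)

# match() anchors at the start of each sample string.
for sample in ("a$<TAG1>", "a<TAG1>", "bla bla aaaa<TAG1>"):
    print(sample,
          bool(asker_words.match(sample)),
          bool(simple_words.match(sample)))
```

The first pattern accepts a$<TAG1> but rejects a<TAG1>, while the second does the opposite; both accept the multi-word sample.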

So here's where we are now:

label\s+(?:(?:\w+\s*)+(\<TAG1\>|\<TAG2\>)\w*){2}

[Image: visualization of the revised regular expression]

This isn't quite identical to the previous pattern: it stops at efgh and doesn't match the trailing after in your example string (it's not clear from your description whether it should or not). If you want to keep matching after the second tag, just add a .* to the end.
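Here is a small check of the revised pattern on the sample text, with and without the trailing .*. As before, Python's re module is an assumption and the variable names are illustrative.

```python
import re

# Sample text from the question.
text = ("before label bla bla bla aaaa<TAG1>bbbb "
        "bla bla bla bla abcd<TAG2>efgh after")

revised = re.compile(r"label\s+(?:(?:\w+\s*)+(\<TAG1\>|\<TAG2\>)\w*){2}")
# Same pattern with .* appended to keep matching after the second tag.
revised_to_end = re.compile(revised.pattern + r".*")

m = revised.search(text)
m_end = revised_to_end.search(text)

print(m.group(0))      # stops at the word that contains the second tag
print(m_end.group(0))  # continues to the end of the line
```

One caveat: because \s* can match nothing, (?:\w+\s*)+ can split a long run of word characters in many different ways, so a line that contains the label but no matching tags may be slow to reject.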


All that said, it looks a lot like you're trying to parse a complex grammar (i.e. a non-regular language), and that is rife with peril. If you find yourself writing and rewriting a regular expression to try to make it capture the data you need, you may need to upgrade to a proper contextual parser.

In particular, neither your regular expression nor my tweaks enforce that N is the same each time. Your description makes it sound like you only want to match strings where there are N words preceding the first tag and exactly N words between it and the second tag. That sort of match might be possible with regular expressions, but it certainly wouldn't be clean. If that's a requirement, regular expressions likely aren't the right tool.
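If the same-N reading is what you need, counting words in ordinary code is probably simpler than encoding it in a pattern. The following is only a sketch of one possible interpretation: it treats whitespace-separated tokens as words, and word_gaps is a hypothetical helper, not something from the question.

```python
import re

TAG = re.compile(r"<TAG1>|<TAG2>")

def word_gaps(text, label="label"):
    """Return (words before the first tag, words between the two tags).

    Returns None if the label or two tagged words are missing.
    Assumption: a "word" is any whitespace-separated token.
    """
    # Note: partition() finds the label as a plain substring.
    _, found, rest = text.partition(label)
    if not found:
        return None
    tokens = rest.split()
    # Indices of tokens that contain a tag anywhere inside them.
    tagged = [i for i, tok in enumerate(tokens) if TAG.search(tok)]
    if len(tagged) < 2:
        return None
    first, second = tagged[0], tagged[1]
    return first, second - first - 1

sample = ("before label bla bla bla aaaa<TAG1>bbbb "
          "bla bla bla bla abcd<TAG2>efgh after")
gaps = word_gaps(sample)
print(gaps, gaps is not None and gaps[0] == gaps[1])
```

On the sample text this reports three words before the first tag and four between the tags, so under the strict same-N reading the sample itself would not qualify; that may be a hint that "up to N words" is what is really meant.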

dimo414
  • I do not have to parse an HTML document. <TAG1> and <TAG2> are just examples of words I need to check; they could be replaced with "dog" and "cat", for example. However, my goal is a sort of text mining: I have to find specific patterns and extract the content. What is the right tool, in your opinion? – Dail Dec 08 '15 at 09:20
  • I didn't say you were parsing an HTML document, I said you might be parsing a non-regular language (and HTML is a common example of such a language). A popular grammar generator and parser is [ANTLR](http://www.antlr.org/), but there are many others. – dimo414 Dec 08 '15 at 14:07