Regexp: filtering a pattern enclosed by the same pattern

Question

Having a text sample of the following layout...

<text text <text text> text text>

<text text> <text text text <text text text>

...I need to capture the <enclosed parts> as highlighted.

With a lazy quantifier, the regexp <.*?> returns...

<text text <text text> text text>

<text text> <text text text text <text text>

...which misses the upper right part and wrongly includes the middle bottom part. I’ve also tried it with <.[^<]*?>, which gets the 2nd row right, but misses both the left and right parts on the 1st:

<text text <text text> text text>

<text text> <text text text text <text text>

How would the regexp look to <work as above>?

Duplicate of https://stackoverflow.com/questions/46541043/check-words-that-start-and-end-with-same-letter-in-c-sharp/46541129#comment80036711_46541129 Use “grouping” and `\1`. Furthermore, regexp IS NOT a good tool for matching XML-like syntactic languages: https://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la — Gaétan RYCKEBOER, Oct 03 '17 at 09:59
Fix as `<[^<>]*>`, but if you are dealing with HTML/XML, you definitely need a parser. — Wiktor Stribiżew, Oct 03 '17 at 10:01
Unfortunately this produces the same result as my 2nd try. Also, for clarification, this isn’t a web markup language, but rather a syntax that was used to label parts of an ordinary text file. — Россарх, Oct 03 '17 at 10:24

score 1 · Accepted Answer · answered Oct 03 '17 at 10:59

$ grep -Pzo "(?:<(?:[^<]|[<].*[>])*>)*" /tmp/test1

<text text <text text> text text><text text><text text text>

$ cat /tmp/test1

<text text <text text> text text>

<text text> <text text text <text text text>

or as an alternative drop the multiline processing

$ grep -Po "(?:<(?:[^<]|[<].*[>])*>)*" /tmp/test1
<text text <text text> text text>
<text text>
<text text text>
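
For readers without GNU grep, the same one-level idea can be sketched in Python. This is an illustration, not the exact command above: it swaps the greedy `<.*>` for the stricter inner pattern `<[^<>]*>`, and the names `ONE_LEVEL` and `sample` are made up for the example.

```python
import re

# One level of nesting: "<", then any mix of non-bracket characters
# or complete inner "<...>" groups, then the closing ">".
ONE_LEVEL = re.compile(r"<(?:[^<>]|<[^<>]*>)*>")

# The two sample rows from the question (assumed input).
sample = (
    "<text text <text text> text text>\n"
    "<text text> <text text text <text text text>\n"
)

matches = ONE_LEVEL.findall(sample)
for m in matches:
    print(m)
```

On this sample the unclosed `<text text text ` prefix on the second row is skipped, because no closing `>` is left for it once the inner group has been consumed.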

Calvin Taylor

parser smarsher – Calvin Taylor Oct 03 '17 at 11:00
Does this really capture everything except the middle bottom part for you, as seen in the first box of my question..? – Россарх Oct 03 '17 at 11:07
it's working as I have shown, which I think is what you want. The second example in my answer shows the 3 capture groups found. – Calvin Taylor Oct 03 '17 at 11:24
But what about two levels of embedding? What about three, or four? – alexis Oct 03 '17 at 11:29
you can add more nested bits as you want. It depends on how much you hate writing parsers. – Calvin Taylor Oct 03 '17 at 11:39
Very well, thank you! Simplifying your regexp somewhat to `<([^<>]|<.*>)*>`, which still works great for what i wanted, i am accepting your answer. – Россарх Oct 03 '17 at 13:52

alexis · Answer 2 · 2017-10-03T11:31:37.067

Matching balanced parentheses (or other symbols that match and embed) is a classic case of something that regular expressions cannot do in principle. This is a well-known mathematical result, and is beyond dispute (e.g., see exercise 1 here, or see this question). Of course you can write a regexp that handles one level of embedding, or two levels of embedding, but the regular expression language simply cannot handle an unrestricted number of levels.
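
To make the "one level, or two levels" point concrete, here is a minimal Python sketch that builds a pattern for a fixed nesting depth. The helper name `nested_pattern` and the sample string are assumptions made for illustration; whatever depth is chosen, one more level of brackets in the input defeats it.

```python
import re

def nested_pattern(depth):
    """Build a regex for <...> containing at most `depth` nested levels."""
    pattern = r"<[^<>]*>"  # depth 0: no brackets inside
    for _ in range(depth):
        # Wrap the previous pattern: plain characters or one-level-shallower groups.
        pattern = r"<(?:[^<>]|" + pattern + r")*>"
    return pattern

text = "<a <b <c> d> e>"  # two levels of embedding
for depth in range(3):
    print(depth, re.findall(nested_pattern(depth), text))
```

Only the depth-2 pattern captures the whole string; the shallower ones settle for the inner parts they can still balance. The pattern also grows with every level added.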

The truth is that today's "perl-style regular expressions" have enough extensions (such as `\1` backreferences) that it is possible to hack something together for some non-regular (as they are called) tasks... not sure if yours is included. But it's going to be ugly and complicated, and you run the risk that it'll blow up to exponential runtimes with the wrong input.

I recommend you put aside regular expressions, and write a loop that iterates over characters in your text and counts the embedding level. Simple, readable, and done in a single pass on your string, no backtracking:

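text = "<text text <text text> text text>"  # sample input; substitute your own string

starts = []  # positions of "<" characters that are still open
for i in range(len(text)):
    if text[i] == "<":
        starts.append(i)
    elif text[i] == ">" and len(starts) > 0:
        # Close the most recently opened "<" and print the span it delimits.
        print(text[starts[-1]:i+1])
        starts.pop()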
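
The loop above prints every balanced pair, inner ones included. If only the outermost spans are wanted, as the question's sample suggests, one possible variation is sketched below. The function name `outermost_spans` and the choice to process one line at a time are assumptions, not part of the original answer.

```python
def outermost_spans(line):
    """Return the balanced <...> spans in `line` not nested inside another one."""
    starts = []  # positions of "<" characters that are still open
    spans = []   # (start, end) pairs of balanced spans kept so far
    for i, ch in enumerate(line):
        if ch == "<":
            starts.append(i)
        elif ch == ">" and starts:
            start = starts.pop()
            # Spans recorded earlier that begin after `start` lie inside the new one.
            while spans and spans[-1][0] > start:
                spans.pop()
            spans.append((start, i + 1))
    return [line[a:b] for a, b in spans]

rows = [
    "<text text <text text> text text>",
    "<text text> <text text text <text text text>",
]
for row in rows:
    print(outermost_spans(row))
```

An unmatched `<` simply stays on the stack and is dropped at the end of the line, and a stray `>` is ignored, so the nesting depth is not limited.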

Thank you for your input. For this task i’m still more inclined to use regexps, such as [this](http://stackoverflow.com/questions/46541853/regexp-filtering-a-pattern-enclosed-by-the-same-pattern#comment80046644_46543010) well working one. — Россарх, Oct 03 '17 at 14:11
Suit yourself. I guess you never embed more than one level of brackets... — alexis, Oct 03 '17 at 14:30
