
Having a text sample of the following layout...

<​text text <​text text> text text>

<​text text> <text text text <text ​text text>

...I need to capture the <​enclosed parts​> as highlighted.

With a lazy quantifier, the regexp <.*?> returns...

<​text text <​text text> text text>

<​text text> <​text text text text <​text text>

...which misses the upper right part and wrongly includes the middle bottom part. I’ve also tried <.[^<]*?>, which gets the second row right but misses both the left and right parts on the first:

<​text text <​text text> text text>

<​text text> <​text text text text <​text text>

How would the regexp look to <​work as above​>?

Россарх
  • Duplicate of https://stackoverflow.com/questions/46541043/check-words-that-start-and-end-with-same-letter-in-c-sharp/46541129#comment80036711_46541129 Use “grouping” and `\1`. Furthermore, regexp IS NOT a good tool for matching XML-like syntactic languages: https://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – Gaétan RYCKEBOER Oct 03 '17 at 09:59
  • Consider using a parser instead. – Jan Oct 03 '17 at 10:00
  • Fix as `<[^<>]*>`, but if you are dealing with HTML/XML, you definitely need a parser. – Wiktor Stribiżew Oct 03 '17 at 10:01
  • Unfortunately this produces the same result as my second try. Also, for clarification, this isn’t a web markup language, but rather a syntax that was used to label parts of an ordinary text file. – Россарх Oct 03 '17 at 10:24

2 Answers

$ grep -Pzo "(?:<(?:[^<]|[<].*[>])*>)*" /tmp/test1
<​text text <​text text> text text><​text text><text ​text text>

$ cat /tmp/test1

<​text text <​text text> text text>

<​text text> <text text text <text ​text text>

Alternatively, drop the multiline processing (the -z flag):

$ grep -Po "(?:<(?:[^<]|[<].*[>])*>)*" /tmp/test1
<​text text <​text text> text text>
<​text text>
<text ​text text>
Calvin Taylor

Matching balanced parentheses (or any other symbols that pair up and nest) is a classic example of something regular expressions cannot do in principle. This is a well-known mathematical result and is beyond dispute (e.g., see exercise 1 here, or see this question). Of course you can write a regexp that handles one level of nesting, or two levels, but the regular expression language simply cannot handle an unrestricted number of levels.
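To make the "one level of embedding" case concrete, here is a minimal sketch using Python's standard `re` module. The pattern and the helper name `one_level_matches` are illustrative assumptions, not something taken from this thread, and the sample strings are simplified stand-ins for the text in the question.

```python
import re

# One level of nesting only: a "<", then any mix of plain characters
# and complete inner <...> groups, then the closing ">".
ONE_LEVEL = re.compile(r"<(?:[^<>]|<[^<>]*>)*>")

def one_level_matches(text):
    """Return every <...> group, allowing at most one nested level inside."""
    return ONE_LEVEL.findall(text)

print(one_level_matches("<a b <c d> e f>"))       # ['<a b <c d> e f>']
print(one_level_matches("<g h> <i j k <l m n>"))  # ['<g h>', '<l m n>']
```

With a second level of nesting, as in `<a <b <c> d> e>`, this pattern only finds `<b <c> d>` and drops the outermost group without any warning, which is the limitation described above.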

The truth is that today's "Perl-style regular expressions" have enough extensions (such as \1 backreferences) that it is possible to hack something together for some non-regular (as they are called) tasks... I'm not sure whether yours is among them. But it's going to be ugly and complicated, and you run the risk that it will blow up to exponential runtime on the wrong input.

I recommend you put aside regular expressions, and write a loop that iterates over characters in your text and counts the embedding level. Simple, readable, and done in a single pass on your string, no backtracking:

starts = []  # positions of "<" characters that are not yet closed
for i in range(len(text)):
    if text[i] == "<":
        starts.append(i)
    elif text[i] == ">" and len(starts) > 0:
        # a ">" closes the most recently opened "<"
        print(text[starts[-1]:i+1])
        starts.pop()
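As a usage sketch, the same loop can be wrapped in a function that returns the spans instead of printing them. The function name `enclosed_parts`, the sorting by start position, and the shortened sample string are assumptions made for illustration.

```python
def enclosed_parts(text):
    """Return every balanced <...> span in text, ordered by where it starts."""
    starts = []  # indices of "<" characters still waiting for a ">"
    found = []   # (start, end) index pairs of completed spans
    for i, ch in enumerate(text):
        if ch == "<":
            starts.append(i)
        elif ch == ">" and starts:
            # pair this ">" with the most recently opened "<"
            found.append((starts.pop(), i + 1))
    return [text[a:b] for a, b in sorted(found)]

sample = "<a b <c d> e f>\n<g h> <i j k <l m n>"
print(enclosed_parts(sample))
# ['<a b <c d> e f>', '<c d>', '<g h>', '<l m n>']
```

The unmatched `<i j k` is never reported because its `<` is still on the stack when the text ends. Note that the stack is not reset at line breaks, so a stray `<` could pair with a `>` on a later line; clearing `starts` at each newline would be one way to avoid that.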
alexis
  • Thank you for your input. For this task I’m still more inclined to use regexps, such as [this](http://stackoverflow.com/questions/46541853/regexp-filtering-a-pattern-enclosed-by-the-same-pattern#comment80046644_46543010) one, which works well. – Россарх Oct 03 '17 at 14:11
  • Suit yourself. I guess you never embed more than one level of brackets... – alexis Oct 03 '17 at 14:30