I have a string like this:
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>
I would like to strip the first 3 opening and the last 3 closing tags from the string. I do not know the tag names in advance.
I can strip the first 3 tags with re.sub(r'<[^<>]+>', '', in_str, count=3). How do I strip the closing tags? What should remain is:
<v1>aaa<b>bbb</b>ccc</v1>
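For reference, here is a minimal runnable sketch of the left-hand strip that already works. It assumes the sample string ends in a proper closing tag, and the variable name left_stripped is only illustrative:

```python
import re

# Sample input, assumed to end with a proper closing tag.
in_str = "<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>"

# Remove the first three tags, scanning left to right.
# count=3 limits the number of substitutions.
left_stripped = re.sub(r'<[^<>]+>', '', in_str, count=3)

print(left_stripped)
# <v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>
```

The opening tags are gone, but the three trailing closing tags are still there, which is the part I am asking about.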
I know I could maybe 'do it right', but I actually do not wish to do XML or HTML parsing for my purpose, which is to help me visualize the XML representation of some classes.
Instead, I realized that this problem is interesting. It seems I cannot simply search backwards with a regex, i.e. right to left, because that seems to be unsupported:
If you mean, find the right-most match of several (similar to the rfind method of a string) then no, it is not directly supported. You could use re.findall() and chose the last match but if the matches can overlap this may not give the correct result.
But .rstrip is not good with words (it strips a set of characters, not a suffix), and it won't do patterns either.
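A small sketch of why .rstrip does not fit here: its argument is treated as a set of characters rather than as a literal suffix, so it can eat into the neighbouring tag. The shortened sample string is just an assumption for illustration:

```python
# Shortened sample, for illustration only.
s = "<v1>ccc</v1></k2></bar></foo>"

# rstrip removes any trailing run of the characters '<', '/', 'f', 'o', '>',
# not the literal suffix "</foo>".
result = s.rstrip("</foo>")

print(result)
# <v1>ccc</v1></k2></bar
```

Note that the closing '>' of </bar> was removed as well, because '>' is in the character set.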
I looked at Strip HTML from strings in Python but I only wish to strip up to 3 tags.
What approach could be used here? Should I reverse the string (ugly in itself, and even more so because of the '<>'s)? Do tokenization (why not parse, then?)? Or create static closing tags based on the left-to-right match?
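To illustrate the string-reversal option, here is a rough sketch, assuming a well-formed sample and illustrative variable names. It works, but the pattern has to be written backwards too, which is the ugliness I mean:

```python
import re

# Sample input, assumed to end with a proper closing tag.
in_str = "<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>"

# Reverse the string so the trailing closing tags come first.
reversed_str = in_str[::-1]

# In the reversed text a closing tag such as </foo> reads as >oof/<,
# so the pattern must be mirrored as well.
stripped_reversed = re.sub(r'>[^<>]+/<', '', reversed_str, count=3)

# Reverse back to restore the normal reading order.
right_stripped = stripped_reversed[::-1]

print(right_stripped)
# <foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1>
```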
Which strategy should I follow to strip the patterns from the end of the string?