How do I strip patterns or words from the end of the string backwards?

Question

I have a string like this:

<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>

I would like to strip the first 3 opening and the last 3 closing tags from the string. I do not know the tag names in advance.

I can strip the first 3 tags with re.sub(r'<[^<>]+>', '', in_str, 3). How do I strip the closing tags? What should remain is:

<v1>aaa<b>bbb</b>ccc</v1>

I know I could maybe 'do it right', but I do not want to do XML or HTML parsing here. My only purpose is to help myself visualize the XML representation of some classes.

Instead, I realized that this problem is interesting in itself. It seems I cannot simply search backwards (i.e. right to left) with a regex, because that appears to be unsupported:

If you mean, find the right-most match of several (similar to the rfind method of a string) then no, it is not directly supported. You could use re.findall() and choose the last match but if the matches can overlap this may not give the correct result.

But .rstrip is not good with words, and won't do patterns either.

I looked at Strip HTML from strings in Python but I only wish to strip up to 3 tags.

What approach could be used here? Should I reverse the string (ugly in itself, and more so because of the '<>'s)? Tokenize (but then why not parse)? Or build static closing tags from the left-to-right matches?

Which strategy should I follow to strip the patterns from the end of the string?

But parsing *is* the right answer to this problem. Why don't you want to do it? — Daniel Roseman, Mar 18 '14 at 12:10
@DanielRoseman because it is too heavy-weight for *this* task. This is temporary visualization which will be thrown away. Parsing will do but it's like using a sledgehammer to crack a nut. — n611x007, Mar 18 '14 at 12:43

score 3 · Accepted Answer · answered Mar 18 '14 at 12:10

The simplest approach is old-fashioned string splitting with a limit on the number of splits:

in_str.split('>', 3)[-1].rsplit('<', 3)[0]

Demo:

>>> in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
>>> in_str.split('>', 3)[-1].rsplit('<', 3)[0]
'<v1>aaa<b>bbb</b>ccc</v1>'

str.split() and str.rsplit() with a limit split the string at most that many times, starting from the beginning or the end respectively, and leave the remainder unsplit so you can select it.
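
To make the intermediate lists visible, here is a small sketch of the same one-liner taken apart. The helper name strip_outer_tags is only for illustration, and it assumes the tags to drop sit at the very start and end of the string, as in the question.

```python
def strip_outer_tags(s, n=3):
    """Drop the first n opening and last n closing tags (hypothetical helper)."""
    return s.split('>', n)[-1].rsplit('<', n)[0]

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'

# Split at the first three '>' only; the last item is the unsplit remainder.
parts = in_str.split('>', 3)
# parts == ['<foo', '<bar', '<k2', '<v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>']

# Split the remainder at the last three '<' only; the first item is what we keep.
tail = parts[-1].rsplit('<', 3)
# tail == ['<v1>aaa<b>bbb</b>ccc</v1>', '/k2>', '/bar>', '/foo>']

print(strip_outer_tags(in_str))
```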

Martijn Pieters

Haha! Gotcha, seems I couldn't see the wood for the trees. – n611x007 Mar 18 '14 at 12:14
yeah, this is definitely the best solution for this particular problem. – Corley Brigman Mar 18 '14 at 13:19

score 2 · Answer 2 · answered Mar 18 '14 at 12:26

You already have almost the whole solution. re can't search backwards, but you can reverse the string yourself:

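import re

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
# Strip the first three tags, scanning left to right.
in_str = re.sub(r'<[^<>]+>', '', in_str, count=3)
# Reverse the string so the closing tags come first.
in_str = in_str[::-1]
print(in_str)
# Each closing tag now reads '>name/<', so the pattern is mirrored as well.
in_str = re.sub(r'>[^<>]+/<', '', in_str, count=3)
# Reverse again to restore the original order.
in_str = in_str[::-1]

print(in_str)
# prints: <v1>aaa<b>bbb</b>ccc</v1>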

Note that the regex is mirrored to match the reversed string, so the substitution effectively runs back to front.
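
If reversing feels too ugly, a different technique (not the reversal above) is to anchor the pattern at the end of the string with \Z and require exactly n closing tags in a row. This is only a sketch: the helper name strip_trailing_closing_tags is made up for illustration, and it assumes the closing tags sit directly at the end with nothing between or after them. If there are fewer than n such tags, nothing is removed.

```python
import re

def strip_trailing_closing_tags(s, n=3):
    """Remove exactly n consecutive closing tags from the end of s (hypothetical helper)."""
    # (?:</...>){n} matches n closing tags back to back; \Z pins them to the end.
    pattern = r'(?:</[^<>]+>){%d}\Z' % n
    return re.sub(pattern, '', s)

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
# Strip the first three tags as in the question, then the last three closing tags.
without_head = re.sub(r'<[^<>]+>', '', in_str, count=3)
result = strip_trailing_closing_tags(without_head)
print(result)
```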

Of course, as mentioned, this is way easier with a proper parser:

from lxml import etree

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
ix = etree.fromstring(in_str)
# ix is <foo>; index three levels down to reach <v1>.
print(etree.tostring(ix[0][0][0], encoding='unicode'))
# prints: <v1>aaa<b>bbb</b>ccc</v1>

+1 Nice back-pattern and the notation for the string reversing. It will solve reverse-matching. Still, for this specific case, Martijn's [answer](http://stackoverflow.com/a/22478982/611007) beats it with the simplicity. I think adding the elementTree example makes this the best complementary reference answer. — n611x007, Mar 18 '14 at 12:47

score 1 · Answer 3 · answered Mar 18 '14 at 12:14

I would look into regular expressions and split the string on a pattern that matches the tags:

http://docs.python.org/3/library/re.html?highlight=regex#re.regex.split
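
Since the maxsplit argument of re.split() only counts from the left, one way to act on this idea is to collect the tag matches with re.finditer() and slice between the n-th match from the start and the n-th match from the end. This is a sketch under assumptions: the helper name strip_n_tags_each_side is invented here, and it does not check that the stripped tags sit at the very ends of the string or that they pair up.

```python
import re

TAG = re.compile(r'<[^<>]+>')

def strip_n_tags_each_side(s, n=3):
    """Keep the text between the n-th tag and the n-th-from-last tag (hypothetical helper)."""
    # Tag matches cannot overlap, because the pattern excludes '<' and '>' inside a tag.
    matches = list(TAG.finditer(s))
    if len(matches) < 2 * n:
        # Not enough tags to strip n from each side; leave the string alone.
        return s
    start = matches[n - 1].end()   # just after the n-th tag from the left
    end = matches[-n].start()      # just before the n-th tag from the right
    return s[start:end]

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(strip_n_tags_each_side(in_str))
```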

scjepsen

hm, yeah, maybe I could split based on the pattern for the last 3 tags. – n611x007 Mar 18 '14 at 12:16

score 1 · Answer 4 · answered Mar 18 '14 at 14:21

Sorry, can't comment, but will give it as an answer.

in_str.split('>', 3)[-1].rsplit('<', 3)[0] will work for the given example <foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>, but not for <foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo><another>test</another>. You just should be aware of this.

To handle my counterexample, you would have to track the state (or count) of the tags and check that you match the correct pairs.
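
A rough sketch of what tracking the tag count could look like: walk over the tags, keep a depth counter, and return the text between the n-th nested opening tag and its matching closing tag. The function name inner_of_nth_tag is hypothetical, and the sketch assumes well-formed input with no self-closing tags, comments, or '<' and '>' inside attribute values. It does not verify that tag names pair up.

```python
import re

# Group 1 is '/' for a closing tag and empty for an opening tag.
TAG = re.compile(r'<(/?)[^<>]+>')

def inner_of_nth_tag(s, n=3):
    """Return the content of the element nested n levels deep (hypothetical helper)."""
    depth = 0
    start = None
    for m in TAG.finditer(s):
        if m.group(1) != '/':
            # Opening tag: one level deeper.
            depth += 1
            if depth == n and start is None:
                start = m.end()
        else:
            # Closing tag: if it closes the n-th level element, we are done.
            if depth == n and start is not None:
                return s[start:m.start()]
            depth -= 1
    raise ValueError('no element found at nesting depth %d' % n)

in_str = ('<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
          '<another>test</another>')
print(inner_of_nth_tag(in_str))
```

Unlike the split-based one-liner, this stops at the closing tag that matches the third opening tag, so the trailing <another>test</another> does not affect the result.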

SCI

yes, thanks! `test` is not my case. In my case the format is as in my example. Yours is a common corner case of the broader problem, for which the general answer is indeed parsing. However, I was mainly interested in the right-to-left string manipulation aspect. – n611x007 Mar 18 '14 at 16:02
