1

I have a string like this:

<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>

I would like to strip the first 3 opening and the last 3 closing tags from the string. I do not know the tag names in advance.

I can strip the first 3 tags with re.sub(r'<[^<>]+>', '', in_str, 3). How do I strip the closing tags? What should remain is:

<v1>aaa<b>bbb</b>ccc</v1>

I know I could 'do it right', but I do not want to do XML or HTML parsing for my purpose, which is to help me visualize the XML representation of some classes.

Instead, I realized that this problem is interesting. It seems I cannot simply search backwards with a regex, i.e. right to left, because that seems to be unsupported:

If you mean, find the right-most match of several (similar to the rfind method of a string) then no, it is not directly supported. You could use re.findall() and chose the last match but if the matches can overlap this may not give the correct result.

But .rstrip() strips a set of characters rather than a word, and it won't do patterns either.

I looked at Strip HTML from strings in Python but I only wish to strip up to 3 tags.

What approach could be used here? Should I reverse the string (ugly in itself, and even more so because of the '<>'s)? Tokenize (but then why not parse)? Or build static closing tags from the left-to-right match?

Which strategy should I follow to strip the patterns from the end of the string?

n611x007
  • But parsing *is* the right answer to this problem. Why don't you want to do it? – Daniel Roseman Mar 18 '14 at 12:10
  • @DanielRoseman because it is too heavy-weight for *this* task. This is temporary visualization which will be thrown away. Parsing will do but it's like using a sledgehammer to crack a nut. – n611x007 Mar 18 '14 at 12:43

4 Answers

3

The simplest approach is old-fashioned string splitting with a limit on the number of splits:

in_str.split('>', 3)[-1].rsplit('<', 3)[0]

Demo:

>>> in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
>>> in_str.split('>', 3)[-1].rsplit('<', 3)[0]
'<v1>aaa<b>bbb</b>ccc</v1>'

With a limit, str.split() and str.rsplit() split the string at most that many times, working from the start or from the end respectively, so you can select the remainder that was left unsplit.
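
To make the limit explicit, the one-liner above could be wrapped in a small helper. This is only a sketch: the name strip_outer_tags and the parameter n are made up for illustration, and it assumes that the first n '>' characters and the last n '<' characters really belong to the wrapper tags.

```python
def strip_outer_tags(text, n=3):
    """Drop everything up to the n-th '>' and everything from the n-th-last '<'."""
    # split('>', n) cuts at most n times from the left; [-1] is the unsplit rest.
    inner = text.split('>', n)[-1]
    # rsplit('<', n) cuts at most n times from the right; [0] is the unsplit head.
    return inner.rsplit('<', n)[0]


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(strip_outer_tags(in_str))     # <v1>aaa<b>bbb</b>ccc</v1>
print(strip_outer_tags(in_str, 1))  # <bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar>
```

Because it only counts angle brackets, the helper never checks that what it removes is a matching pair of tags.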

Martijn Pieters
2

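You've already got practically the whole solution. re can't search backwards, but you can reverse the string yourself:

import re

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
# Strip the first three tags, scanning left to right.
in_str = re.sub(r'<[^<>]+>', '', in_str, count=3)
# Reverse the string so that its end becomes its start.
in_str = in_str[::-1]
print(in_str)
# Closing tags read backwards as '>name/<', so the pattern is mirrored too.
in_str = re.sub(r'>[^<>]+/<', '', in_str, count=3)
# Reverse again to restore the original order.
in_str = in_str[::-1]

print(in_str)
<v1>aaa<b>bbb</b>ccc</v1>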

Note that the regex is mirrored to match the reversed string; reversing the result once more puts everything back in order.
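
If reversing feels awkward, a different technique (not the one shown above) is to anchor the pattern at the ends of the string. This is a sketch under an extra assumption: the wrapper tags sit directly at the very start and very end of the string, with nothing between them. The function name strip_anchored is made up for illustration.

```python
import re


def strip_anchored(text, n=3):
    """Remove n opening tags at the very start and n closing tags at the very end."""
    # \A pins the run of n opening tags (no leading '/') to the start.
    text = re.sub(r'\A(?:<[^<>/][^<>]*>){%d}' % n, '', text)
    # \Z pins the run of n closing tags to the end.
    return re.sub(r'(?:</[^<>]+>){%d}\Z' % n, '', text)


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(strip_anchored(in_str))  # <v1>aaa<b>bbb</b>ccc</v1>
```

Unlike the count-based re.sub() calls, this leaves the string untouched at an end where fewer than n tags are found in a row.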

Of course, as mentioned, this is way easier with a proper parser:

from lxml import etree

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
ix = etree.fromstring(in_str)
# ix is <foo>; ix[0][0][0] walks down through <bar> and <k2> to <v1>.
print(etree.tostring(ix[0][0][0], encoding='unicode'))
<v1>aaa<b>bbb</b>ccc</v1>
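
If installing lxml is too much for a throwaway script, the same indexing approach should also work with the standard library's xml.etree.ElementTree. A minimal sketch, assuming the input is well-formed XML with a single root element:

```python
import xml.etree.ElementTree as ET

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
root = ET.fromstring(in_str)   # root is the <foo> element
inner = root[0][0][0]          # <foo> -> <bar> -> <k2> -> <v1>
# encoding='unicode' returns a str instead of bytes.
result = ET.tostring(inner, encoding='unicode')
print(result)  # <v1>aaa<b>bbb</b>ccc</v1>
```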
Corley Brigman
  • +1 Nice back-pattern and the notation for the string reversing. It will solve reverse-matching. Still, for this specific case, Martijn's [answer](http://stackoverflow.com/a/22478982/611007) beats it with the simplicity. I think adding the elementTree example makes this the best complementary reference answer. – n611x007 Mar 18 '14 at 12:47
1

I would look into regular expressions and split the string on a tag pattern with re.split():

http://docs.python.org/3/library/re.html?highlight=regex#re.regex.split
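
As a rough illustration of that idea: when the pattern passed to re.split() contains a capturing group, the matched separators are kept in the result, so the string can be tokenized into tags and text and the outer tokens sliced off. This sketch assumes the first three and last three tokens are the wrapper tags, as in the question's example.

```python
import re

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'

# The capturing group makes re.split() return the tags as well as the text
# between them; the filter drops the empty strings between adjacent tags.
tokens = [t for t in re.split(r'(<[^<>]+>)', in_str) if t]

# Slice off three tokens at each end and glue the rest back together.
result = ''.join(tokens[3:-3])
print(result)  # <v1>aaa<b>bbb</b>ccc</v1>
```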

scjepsen
1

Sorry, can't comment, but will give it as an answer.

in_str.split('>', 3)[-1].rsplit('<', 3)[0] will work for the given example <foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>, but not for <foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo><another>test</another>. You should just be aware of this.

To handle my counterexample, you would have to track the state (or count) of the tags and verify that you match the correct pairs.
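
A minimal sketch of the counting variant, keeping only what is nested at least n tags deep. It assumes every tag is either an opening or a closing tag (no self-closing tags, comments, or '>' inside attributes) and does not check that tag names pair up; the name strip_levels is made up for illustration.

```python
import re

TAG = re.compile(r'(<[^<>]+>)')  # capturing group: split() keeps the tags


def strip_levels(text, n=3):
    """Keep only the tags and text nested at least n levels deep."""
    depth = 0
    kept = []
    for token in TAG.split(text):
        if token.startswith('</'):
            depth -= 1              # leaving an element
            if depth >= n:
                kept.append(token)
        elif token.startswith('<'):
            if depth >= n:
                kept.append(token)
            depth += 1              # entering an element
        elif depth >= n:
            kept.append(token)      # text inside a kept element
    return ''.join(kept)


s = ('<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
     '<another>test</another>')
print(strip_levels(s))  # <v1>aaa<b>bbb</b>ccc</v1>
```

Here the trailing <another>test</another> never reaches depth 3, so it is dropped together with the wrapper tags.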

SCI
  • Yes, thanks! The `test` variant is not my case; my input always has the format shown in my example. It is a common corner case of the more general problem, for which the answer is indeed parsing. However, I was mainly interested in the right-to-left string manipulation aspect. – n611x007 Mar 18 '14 at 16:02