
I'm using ElementTree to modify the following XML:

<li>
  <p>Some stuff goes in <b>bold</b> here </p>
</li>

I would like to remove all <p> from my <li> elements but keep the contents.

Like this:

<li>Some stuff goes in <b>bold</b> here</li>

I am currently using the following code, which works only in simple cases (no existing text or tail to preserve, etc.):

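# strip <p> from <li> elements
liElements = rootNode.findall('.//li')
for elem in liElements:
    # look at direct children only: elem.remove() fails for deeper descendants
    para = elem.find("p")
    if para is None:
        continue  # this <li> has no <p> to strip
    # note: overwrites any existing elem.text and drops para.tail
    elem.text = para.text
    for child in para:
        elem.append(child)
    elem.remove(para)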

There must be an easier way to just strip a formatting tag... I hope?

aldeb
akevan
  • Looks like you are processing HTML instead; unless it is really XHTML, use an HTML parser. The [BeautifulSoup HTML library](http://www.crummy.com/software/BeautifulSoup/bs4/) has a [`.unwrap()` method](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unwrap) for just this task. – Martijn Pieters May 27 '13 at 20:16
  • My example uses HTML tags but my content is not just HTML. It has mostly custom tags. But I imagine it's all the same to a parser... all of my parsing code (there's a lot) uses ElementTree so I'd like to find a way to use that before converting to a different parsing library – akevan May 27 '13 at 20:45
  • With `ElementTree` there is no easier method, I'm afraid. – Martijn Pieters May 27 '13 at 20:56

1 Answer


Perhaps the easiest way is to not use ElementTree to process HTML, but to use BeautifulSoup instead; the library handles unwrapping explicitly through the .unwrap() method:

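from bs4 import BeautifulSoup

soup = BeautifulSoup(markup, 'html.parser')  # markup: your document as a string (assumed name)
for elem in soup.find_all('li'):
    for para in elem.find_all('p'):
        para.unwrap()  # replaces the <p> with its own contents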
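
If switching libraries is not an option, the unwrapping can be done by hand in ElementTree. The following is only a minimal sketch, not a built-in API: `unwrap`, `_append_text`, and `strip_children` are hypothetical helper names. It assumes the `<p>` elements to strip are direct children of the `<li>` elements, and it moves the `<p>` element's text, children, and tail into the parent at the position the `<p>` occupied.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, index, text):
    """Append text just before child position `index` of `parent`."""
    if not text:
        return
    if index == 0:
        # Nothing precedes this position, so the text belongs to the parent.
        parent.text = (parent.text or '') + text
    else:
        # Otherwise it becomes part of the previous sibling's tail.
        prev = parent[index - 1]
        prev.tail = (prev.tail or '') + text


def unwrap(parent, child):
    """Remove `child` from `parent`, keeping its text, children, and tail."""
    index = list(parent).index(child)
    grandchildren = list(child)

    # Text before the first grandchild joins whatever preceded `child`.
    _append_text(parent, index, child.text)

    # Splice the grandchildren in where `child` used to be.
    parent.remove(child)
    for offset, node in enumerate(grandchildren):
        parent.insert(index + offset, node)

    # The tail of `child` follows the last spliced node.
    _append_text(parent, index + len(grandchildren), child.tail)


def strip_children(root, parent_tag, child_tag):
    """Unwrap every direct `child_tag` child of each `parent_tag` element."""
    # Materialize the list first so the tree is not mutated mid-iteration.
    for parent in list(root.iter(parent_tag)):
        child = parent.find(child_tag)
        while child is not None:
            unwrap(parent, child)
            child = parent.find(child_tag)


root = ET.fromstring('<li><p>Some stuff goes in <b>bold</b> here</p></li>')
strip_children(root, 'li', 'p')
print(ET.tostring(root, encoding='unicode'))
# <li>Some stuff goes in <b>bold</b> here</li>
```

Because ElementTree stores text in `.text` and `.tail` rather than as nodes, the bookkeeping in `_append_text` is what the shorter loop in the question leaves out.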
Martijn Pieters
  • Yup, no easier way with ElementTree. I've ported some section of my code over to BeautifulSoup... IMHO it's slightly nicer to use. Pretty close to lxml though. – akevan May 28 '13 at 22:05