I'm iterating over an HTML tree with lxml's iterwalk and would like to replace all <br> tags inside <pre> elements with newline characters. This is what I have so far:

from lxml import etree
import lxml.html

root = lxml.html.fromstring(text)
for action, el in etree.iterwalk(root):
    if el.tag == 'pre':
        for br in el.xpath('br'):
            pass  # replace this <br> tag with "\n"

If at all possible, the replacement should happen inside this loop: we need the loop anyway, and doing this step there would probably be the most efficient approach.

There's a similar question/answer on SO, but it didn't help solve the problem: How can one replace an element with text in lxml?

Simon Steinberger

2 Answers

The drop_tree() method is exactly what you need:

.drop_tree():

Drops the element and all its children. Unlike el.getparent().remove(el) this does not remove the tail text; with drop_tree the tail text is merged with the previous element.
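To make the difference described above concrete, here is a minimal sketch that applies both calls to the same tiny fragment. The sample markup and the variable names are made up for illustration.

```python
import lxml.html

html = "<p>one<br>two</p>"

# remove() detaches the element together with its tail text ("two").
removed = lxml.html.fromstring(html)
br = removed.find("br")
br.getparent().remove(br)
removed_text = removed.text_content()

# drop_tree() detaches the element but keeps the tail text,
# merging it into the text that precedes the element.
dropped = lxml.html.fromstring(html)
dropped.find("br").drop_tree()
dropped_text = dropped.text_content()

print(repr(removed_text))  # 'one'
print(repr(dropped_text))  # 'onetwo'
```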

Find all br elements inside pre, prepend \n to each one's tail, and drop the element:

from lxml import etree
import lxml.html

text = """
<div>
    <pre>
        <br>
        test
        <br>
    </pre>
    <br>
</div>
"""

root = lxml.html.fromstring(text)
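for action, el in etree.iterwalk(root):
    if el.tag == 'pre':
        for br in el.xpath('br'):
            # br.tail is None when nothing follows the tag, so fall back to ''
            br.tail = '\n' + (br.tail or '')
            # drop_tree() removes the <br> but keeps its tail text
            br.drop_tree()

print(etree.tostring(root, encoding='unicode'))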

Prints:

<div>
    <pre>


        test


    </pre>
    <br/>
</div>
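
Note that el.xpath('br') only matches <br> tags that are direct children of the <pre>. If <br> tags can also be nested deeper (for example inside a <b>), a variant along these lines might help. This is only a sketch: br_to_newline is a hypothetical helper name, and the sample markup is invented for the example.

```python
import lxml.html
from lxml import etree


def br_to_newline(pre):
    """Replace every <br> below `pre` with a newline (hypothetical helper)."""
    # list() takes a snapshot so the tree can be modified while looping.
    for br in list(pre.iter("br")):
        # The tail is None when no text follows the tag.
        br.tail = "\n" + (br.tail or "")
        br.drop_tree()


root = lxml.html.fromstring("<div><pre>a<br><b>b<br></b>c</pre><br></div>")
for action, el in etree.iterwalk(root):
    if el.tag == "pre":
        br_to_newline(el)

# Serialize as HTML; the <br> outside the <pre> is left untouched.
result = lxml.html.tostring(root, encoding="unicode")
print(result)
```

The replacement still happens inside the iterwalk loop, so any other per-element work can stay in the same pass.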
alecxe

I understand that lxml is a requirement for you, but parsing and modifying HTML with BeautifulSoup is much easier and more pleasant. If speed really matters here, you can use lxml as the underlying parser:

from bs4 import BeautifulSoup

text = """
<div>
    <pre>
        <br>
        test
        <br>
    </pre>
    <br>
</div>
"""

soup = BeautifulSoup(text, "lxml")
for pre in soup.find_all('pre'):
    for br in pre.find_all('br'):
        br.replace_with('\n')

print(soup.prettify())

Prints:

<html>
 <body>
  <div>
   <pre>


        test


    </pre>
   <br/>
  </div>
 </body>
</html>
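
prettify() re-indents the markup for display. If the result should be serialized without added indentation, str(soup) may be a better fit. A minimal sketch, assuming the built-in "html.parser" backend instead of "lxml" (so no <html>/<body> wrapper is added) and using made-up sample markup:

```python
from bs4 import BeautifulSoup

# "html.parser" is an assumption made here to keep the example
# dependency-free; the answer above uses "lxml".
soup = BeautifulSoup("<div><pre>a<br>b</pre><br></div>", "html.parser")
for pre in soup.find_all("pre"):
    for br in pre.find_all("br"):
        br.replace_with("\n")

# str() serializes the tree without re-indenting it.
result = str(soup)
print(result)
```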
alecxe