In Python, how do I preserve paragraphs (i.e. keep newlines) when using lxml.html?

For example, the following will strip <p></p> tags and join the lines, which is not what I want:

body = doc.cssselect("div.body")[0]
content = body.text_content()
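
A minimal way to reproduce the behaviour described above. The sample markup is made up for illustration (the real page is not shown in the question), and XPath is used in place of cssselect only to keep the sketch free of the extra cssselect dependency:

```python
import lxml.html

# Hypothetical sample markup standing in for the real page.
SAMPLE = (
    '<html><body><div class="body">'
    "<p>First paragraph.</p><p>Second paragraph.</p>"
    "</div></body></html>"
)

doc = lxml.html.fromstring(SAMPLE)
body = doc.xpath('//div[@class="body"]')[0]

# text_content() drops the <p> tags and concatenates the text nodes,
# so nothing marks where one paragraph ended and the next began.
joined = body.text_content()
print(repr(joined))
```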

Here's what I've tried that doesn't work:

  • lxml.html.clean.clean_html:
    • Won't preserve the newlines.
  • content.replace(" "*3,"\n\n"):
    • Doesn't work consistently, because the combined text does not always have the same number of spaces between paragraphs.
Lionel

1 Answer

lxml's text_content() is doing what the docs say it should: it strips the HTML tags and leaves only the text behind.

You can fix this up by adding your own newlines before outputting the content.

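body = doc.cssselect("div.body")[0]
# ".//p" also matches <p> that are direct children of body ("*//p" skips them)
for para in body.xpath(".//p"):
    # para.text is None when a paragraph starts with a child element
    para.text = "\n%s\n" % (para.text or "")
content = body.text_content()
print(content)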
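
As a self-contained variant of the same idea, here is a rough sketch that appends the break to each paragraph's tail instead of its text, so the newline lands after the whole paragraph even when it contains inline tags. The helper name `text_with_paragraph_breaks` and the sample markup are made up for illustration, and note that it modifies the tree in place:

```python
import lxml.html


def text_with_paragraph_breaks(element):
    """Return the element's text with a blank line after each <p>.

    Hypothetical helper, not part of lxml. It mutates `element`.
    """
    for para in element.xpath(".//p"):
        # The tail is the text that follows the closing </p>, so a break
        # placed there comes after the paragraph's full content.
        para.tail = "\n\n" + (para.tail or "")
    # Trim the break left after the last paragraph.
    return element.text_content().strip()


# Hypothetical sample markup with an inline child inside a paragraph.
SAMPLE = (
    '<div class="body">'
    "<p>First <b>bold</b> paragraph.</p><p>Second paragraph.</p>"
    "</div>"
)

body = lxml.html.fromstring(SAMPLE)
content = text_with_paragraph_breaks(body)
print(content)
```

If the tree is needed unchanged afterwards, it would have to be copied first or re-parsed.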
Vince Spicer
  • Thanks, this is what I ended up doing: `paragraphs = self.doc.cssselect('div#body p')`, then `paragraph_text = [paragraph.text_content() for paragraph in paragraphs]`, then `content = '\n\n'.join(paragraph_text)` – Lionel Nov 23 '10 at 09:58
  • `body.xpath("*//p")` doesn't work for me, I changed it to `body.xpath("./p")`. +1 anyway – Tyler Liu Mar 12 '14 at 15:53
  • See also https://stackoverflow.com/questions/18660382/how-can-i-preserve-br-as-newlines-with-lxml-html-text-content-or-equivalent, which does a better replacement. – bortzmeyer Apr 11 '18 at 08:39
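
The approach from Lionel's comment above (take each paragraph's text separately, then join with blank lines) might look roughly like this as a standalone script. The sample markup is invented, and the XPath expression is an assumed stand-in for the `div#body p` CSS selector so that cssselect is not required:

```python
import lxml.html

# Hypothetical sample markup; the real page is not shown in the thread.
SAMPLE = (
    '<div id="body">'
    "<p>First <i>italic</i> paragraph.</p><p>Second paragraph.</p>"
    "</div>"
)

doc = lxml.html.fromstring(SAMPLE)

# Roughly equivalent to the CSS selector 'div#body p'.
paragraphs = doc.xpath('//div[@id="body"]//p')

# text_content() is called per paragraph, so inline tags are flattened
# but the paragraph boundaries survive as blank lines.
paragraph_text = [paragraph.text_content() for paragraph in paragraphs]
content = "\n\n".join(paragraph_text)
print(content)
```

This leaves the tree untouched, but any text inside the div that is not within a <p> is dropped.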