Python keeping newlines in lxml.html after cssselect and text_content()

Question

In Python, how do I preserve paragraphs (i.e. keep newlines) when using lxml.html?

For example, the following will strip <p></p> tags and join the lines, which is not what I want:

body = doc.cssselect("div.body")[0]
content = body.text_content()

Here's what I've tried that doesn't work:

- lxml.html.clean.clean_html: won't preserve the newlines.
- content.replace(" "*3, "\n\n"): doesn't work consistently, because the combined text does not always have the same number of spaces between paragraphs.

score 2 · Accepted Answer · answered Nov 22 '10 at 16:06

The lxml text_content method is doing what it is supposed to do according to the docs: it strips the HTML tags and leaves the text behind.

You can fix this up by adding your own newlines before outputting the content.

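body = doc.cssselect("div.body")[0]
# ".//p" matches <p> elements at any depth, including direct children of body
for para in body.xpath(".//p"):
    # para.text is None when the paragraph starts with a child element
    para.text = "\n%s\n" % (para.text or "")
content = body.text_content()
print(content)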
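A self-contained variant of the same idea, as a rough sketch rather than a definitive fix: it appends the newlines to each paragraph's `.tail` instead of its `.text`, so the break lands after the whole paragraph even when the paragraph contains inline children such as `<b>`. The `SAMPLE` markup and the helper name `text_with_paragraph_breaks` are made up for illustration, and `xpath` is used in place of `cssselect` only to avoid the extra `cssselect` package.

```python
import lxml.html

# Hypothetical sample markup standing in for the real page.
SAMPLE = """
<div class="body"><p>First <b>bold</b> paragraph.</p><p>Second paragraph.</p></div>
"""


def text_with_paragraph_breaks(body):
    """Return body's text with a blank line after each <p> element.

    Note: this modifies the parsed tree in place.
    """
    for para in body.xpath(".//p"):
        # .tail is the text that follows the closing </p>, so a newline
        # placed here comes after everything inside the paragraph.
        para.tail = "\n\n" + (para.tail or "")
    return body.text_content().strip()


doc = lxml.html.fromstring(SAMPLE)
body = doc.xpath('//div[@class="body"]')[0]
content = text_with_paragraph_breaks(body)
print(content)
```

Because the tree is changed in place, parse the document again (or work on a copy) if the original markup is still needed afterwards.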

Vince Spicer

Thanks, this is what I ended up doing: `paragraphs = self.doc.cssselect('div#body p')`, `paragraph_text = [paragraph.text_content() for paragraph in paragraphs]`, `content = '\n\n'.join(paragraph_text)` – Lionel Nov 23 '10 at 09:58
`body.xpath("*//p")` doesn't work for me, I changed it to `body.xpath("./p")`. +1 anyway – Tyler Liu Mar 12 '14 at 15:53
See also https://stackoverflow.com/questions/18660382/how-can-i-preserve-br-as-newlines-with-lxml-html-text-content-or-equivalent, which does a better replacement. – bortzmeyer Apr 11 '18 at 08:39
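
The per-paragraph approach from the first comment can be sketched as a runnable snippet. This is only an illustration under stated assumptions: the `SAMPLE` markup, the helper name `paragraphs_to_text`, and the default XPath are all hypothetical, and `xpath` stands in for `cssselect('div#body p')` to avoid depending on the `cssselect` package.

```python
import lxml.html

# Hypothetical sample markup; the real page and selector will differ.
SAMPLE = '<div id="body"><p>One <i>two</i></p><p>Three</p></div>'


def paragraphs_to_text(root, xpath='//div[@id="body"]//p'):
    """Join the text of each matched paragraph with a blank line."""
    # text_content() flattens inline children (<i>, <b>, ...) per paragraph.
    parts = [p.text_content().strip() for p in root.xpath(xpath)]
    # Skip paragraphs that are empty after stripping whitespace.
    return "\n\n".join(part for part in parts if part)


doc = lxml.html.fromstring(SAMPLE)
joined = paragraphs_to_text(doc)
print(joined)
```

This leaves the parsed tree untouched, but any text inside the container that is not wrapped in a `<p>` is dropped from the result.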
