In Python, how do I preserve paragraphs (i.e. keep newlines) when using lxml.html?
For example, the following strips the <p></p> tags and joins the lines together, which is not what I want:
body = doc.cssselect("div.body")[0]
content = body.text_content()
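A minimal, self-contained reproduction of the problem might look like the sketch below. The sample HTML is made up for illustration, and find_class is used in place of cssselect only so the snippet runs without the separate cssselect package.

```python
import lxml.html

# Hypothetical sample markup: two paragraphs inside div.body.
html = (
    '<html><body><div class="body">'
    "<p>First paragraph.</p><p>Second paragraph.</p>"
    "</div></body></html>"
)

doc = lxml.html.fromstring(html)

# find_class("body") stands in for doc.cssselect("div.body") here.
body = doc.find_class("body")[0]

# text_content() concatenates all text nodes and drops the tags,
# so nothing marks where one paragraph ends and the next begins.
content = body.text_content()
print(repr(content))
```

The two paragraphs come back as one run of text with no newline between them.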
Here's what I've tried that doesn't work:
- lxml.html.clean.clean_html:
  - Doesn't preserve the newlines.
- content.replace(" "*3,"\n\n"):
  - Doesn't work consistently, because the combined text doesn't always have the same number of spaces between paragraphs.