That's because of the /text()
part - it would get all the text nodes directly located under the /div[@id="issue_overview"]/p
only.
Instead, assuming you are using the lxml.html
package, use .text_content()
method:
Returns the text content of the element, including the text content of its children, with no markup.
tree.xpath('//div[@id="issue_overview"]')[0].text_content()
Demo:
>>> from lxml.html import fromstring
>>> import requests
>>>
>>> url = "https://alas.aws.amazon.com/ALAS-2015-530.html"
>>> response = requests.get(url)
>>> root = fromstring(response.content)
>>> overview = root.xpath('//div[@id="issue_overview"]')[0].text_content().replace("Issue Overview:", "").strip()
>>> print(overview)
As discussed in an upstream announcement, Ruby's OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as CVE-2014-1492 .
Or, if you need to get the markup of the element - use the tostring()
method:
>>> from lxml.html import fromstring, tostring
>>> tostring(root.xpath('//div[@id="issue_overview"]/p')[0])
'<p>As discussed in <a href="https://www.ruby-lang.org/en/news/2015/04/13/ruby-openssl-hostname-matching-vulnerability/">an upstream announcement</a>, Ruby\'s OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as <a href="https://access.redhat.com/security/cve/CVE-2014-1492" target="_blank">CVE-2014-1492 <i class="icon-external-link"></i></a>.</p>\n '
And, after removing the i
elements:
>>> overview = root.xpath('//div[@id="issue_overview"]/p')[0]
>>> for i in overview.xpath(".//i"):
... i.getparent().remove(i)
...
>>> tostring(overview)
'<p>As discussed in <a href="https://www.ruby-lang.org/en/news/2015/04/13/ruby-openssl-hostname-matching-vulnerability/">an upstream announcement</a>, Ruby\'s OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as <a href="https://access.redhat.com/security/cve/CVE-2014-1492" target="_blank">CVE-2014-1492 </a>.</p>\n '