
I am using the following with BeautifulSoup to fetch text from: https://alas.aws.amazon.com/ALAS-2015-530.html

description = " ".join(xpath_parse(tree, '//div[@id="issue_overview"]/p/text()')).replace('. ()', '.\n')

However, the content is stripped of all HTML tags. I get - "As discussed in , Ruby's OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as ."

My xpath_parse is simple:

    def xpath_parse(tree, xfilter):
        return tree.xpath(xfilter)

Can someone tell me why this is happening?

Metahuman
  • You're asking for a string created by concatenating text nodes. Text nodes aren't expected to have markup in them -- the markup is, well, all the **other** (non-text) element node types. – Charles Duffy May 17 '16 at 22:17
  • ...which is to say: Your XPath is giving you **exactly** what it's written to return. – Charles Duffy May 17 '16 at 22:18

1 Answer

That's because of the /text() step: it selects only the text nodes that are direct children of //div[@id="issue_overview"]/p, so the text and markup of nested elements such as the a links are skipped.

Instead, assuming you are using the lxml.html package, use the .text_content() method:

Returns the text content of the element, including the text content of its children, with no markup.

tree.xpath('//div[@id="issue_overview"]')[0].text_content()

Demo:

>>> from lxml.html import fromstring
>>> import requests
>>>
>>> url = "https://alas.aws.amazon.com/ALAS-2015-530.html"
>>> response = requests.get(url)
>>> root = fromstring(response.content)
>>> overview = root.xpath('//div[@id="issue_overview"]')[0].text_content().replace("Issue Overview:", "").strip()
>>> print(overview)
As discussed in an upstream announcement, Ruby's OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as CVE-2014-1492 .
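
The difference can also be reproduced without a network request. A minimal sketch, assuming a made-up HTML snippet shaped like the page's overview block (the snippet and its `href="#"` links are hypothetical, not copied from the page):

```python
from lxml.html import fromstring

# Hypothetical snippet, shaped like the overview block on the page.
html = (
    '<div id="issue_overview"><p>As discussed in '
    '<a href="#">an upstream announcement</a>, this can lead to bugs '
    'such as <a href="#">CVE-2014-1492</a>.</p></div>'
)
root = fromstring(html)

# /text() selects only the text nodes that are direct children of <p>,
# so the link texts are missing from this list.
direct = root.xpath('//div[@id="issue_overview"]/p/text()')
print(direct)

# text_content() also includes the text inside the nested <a> elements.
full = root.xpath('//div[@id="issue_overview"]/p')[0].text_content()
print(full)
```

Joining `direct` with spaces gives the same kind of gaps ("As discussed in , ...") reported in the question.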

Or, if you need the markup of the element, use the tostring() method:

>>> from lxml.html import fromstring, tostring
>>> tostring(root.xpath('//div[@id="issue_overview"]/p')[0])
'<p>As discussed in <a href="https://www.ruby-lang.org/en/news/2015/04/13/ruby-openssl-hostname-matching-vulnerability/">an upstream announcement</a>, Ruby\'s OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as <a href="https://access.redhat.com/security/cve/CVE-2014-1492" target="_blank">CVE-2014-1492 <i class="icon-external-link"></i></a>.</p>\n            '

And, after removing the i elements:

>>> overview = root.xpath('//div[@id="issue_overview"]/p')[0]
>>> for i in overview.xpath(".//i"):
...     i.getparent().remove(i)
... 
>>> tostring(overview)
'<p>As discussed in <a href="https://www.ruby-lang.org/en/news/2015/04/13/ruby-openssl-hostname-matching-vulnerability/">an upstream announcement</a>, Ruby\'s OpenSSL extension suffers a vulnerability through overly permissive matching of hostnames, which can lead to similar bugs such as <a href="https://access.redhat.com/security/cve/CVE-2014-1492" target="_blank">CVE-2014-1492 </a>.</p>\n            '
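
One caveat with `getparent().remove(i)`: it also discards any text that directly follows the removed element (its tail). That does not matter here because nothing follows the `i` inside the link, but on other markup it can silently lose text. A small sketch of an alternative, using the `drop_tree()` method of `lxml.html` elements on a hypothetical snippet where the icon is followed by text:

```python
from lxml.html import fromstring, tostring

# Hypothetical markup: the <i> icon is followed by tail text " (external)".
p = fromstring(
    '<p>See <a href="#">CVE-2014-1492<i class="icon"></i> (external)</a>.</p>'
)

for icon in p.xpath('.//i'):
    # drop_tree() removes the element and its children but keeps the
    # text that follows it, merging that text into the surrounding content.
    icon.drop_tree()

result = tostring(p, encoding="unicode")
print(result)
```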
alecxe
  • Exactly what @CharlesDuffy said. How do I get the markup? I got the text part alright. But it returns without all the markup. – Metahuman May 17 '16 at 22:29
  • @Metahuman ah, gotcha - I thought you were asking about the missing child node texts, like `an upstream announcement`. Okay, let me update the answer. – alecxe May 17 '16 at 22:30
  • @Metahuman updated! Is this what you were asking about? Thanks. – alecxe May 17 '16 at 22:31
  • It sure is! I adapted it like this - description = tostring(xpath_parse(tree, '//div[@id="issue_overview"]/p')[0]) – Metahuman May 17 '16 at 22:39
  • Sorry, I wasn't allowed to comment without unmarking this as the answer. How do I get rid of the `<i class="icon-external-link"></i>` here? – Metahuman May 17 '16 at 22:45
  • @Metahuman sure, remove the `i` element(s) before issuing `tostring()`, see http://stackoverflow.com/questions/7981840/how-to-remove-an-element-in-lxml. Hope that helps. – alecxe May 17 '16 at 22:48
  • Sorry, this does not work on large pages like this one: https://alas.aws.amazon.com/ALAS-2015-522.html. It only renders up to the first `</p>` tag. – Metahuman May 18 '16 at 04:17