I am using Scrapy to extract data about musical concerts from websites. At least one website I'm working with places a p element inside an h1 element, which is invalid HTML according to the W3C (see "Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?"). I need to extract the text within the p element nevertheless, and cannot figure out how.

I have read the documentation and looked around for example uses, but am relatively new to Scrapy. I understand the solution has something to do with setting the Selector type to "xml" rather than "html" in order to recognize any XML tree, but for the life of me I cannot figure out how or where to do that in this instance.

For example, a website has the following HTML:

<h1 class="performance-title">
<p>Bernard Haitink conducts Brahms and&nbsp;Dvořák featuring pianist     Emanuel Ax
</p>
</h1>

I have made an item called Concert() that has a value called 'title'. In my item loader, I use:

def parse_item(self, response):       
    thisconcert = ItemLoader(item=Concert(), response=response)
    thisconcert.add_xpath('title','//h1[@class="performance-title"]/p/text()')

    return thisconcert.load_item()

This returns, in item['title'], a list of unicode strings that does not include the text inside the p element, such as:

['\n                 ', '\n                 ', '\n                ']

I understand why, but I don't know how to get around it. I have also tried things like:

from scrapy import Selector

def parse_item(self, response):  

    s = Selector(text=' '.join(response.xpath('.//section[@id="performers"]/text()').extract()), type='xml')

What am I doing wrong here, and how can I parse HTML that contains this problem (p within h1)?

I have referenced the information concerning this specific issue at "Behavior of the scrapy xpath selector on h1-h6 tags", but it does not provide a complete solution that can be applied to a spider, only an example within a session using a given text string.

NFB

2 Answers

That was quite baffling, and to be frank I still do not fully understand why it happens. It turns out that the <p> tag, which should be contained within the <h1> tag, is not inside it in the parsed response. Fetching the page with curl shows markup of the form <h1><p> </p></h1>, whereas the response Scrapy obtains from the site shows:

<h1 class="performance-title">\n</h1>
<p>Bernard Haitink conducts Brahms and\xa0Dvo\u0159\xe1k featuring\npianist Emanuel Ax
</p>

As I mentioned, I have my suspicions about the cause but nothing concrete. Anyway, the XPath for getting the text inside the <p> tag is therefore:

response.xpath('//h1[@class="performance-title"]/following-sibling::p/text()').extract()

This uses <h1 class="performance-title"> as a landmark and selects its sibling <p> tag.

Kaushik NP
//*[@id="content"]/section/article/section[2]/h1/p/text()
mtt2p
  • Could you show me in what context you are getting this to work? When put into my scrapy code above, this does not return any item['title'] at all. I have referenced: https://stackoverflow.com/questions/19779519/is-it-valid-to-have-paragraph-elements-inside-of-a-heading-tag-in-html5-p-insid and http://techqa.info/programming/question/41063971/Behavior-of-the-scrapy-xpath-selector-on-h1-h6-tags – NFB Jun 04 '17 at 17:15
  • Revised the question to include these links. – NFB Jun 04 '17 at 17:22