
Related to but different from a previous question of mine, Extracting p within h1 with Python/Scrapy, I've come across a situation where Scrapy (for Python) will not extract a span tag within an h4 tag.

Example HTML is:

<div class="event-specifics">
 <div class="event-location">
  <h3>   Gourmet Matinee </h3>
  <h4>
   <span id="spanEventDetailPerformanceLocation">Knight Grove</span>
  </h4>
 </div>
</div>

I'm attempting to grab the text "Knight Grove" within the span tags. When using scrapy shell on the command line,

response.xpath('.//div[@class="event-location"]//span//text()').extract()

returns:

['Knight Grove']

And

response.xpath('.//div[@class="event-location"]/node()')

returns the entire node, viz:

['\n                    ', '<h3>\n                        Gourmet Matinee</h3>', '\n                    ', '<h4><span id="spanEventDetailPerformanceLocation"><p>Knight Grove</p></span></h4>', '\n                ']

But when the same XPath is run within a spider, nothing is returned. Take for instance the following spider code, written to scrape the page from which the above sample HTML was taken, https://www.clevelandorchestra.com/17-blossom--summer/1718-gourmet-matinees/2017-07-11-gourmet-matinee/. (Some of the code is removed since it doesn't relate to the question):

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from concertscraper.items import Concert

class ClevelandOrchestra(CrawlSpider):
    name = 'clev2'
    allowed_domains = ['clevelandorchestra.com']
    start_urls = ['https://www.clevelandorchestra.com/']

    rules = (
        Rule(LinkExtractor(allow=''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        thisconcert = ItemLoader(item=Concert(), response=response)
        for concert in response.xpath('.//div[@class="event-wrap"]'):
            thisconcert.add_xpath('location', './/div[@class="event-location"]//span//text()')
        return thisconcert.load_item()

This returns no item['location']. I've also tried:

thisconcert.add_xpath('location','.//div[@class="event-location"]/node()')

Unlike the p-within-h1 case in the question linked above, span tags are permitted within h tags in HTML, unless I am mistaken?

For clarity, the 'location' field is defined within the Concert() object, and I have all pipelines disabled in order to troubleshoot.

Is it possible that a span within an h4 is in some way invalid HTML? If not, what could be causing this?

Interestingly, going about the same task using add_css(), like this:

thisconcert.add_css('location','.event-location')

yields a node with the span tags present but the internal text missing:

['<div class="event-location">\r\n'
          '                    <h3>\r\n'
          '                        BLOSSOM MUSIC FESTIVAL </h3>\r\n'
          '                    <h4><span '
          'id="spanEventDetailPerformanceLocation"></span></h4>\r\n'
          '                </div>']

To confirm this is not a duplicate: it is true that in this particular example there is a p tag inside the span tag, which is inside the h4 tag; however, the same behavior occurs when there is no p tag involved, such as at: https://www.clevelandorchestra.com/1718-concerts-pdps/1718-rental-concerts/1718-rentals-other/2017-07-21-cooper-competition/?performanceNumber=16195.

NFB

  • The span you refer to seems to be empty in your example URL. The text node thus doesn't exist, and so it returns nothing. – Dom Weldon Jul 01 '17 at 00:27
  • Any further details you can provide on what you are seeing? For me, in FirePath the XPath is isolating exactly the text I'm trying to extract at that URL. The span node itself contains a p node which contains this text -- which should be captured by the double-slash before text(). – NFB Jul 01 '17 at 00:32
  • Scrapy is not a web browser, and so it doesn't execute JavaScript etc. to change the page and render it like a web browser does. It seems that a script on the page must populate the value of that span when you load it in a web browser (hence why your XPath browser extension works), but Scrapy doesn't run the script (thus, it doesn't find a text node inside the span and so fails). – Dom Weldon Jul 01 '17 at 00:45
  • It seems there's an XHR request when the page runs to a site to populate the datepicker; you could find / work out the URL it calls to get the location as part of a JSON object - probably the best solution. – Dom Weldon Jul 01 '17 at 00:46
  • To test out scrapes in the future, run `scrapy shell` in the terminal, as it will emulate what the crawler does. – Dom Weldon Jul 01 '17 at 00:47
  • You can get all performance details (name, date, program, url, etc.) if you make a `POST` request to `https://www.clevelandorchestra.com/Services/PerformanceService.asmx/GetToolTipPerformancesForCalendar` with the request body `{"startDate":"2017-06-30T21:00:00.000Z","endDate":"2017-12-31T21:00:00.000Z"}`. This request returns a response in JSON format with all the data. – vold Jul 01 '17 at 13:25
  • Thanks to both of you. @vold, can you tell me how you figured out how to do this? – NFB Jul 01 '17 at 14:48
  • @NFB If you open the browser dev tools on the `Network` tab and refresh the page, you can see an ajax request http://icecream.me/ad51130bf218ab24ba90325a5994070a Now just construct this request with `Scrapy` or `requests`. – vold Jul 01 '17 at 15:01
  • Thanks. I've gotten a response, but it's not a JSON object - it appears to be HTML in UTF-8, and when trying to use json() to parse it, I get "json.decoder.JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)". I realize we are way off the subject of the original question/title (HTML tags), and probably now a duplicate (?), but if one of you has a moment to tell me how to get the rest of the way, I would be much obliged. I do see the JSON object in the Network tab, just don't know how to get it in a response. – NFB Jul 02 '17 at 00:53
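As a side note on the `JSONDecodeError` raised in the last comment: the error message itself points at the fix, i.e. decoding the raw bytes with `utf-8-sig` (which strips the byte-order mark) before parsing. A minimal sketch; the sample bytes here are illustrative, not the real service response:

```python
import json

# The .asmx endpoint returns JSON prefixed with a UTF-8 byte-order mark
# (b'\xef\xbb\xbf'), which json.loads() rejects with "Unexpected UTF-8 BOM".
# Decoding with 'utf-8-sig' removes the BOM before parsing.
raw = b'\xef\xbb\xbf{"d": [{"performanceName": "Gourmet Matinee"}]}'

data = json.loads(raw.decode('utf-8-sig'))
print(data['d'][0]['performanceName'])  # Gourmet Matinee
```

With `requests`, the equivalent is `json.loads(resp.content.decode('utf-8-sig'))` in place of `resp.json()`.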

1 Answer


This content is loaded via an Ajax call. In order to get the data, you need to make a similar POST request, and don't forget to add a header with the content type, headers = {'content-type': "application/json"}; you will then get JSON in the response.

import requests

url = "https://www.clevelandorchestra.com/Services/PerformanceService.asmx/GetToolTipPerformancesForCalendar"
payload = {"startDate": "2017-06-30T21:00:00.000Z", "endDate": "2017-12-31T21:00:00.000Z"}
headers = {'content-type': "application/json"}

json_response = requests.post(url, json=payload, headers=headers).json()
for performance in json_response['d']:
    print(performance["performanceName"], performance["dateString"])

# Star-Spangled Spectacular Friday, June 30, 2017
# Blossom: Tchaikovskys Spectacular 1812 Overture Saturday, July 1, 2017
# Blossom: Tchaikovskys Spectacular 1812 Overture Sunday, July 2, 2017
# Blossom: A Salute to America Monday, July 3, 2017
# Blossom: A Salute to America Tuesday, July 4, 2017
vold
  • This works perfectly, thank you. I've changed the title of the question to reflect the outcome and answer. – NFB Jul 02 '17 at 15:06
  • No problem, I'm glad I could help you. I did my example with `requests`; if you want to use Scrapy, you can use code from [this question](https://stackoverflow.com/questions/30342243/send-post-request-in-scrapy). – vold Jul 02 '17 at 15:22