I am using Scrapy 1.2 with XPath (on Python 3.4) to read the Hot 100 chart on billboard.com. With the second option in the code below, every item contains all 100 titles. I understand that this is caused by the double slash (//), but I cannot make the first option work. How can I make sure that each item gets only the title of its own song?

class MusicalSpider(scrapy.Spider):
    name = "musicalspider"
    allowed_domains = ["billboard.com"]
    start_urls = ['http://www.billboard.com/charts/hot-100/']

    def parse(self, response):
        songs = response.xpath('//div[@class="chart-data js-chart-data"]/div[@class="container"]/article')

        for song in songs:
            item = MusicItem()
            # first option:
            item['title'] = song.xpath('div[@class="chart-row__primary"]/div[@class="chart-row__main-display"]/div[@class="chart-row__container"]/div[@class="chart-row__title"]/h2[@class="chart-row__song"]').extract()
            # second option:
            item['title'] = song.xpath('//h2[@class="chart-row__song"]').extract()

            yield item
Celebrian

1 Answer

This is quite a common problem. Start every XPath expression inside the loop with a dot; this makes it relative to the current song selector instead of the whole document:

for song in songs:
    item = MusicItem()
    # first option:
    item['title'] = song.xpath('.//div[@class="chart-row__primary"]/div[@class="chart-row__main-display"]/div[@class="chart-row__container"]/div[@class="chart-row__title"]/h2[@class="chart-row__song"]').extract()
    # second option:
    item['title'] = song.xpath('.//h2[@class="chart-row__song"]').extract()

    yield item

Here is the spider that works for me:

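import scrapy


# Assumed item definition; the question uses MusicItem without showing it.
class MusicItem(scrapy.Item):
    title = scrapy.Field()
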
class MusicalSpider(scrapy.Spider):
    name = "musicalspider"
    allowed_domains = ["billboard.com"]
    start_urls = ['http://www.billboard.com/charts/hot-100/']

    def parse(self, response):
        songs = response.xpath('//div[@class="chart-data js-chart-data"]/div[@class="container"]/article')

        for song in songs:
            item = MusicItem()
            item['title'] = song.xpath('.//h2[@class="chart-row__song"]/text()').extract_first()
            yield item

It produces the following items:

{'title': u'Black Beatles'}
{'title': u'Closer'}
...
{'title': u'Hold Up'}
{'title': u'Gangsta'}
alecxe
  • No, both of your options give me empty lists – Celebrian Nov 15 '16 at 16:17
  • @user7162453: Answer is essentially correct, but your XPaths may have additional problems. – kjhughes Nov 15 '16 at 16:17
  • @user7162453 the idea is correct, the xpath expressions themselves should probably be adjusted, let me test that. – alecxe Nov 15 '16 at 16:18
  • @user7162453 okay, I've updated the answer and posted the spider I'm currently running and the output items. Check it out, hope that helps. – alecxe Nov 15 '16 at 16:22