I'm trying to scrape the schedule on this page: http://stats.swehockey.se/ScheduleAndResults/Schedule/3940

with this code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class SchemaSpider(BaseSpider):
    name = "schema"
    allowed_domains = ["http://stats.swehockey.se/"]
    start_urls = [
        "http://stats.swehockey.se/ScheduleAndResults/Schedule/3940"
    ]

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    rows = hxs.select('//table[@class="tblContent"]/tbody/tr')

    for row in rows:
        date = row.select('/td[1]/div/span/text()').extract()
        teams = row.select('/td[2]/text()').extract()

        print date, teams

But I can't get it to work. What am I doing wrong? I've been trying to figure it out myself for a couple of hours now, but I have no idea why my XPath doesn't work properly.

alecxe

1 Answer

Two problems:

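  • tbody is a tag that browsers add when they build the DOM. It isn't in the raw HTML that Scrapy downloads, so an XPath containing tbody matches nothing.

  • The XPaths for date and teams weren't right: use relative XPaths (.//) inside the loop, because a path starting with / is evaluated from the document root rather than from the current row. The td indexes were also wrong: they should be 2 and 3 instead of 1 and 2.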
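
To see both points in isolation, here is a minimal sketch that does not use Scrapy at all. It parses a made-up table fragment with the standard library's xml.etree.ElementTree (Python 3). The markup, dates and team names are placeholders I invented to mimic the shape of the page; they are not copied from the real site, and ElementTree only supports a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like raw server HTML: the rows sit directly
# under <table>, with no <tbody>. Dates and team names are placeholders.
RAW_HTML = """
<html><body>
<table class="tblContent">
  <tr><td>1</td><td><div><span>2013-09-14</span></div></td><td>Team A - Team B</td></tr>
  <tr><td>2</td><td><div><span>2013-09-15</span></div></td><td>Team C - Team D</td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(RAW_HTML)

# A path that expects <tbody> finds nothing in the raw markup.
with_tbody = root.findall(".//table[@class='tblContent']/tbody/tr")

# Dropping <tbody> from the path matches the rows.
without_tbody = root.findall(".//table[@class='tblContent']/tr")

rows = []
for row in without_tbody:
    # Paths starting with "." are evaluated relative to the current row.
    date = row.findtext("./td[2]/div/span")
    teams = row.findtext("./td[3]")
    rows.append((date, teams))

print(len(with_tbody), len(without_tbody))
print(rows)
```

The first path returns an empty list because the fragment has no tbody element, which is the same reason the original selector returned no rows. The loop then reads cells 2 and 3 relative to each row, mirroring the .//td[2] and .//td[3] selectors.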

Here's the whole code with some modifications (working):

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class SchemaItem(Item):
    date = Field()
    teams = Field()


class SchemaSpider(BaseSpider):
    name = "schema"
    allowed_domains = ["stats.swehockey.se"]  # domain names only, not full URLs
    start_urls = [
        "http://stats.swehockey.se/ScheduleAndResults/Schedule/3940"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@class="tblContent"]/tr')

        for row in rows:
            item = SchemaItem()
            item['date'] = row.select('.//td[2]/div/span/text()').extract()
            item['teams'] = row.select('.//td[3]/text()').extract()

            yield item

Hope that helps.

alecxe
  • Thanks a lot! I'm new to both Python and Scrapy, so I guess I have a couple of things left to figure out. What's next is to break out the date and teams into a Google Calendar format and filter it to only add home games of AIK or Djurgårdens IF. Would you mind helping me with that so I can have an example to look at in the future? – user2624679 Sep 13 '13 at 21:01
  • You are welcome. Sure, consider asking a separate question, but be sure you are specific, see [how-to-ask](http://stackoverflow.com/help/how-to-ask). – alecxe Sep 13 '13 at 21:07
  • Ok! I've made a new question (http://stackoverflow.com/questions/18795387/finish-my-novice-project-to-let-me-learn-from-the-example). I would gladly accept your help if you have the time. Cheers! – user2624679 Sep 13 '13 at 21:16
  • Regarding the `<tbody>` tags: http://stackoverflow.com/questions/18241029/why-does-my-xpath-query-scraping-html-tables-only-work-in-firebug-but-not-the – Jens Erat Sep 13 '13 at 22:02