I am trying to scrape data from a list of URLs, for example http://basketball.realgm.com/international/league/12/French-LNB-Pro-A/teams, to pull all of the team names. Below is my spider. It runs through the URLs, but it does not extract any data.

from scrapy.spider import Spider
from scrapy.selector import HtmlXPathSelector
from teams.items import TeamsItem

class TeamsSpider(Spider):
    name = "teamcrawler"
    allowed_domains = ["basketball.realgm.com"]
    f = open("teamurls.txt")
    start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("/html/body/div[1]/div[2]/table/tbody/tr/td/div[2]/table/tbody/tr")
        items = []
        for title in titles:
            item = TeamsItem()
            item["URL"] = title.select("td[1]/a/@href").extract()
            item["Team"] = title.select("td[1]/a/text()").extract()
            items.append(item)
        print items
        return items
RoryC

1 Answer

Your XPath is failing because of the tbody steps in it. Browsers (like Firefox and Chrome) add a tbody node to tables when it isn't present in the page source, so an XPath copied from the browser's inspector may not match the HTML that Scrapy actually downloads.
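To illustrate the point, here is a minimal sketch using only Python's standard library rather than Scrapy. The markup in SOURCE is a made-up stand-in for a page whose source has no tbody (it is not the real realgm.com HTML), and xml.etree.ElementTree supports only a subset of XPath, so treat this as a demonstration of the mismatch, not as the spider's actual code.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a table as a server might send it, with no <tbody>.
SOURCE = """
<html><body>
  <table>
    <tr><td><a href="/team/1">Team A</a></td></tr>
    <tr><td><a href="/team/2">Team B</a></td></tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(SOURCE)  # root is the <html> element

# Path as a browser inspector would report it (tbody inserted by the browser).
rows_with_tbody = root.findall("./body/table/tbody/tr")

# Same path with the tbody step dropped, matching the raw source.
rows_without_tbody = root.findall("./body/table/tr")

# Pull the link target and team name from the first cell of each row.
teams = [
    (row.find("td[1]/a").get("href"), row.find("td[1]/a").text)
    for row in rows_without_tbody
]

print(len(rows_with_tbody), len(rows_without_tbody))
print(teams)
```

The path that includes tbody matches no rows, while the one without it matches both. An empty result like this would be consistent with a spider that visits every URL but yields no items.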

Since the tbody node might or might not be in the source of the page, you can use scrapy shell to debug interactively against the HTML that Scrapy actually receives. Usage: scrapy shell 'http://www.example.org'
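If the shell shows that the source really has no tbody, one option is to drop those steps from the browser-copied expression. The helper below is a hypothetical convenience (strip_tbody is not part of Scrapy), and it assumes the rest of the path is otherwise correct for the raw source, which still needs checking in the shell.

```python
import re


def strip_tbody(xpath):
    """Remove every tbody step (with an optional [n] index) from an XPath string."""
    return re.sub(r"/tbody(\[\d+\])?(?=/|$)", "", xpath)


# The expression from the question, as copied from the browser.
browser_xpath = "/html/body/div[1]/div[2]/table/tbody/tr/td/div[2]/table/tbody/tr"

source_xpath = strip_tbody(browser_xpath)
print(source_xpath)
```

The resulting string can then be tried in the shell and, if it returns rows, used in the spider's hxs.select(...) call in place of the original.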

Related question: Parsing HTML with XPath, Python and Scrapy

Community
PlasmaSauna