
I am trying to scrape data from an HTML table, the Texas Death Row list.

I am able to pull the existing data from the table using the spider script below:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from texasdeath.items import DeathItem

class DeathSpider(BaseSpider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//table/tbody/tr')
        for site in sites:
            item = DeathItem()
            item['firstName'] = site.select('td[5]/text()').extract()
            item['lastName'] = site.select('td[4]/text()').extract()
            item['Age'] = site.select('td[7]/text()').extract()
            item['Date'] = site.select('td[8]/text()').extract()
            item['Race'] = site.select('td[9]/text()').extract()
            item['County'] = site.select('td[10]/text()').extract()
            yield item

The problem is that there are also links in the table; I want to follow those links and append the data from the linked pages to my items.

The Scrapy Tutorial has a guide on following links to pull data from subsequent pages, but I am having trouble figuring out how to get the data from the main page as well as from the links within the table.

BernardL

1 Answer


Instead of yielding an item, yield a Request and pass the item inside meta. This is covered in the documentation here.

Sample implementation of a spider that follows the "Offender Information" links when they lead to the offender "details" page (sometimes a link leads to an image instead, in which case the spider yields what it has at that point):

from urlparse import urljoin  # Python 2; in Python 3 use: from urllib.parse import urljoin

import scrapy


class DeathItem(scrapy.Item):
    firstName = scrapy.Field()
    lastName = scrapy.Field()
    Age = scrapy.Field()
    Date = scrapy.Field()
    Race = scrapy.Field()
    County = scrapy.Field()
    Gender = scrapy.Field()


class DeathSpider(scrapy.Spider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]

    def parse(self, response):
        sites = response.xpath('//table/tbody/tr')
        for site in sites:
            item = DeathItem()

            item['firstName'] = site.xpath('td[5]/text()').extract()
            item['lastName'] = site.xpath('td[4]/text()').extract()
            item['Age'] = site.xpath('td[7]/text()').extract()
            item['Date'] = site.xpath('td[8]/text()').extract()
            item['Race'] = site.xpath('td[9]/text()').extract()
            item['County'] = site.xpath('td[10]/text()').extract()

            url = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
            if url.endswith("html"):
                yield scrapy.Request(url, meta={"item": item}, callback=self.parse_details)
            else:
                yield item

    def parse_details(self, response):
        item = response.meta["item"]
        item["Gender"] = response.xpath("//td[. = 'Gender']/following-sibling::td[1]/text()").extract()
        yield item
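The `urljoin` call in `parse` is what turns the relative `href` values from the table cells into absolute URLs the spider can request, and the `endswith("html")` check is how the spider tells a details page apart from an image link. A quick sanity check of that behavior (note that in Python 3, `urljoin` lives in `urllib.parse` rather than `urlparse`; the offender file names below are made up for illustration):

```python
# In Python 3, urljoin moved from the urlparse module to urllib.parse
from urllib.parse import urljoin

base = "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"

# A relative href from a table cell resolves against the listing page URL
detail = urljoin(base, "dr_info/smithjohn.html")  # hypothetical file name
print(detail)
# -> https://www.tdcj.state.tx.us/death_row/dr_info/smithjohn.html

# Links to images do not end in "html", so the spider yields the item as-is
image = urljoin(base, "dr_info/smithjohn.jpg")  # hypothetical file name
print(image.endswith("html"))
# -> False
```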
alecxe
  • I have also read that documentation. Sadly, I do not seem to understand the flow of it. In my code above, response seems to return the main page and item the fields in the page. But for the example in the documentation, I am unsure where to define the links and what the flow is. – BernardL May 16 '16 at 15:58
  • 1
    @user3288092 okay, no problem, updated with a sample spider. Check it out. – alecxe May 17 '16 at 01:41
  • @alecxe thanks a bunch, I was making my way to that solution; it makes sense that the request had to be created with urljoin. Anyway, I tried to extract another snippet from the other link using `url = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())`, and from the request using the XPath `response.xpath("//p[6]").extract()`. I am being returned with a 407, and some of the fields are populated but not with the data I expect. Any ideas? – BernardL May 17 '16 at 03:40
  • @alecxe Found it! Used XPath selectors to get the correct part. Do you think you will be able to help out here? https://stackoverflow.com/questions/37272407/scrapy-extracting-data-from-source-and-its-links I am trying to get data from both links, not one. Much appreciated! – BernardL May 17 '16 at 11:24