I'm working on Scrapy for the first time and I can't get this to return anything. Can someone help me understand what I'm doing wrong?

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from idcode.items import StatuteItem

class IdCodeSpider(BaseSpider):
  name = "idcode"
  allowed_domains = ["idaho.gov"]
  start_urls = ["http://legislature.idaho.gov/idstat/Title1/T1CH1SECT1-101.htm"]

  def parse(self, response):
    hxs = HtmlXPathSelector(response)
    item = StatuteItem()
    item['title'] = hxs.select("//table/tbody/tr[1]/td[2]/div[2]/div[1]/div[1]/text()").extract()
    return item

I know everything else in my project is working because if I add item['title'] = "test" above return item it returns "test". So I must have something wrong with my XPath, but I tested that in the Chrome Developer Console and it's working there.

Splendor
  • You should give us also the HTML code to verify your xpath – Arup Rakshit Sep 29 '13 at 06:37
  • The url is http://legislature.idaho.gov/idstat/Title1/T1CH1SECT1-101.htm – Splendor Sep 29 '13 at 06:49
  • Duplicate of [Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?](http://stackoverflow.com/questions/18241029/why-does-my-xpath-query-scraping-html-tables-only-work-in-firebug-but-not-the). Additionally, that site has horribly bad markup, which gets parsed differently by different HTML-to-XML parsers. Try to construct the XPath manually, or dump the parsed XML and construct the XPath from that. Also, all you gave us was a path that does not work as intended; what part of the page do you need? – Jens Erat Sep 29 '13 at 10:17
  • I'm after the text inside `
    `.
    – Splendor Sep 29 '13 at 12:35

2 Answers

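Removing tbody from the XPath resolved the issue. Browsers such as Chrome add a tbody element to the DOM when the source markup omits it, so a path copied from the Developer Console may not match the raw HTML that Scrapy parses.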

item['title'] = hxs.select("//table/tr[1]/td[2]/div[2]/div[1]/div[1]/text()").extract()
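
To illustrate why the tbody step matters, here is a minimal sketch using only the standard library's ElementTree as a stand-in for Scrapy's selector. The `RAW_HTML` snippet and the `extract_texts` helper are hypothetical and heavily simplified; they are not the real page or part of Scrapy. The point is only that a path containing tbody finds nothing when the source has no tbody element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the downloaded page source.
# Note that the table has no <tbody>; a browser would add one to its DOM.
RAW_HTML = """
<html><body>
  <table>
    <tr>
      <td>left</td>
      <td><div>ignored</div><div><div><div>TITLE 1</div></div></div></td>
    </tr>
  </table>
</body></html>
"""


def extract_texts(markup, path):
    """Return the text of every element matched by an ElementTree path."""
    root = ET.fromstring(markup)
    return [el.text for el in root.findall(path)]


# Path as copied from the browser, including the tbody step.
with_tbody = extract_texts(
    RAW_HTML, ".//table/tbody/tr[1]/td[2]/div[2]/div[1]/div[1]"
)

# Same path with the tbody step removed.
without_tbody = extract_texts(
    RAW_HTML, ".//table/tr[1]/td[2]/div[2]/div[1]/div[1]"
)

print(with_tbody)
print(without_tbody)
```

When a path works in the browser but not in the spider, comparing it against the page's raw source (View Source) rather than the Developer Console's element tree is usually the quickest check.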

Splendor

If you only need the extracted content and don't want to write the scraper yourself, you can use the Goose project. It only extracts article text and media, but I have used it many times without any problems.

Here is the link:

https://github.com/grangier/python-goose

Tasos