Yeah, everytime I'm grabbing a link I have to use the method urlparse.urljoin.
def parse(self, response):
hxs = HtmlXPathSelector(response)
urls = hxs.select('//a[contains(@href, "content")]/@href').extract() ## only grab url with content in url name
for i in urls:
yield Request(urlparse.urljoin(response.url, i[1:]),callback=self.parse_url)
I imagine your trying to grab the entire url to parse it right? if that's the case a simple two method system would work on a basespider. the parse method finds the link, sends it to the parse_url method which outputs what you're extracting to the pipeline
def parse(self, response):
hxs = HtmlXPathSelector(response)
urls = hxs.select('//a[contains(@href, "content")]/@href').extract() ## only grab url with content in url name
for i in urls:
yield Request(urlparse.urljoin(response.url, i[1:]),callback=self.parse_url)
def parse_url(self, response):
hxs = HtmlXPathSelector(response)
item = ZipgrabberItem()
item['zip'] = hxs.select("//div[contains(@class,'odd')]/text()").extract() ## this grabs it
return item