0

I have an item, item['link'], of this form:

item['link'] = site.select('div[2]/div/h3/a/@href').extract()

The links it extracts are of this form:

'link': [u'/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189'],

I want them to be this way:

'link': [u'http://www.youtube.com/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189'],

Is it possible to do this directly in Scrapy, instead of re-editing the list afterwards?

RocketDonkey
CEFEGE

4 Answers

2

Yeah, every time I grab a link I have to use urlparse.urljoin.

import urlparse  # Python 2 module
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # Only grab hrefs that contain "content".
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()
    for url in urls:
        yield Request(urlparse.urljoin(response.url, url[1:]), callback=self.parse_url)

I imagine you're trying to grab the entire URL so you can parse it, right? If that's the case, a simple two-method system works on a BaseSpider: the parse method finds the links and sends each one to the parse_url method, which outputs what you're extracting to the pipeline.

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # Only grab hrefs that contain "content".
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()
    for url in urls:
        # url[1:] drops the first character of the href (a leading "/" for
        # root-relative links), so it is resolved relative to the current page.
        yield Request(urlparse.urljoin(response.url, url[1:]), callback=self.parse_url)

def parse_url(self, response):
    hxs = HtmlXPathSelector(response)
    item = ZipgrabberItem()  # project-specific Item class
    # Grab the text of every div whose class contains "odd".
    item['zip'] = hxs.select("//div[contains(@class,'odd')]/text()").extract()
    return item
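
One thing worth checking with urljoin is whether the href keeps its leading slash, because that changes how it is resolved: a root-relative href is joined to the site root, while a path-relative one is joined to the directory of the current page. A small sketch of the difference, using Python 3's urllib.parse.urljoin (the same function that lives in urlparse on Python 2); the page URL and href are made up for illustration:

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

page_url = "http://example.com/section/page.html"  # hypothetical page URL
href = "/content/item1"                             # hypothetical extracted href

# With the leading slash, the href replaces the whole path of page_url.
root_relative = urljoin(page_url, href)

# Without it, the href is resolved inside the page's directory (/section/).
path_relative = urljoin(page_url, href[1:])

print(root_relative)  # http://example.com/content/item1
print(path_relative)  # http://example.com/section/content/item1
```

If the two results differ for your site, pick the form that matches where the links actually live.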
Chris Hawkes
1

If you really need link as a list, this will work for you:

item['link'] = ['http://www.youtube.com%s'%a for a in site.select('div[2]/div/h3/a/@href').extract()]
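
A possible variation on the same list comprehension, sketched with urljoin instead of string formatting so that hrefs which are already absolute are left alone. It uses Python 3's urllib.parse (urlparse on Python 2). The `hrefs` list stands in for the result of `.extract()`, and its second entry is a made-up example, not something from the page in the question:

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

BASE_URL = "http://www.youtube.com"

# Stand-in for site.select('div[2]/div/h3/a/@href').extract()
hrefs = [
    "/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189",
    "http://www.youtube.com/watch?v=abc",  # hypothetical, already absolute
]

# Relative hrefs get the base prepended; absolute ones pass through unchanged.
links = [urljoin(BASE_URL, href) for href in hrefs]
print(links)
```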
akhter wahab
1

No, Scrapy doesn't do this for you. According to the standard, URLs in HTML may be absolute or relative. Scrapy treats your extracted URLs as plain data; it cannot know that they are URLs, so you must join relative URLs with the base URL yourself.

You need urlparse.urljoin:

Python 2.7.3 (default, Sep 26 2012, 21:51:14) 
>>> import urlparse
>>> urlparse.urljoin('http://www.youtube.com', '/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189')
'http://www.youtube.com/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189'
>>> 
warvariuc
1

Use response.urljoin(). There is no method that extracts absolute URLs directly. You have to call response.urljoin() on each extracted link and create a second parse function that is invoked through the request's callback. In this second parse function you can extract whatever you wish.
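
A rough sketch of the first half of that flow. In a real spider the `response` argument is Scrapy's response object, whose `urljoin()` resolves a link against the page URL. The `StubResponse` class below is only a stand-in so the sketch runs without Scrapy installed, and its `hrefs` attribute, the function names, and the playlist URL are assumptions made for illustration:

```python
from urllib.parse import urljoin


class StubResponse:
    """Stand-in for a Scrapy response: only `url` and `urljoin()` are modelled."""

    def __init__(self, url, hrefs):
        self.url = url
        self.hrefs = hrefs  # pretend these came from an XPath/CSS query

    def urljoin(self, href):
        # Resolve href against this page's URL.
        return urljoin(self.url, href)


def parse(response):
    """First callback: turn every extracted href into an absolute URL."""
    for href in response.hrefs:
        # In a real spider this line would be:
        #   yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)
        yield response.urljoin(href)


page = StubResponse(
    "http://www.youtube.com/playlist?list=SP3DB54B154E6D121D",  # hypothetical page
    ["/watch?v=1PTw-uy6LA0&list=SP3DB54B154E6D121D&index=189"],
)
absolute_urls = list(parse(page))
print(absolute_urls)
```

The second parse function (called `parse_detail` in the comment, a hypothetical name) would then receive the response for each absolute URL and fill in the item fields.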