Possible Duplicate:
Scrapy Modify Link to include Domain Name

I use this code to extract data from an HTML website and store it in an XML file, and it works well:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    site1 = hxs.select('/html/body/div/div[4]/div[3]/div/div/div[2]/div/ul/li')
    for site in site1:
        item = NewsItem()

        item ['title'] = site.select('a[2]/text()').extract()
        item ['image'] = site.select('a/img/@src').extract()
        item ['text'] = site.select('p/text()').extract()
        item ['link'] = site.select('a[2]/@href').extract()


        items.append(item)

    return items

The issue I am facing is that the website provides a relative link for ['image'], like this:

<img src="/a/small/72/72089be43654dc6d7215ec49f4be5a07_w200_h180.jpg"

while the full link should be like this:

<img src="http://www.aleqt.com/a/small/72/72089be43654dc6d7215ec49f4be5a07_w200_h180.jpg"

I want to know how to modify my code so that the missing domain is added to the link automatically.

user1909176
  • 115
  • 6
  • Do you really need a full URL? Why? Relative links have some advantages over full URLs inside web applications. – arkascha Jan 21 '13 at 09:41
  • Finding the domain name required to construct a full URL from a relative link is not a trivial task. If the settings of your web page do not provide that information, you have to detect it, and you will almost certainly run into situations where you guess wrong, because all you can rely on is the request information (which might come from an internal request, maybe to localhost) or the system's network configuration (which is often ambiguous). – arkascha Jan 21 '13 at 09:45

2 Answers

1

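You can try urljoin (from urlparse import urljoin), which resolves a relative link against response.url. Because extract() returns a list, join each element; the same approach works for ['image']:

item ['link'] = [urljoin(response.url, href) for href in site.select('a[2]/@href').extract()]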
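As a minimal sketch of how urljoin resolves the image path from the question, assuming the page was fetched from http://www.aleqt.com/ (the page URL and the helper name `absolutize` are illustrative assumptions, not part of the original code) and using the Python 3 import path:

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


def absolutize(page_url, links):
    """Resolve each extracted link against the URL of the page it came from."""
    return [urljoin(page_url, link) for link in links]


# extract() returns a list of strings, so the helper takes a list as well.
srcs = ["/a/small/72/72089be43654dc6d7215ec49f4be5a07_w200_h180.jpg"]
images = absolutize("http://www.aleqt.com/", srcs)
print(images)
```

Links that are already absolute pass through urljoin unchanged, so the helper should be safe to apply to every extracted value.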

Mirage
  • 30,868
  • 62
  • 166
  • 261
0

On the assumption that all such image links simply need "http://www.aleqt.com" added to them, you could just do something like this:

def parse(self, response):
    base_url = 'http://www.aleqt.com'
    hxs = HtmlXPathSelector(response)
    items = []
    site1 = hxs.select('/html/body/div/div[4]/div[3]/div/div/div[2]/div/ul/li')
    for site in site1:
        item = NewsItem()
        item ['title'] = site.select('a[2]/text()').extract()
        # extract() returns a list, so prefix each element rather than the list itself
        item ['image'] = [base_url + src for src in site.select('a/img/@src').extract()]
        item ['text'] = site.select('p/text()').extract()
        item ['link'] = [base_url + href for href in site.select('a[2]/@href').extract()]
        items.append(item)
    return items

Alternatively, if that exact URL is the only entry in the start_urls list, you could replace base_url with self.start_urls[0].
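If the start URL carries a path rather than just the domain, using it directly as a prefix would give wrong results. One possible sketch, assuming a hypothetical start URL and a hypothetical helper `origin_of`, keeps only the scheme and host before prefixing (Python 3 import path shown):

```python
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit


def origin_of(url):
    """Return only the scheme and host of a URL, dropping path and query."""
    parts = urlsplit(url)
    return "{0}://{1}".format(parts.scheme, parts.netloc)


start_urls = ["http://www.aleqt.com/some/section/"]  # hypothetical start URL
base_url = origin_of(start_urls[0])

# Prefix every root-relative path returned by extract().
srcs = ["/a/small/72/72089be43654dc6d7215ec49f4be5a07_w200_h180.jpg"]
images = [base_url + src for src in srcs]
```

Plain prefixing like this only suits root-relative paths (those starting with "/"); other relative forms would need a proper URL join.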

Talvalin
  • 7,789
  • 2
  • 30
  • 40