20

I have amended the code based on the solutions offered below by the great folks here; I now get the error shown below the code.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
from dmoz2.items import DmozItem

class DmozSpider(BaseSpider):
    name = "namastecopy2"
    allowed_domains = ["namastefoods.com"]
    start_urls = [
        "http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1",
        "http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('/html/body/div/div[2]/table/tr/td[2]/table/tr')
        items = []
        for site in sites:
            item = DmozItem()
            item['manufacturer'] = 'Namaste Foods'
            item['productname'] = site.select('td/h1/text()').extract()
            item['description'] = site.select('//*[@id="info-col"]/p[7]/strong/text()').extract()
            item['ingredients'] = site.select('td[1]/table/tr/td[2]/text()').extract()
            item['ninfo'] = site.select('td[2]/ul/li[3]/img/@src').extract()
            # insert code that will save the above image path for ninfo as an absolute path
            base_url = get_base_url(response)
            relative_url = site.select('//*[@id="showImage"]/@src').extract()
            item['image_urls'] = urljoin_rfc(base_url, relative_url)
            items.append(item)
        return items

My items.py looks like this:

from scrapy.item import Item, Field

class DmozItem(Item):
    # define the fields for your item here like:
    productid = Field()
    manufacturer = Field()
    productname = Field()
    description = Field()
    ingredients = Field()
    ninfo = Field()
    imagename = Field()
    image_paths = Field()
    relative_images = Field()
    image_urls = Field()
    pass

I need the relative paths that the spider is getting for items['relative_images'] converted to absolute paths & saved in items['image_urls'] so that I can download the images from within this spider itself. For example, the relative_images path that the spider fetches is '../../files/images/small/8270-BrowniesHiResClip.jpg', this should be converted to 'http://namastefoods.com/files/images/small/8270-BrowniesHiResClip.jpg', & stored in items['image_urls']

I also will need the items['ninfo'] path to be stored as an absolute path.

Error when running the above code:

2011-06-28 17:18:11-0400 [scrapy] INFO: Scrapy 0.12.0.2541 started (bot: dmoz2)
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, CloseSpider
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled item pipelines: MyImagesPipeline
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-06-28 17:18:11-0400 [namastecopy2] INFO: Spider opened
2011-06-28 17:18:12-0400 [namastecopy2] DEBUG: Crawled (200) <GET http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12> (referer: None)
2011-06-28 17:18:12-0400 [namastecopy2] ERROR: Spider error processing <http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12> (referer: <None>)
    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 1137, in mainLoop
        self.runUntilCurrent()
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 757, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 243, in callback
        self._startRunCallbacks(result)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 312, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 328, in _runCallbacks
        self.result = callback(self.result, *args, **kw)
      File "/***/***/***/***/***/***/spiders/namaste_copy2.py", line 30, in parse
        item['image_urls'] = urljoin_rfc(base_url, relative_url)
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/url.py", line 37, in urljoin_rfc
        unicode_to_str(ref, encoding))
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/python.py", line 96, in unicode_to_str
        raise TypeError('unicode_to_str must receive a unicode or str object, got %s' % type(text).__name__)
    exceptions.TypeError: unicode_to_str must receive a unicode or str object, got list

2011-06-28 17:18:15-0400 [namastecopy2] DEBUG: Crawled (200) <GET http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1> (referer: None)
2011-06-28 17:18:15-0400 [namastecopy2] ERROR: Spider error processing <http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1> (referer: <None>)
    (traceback identical to the one above)

2011-06-28 17:18:15-0400 [namastecopy2] INFO: Closing spider (finished)
2011-06-28 17:18:15-0400 [namastecopy2] INFO: Spider closed (finished)

Thanks. -TM

warvariuc
user818190

5 Answers

21

From Scrapy docs:

def parse(self, response):
    # ... code omitted
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, self.parse)

That is, the response object has a method that does exactly this.
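For instance, `response.urljoin(url)` is roughly equivalent to resolving `url` against `response.url` with the stdlib `urljoin`; a sketch using one of the question's own URLs (`response_url` stands in for `response.url`):

```python
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

# Roughly what response.urljoin does: resolve a link against response.url.
response_url = ("http://www.namastefoods.com/products/cgi-bin/"
                "products.cgi?Function=show&Category_Id=4&Id=1")
relative = "../../files/images/small/8270-BrowniesHiResClip.jpg"

absolute = urljoin(response_url, relative)
# -> http://www.namastefoods.com/files/images/small/8270-BrowniesHiResClip.jpg
```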

NefariousOctopus
20

What I do is:

import urlparse
...

def parse(self, response):
    ...
    urlparse.urljoin(response.url, extractedLink.strip())
    ...

Note the strip(), because I sometimes encounter strange links like:

<a href="
              /MID_BRAND_NEW!%c2%a0MID_70006_Google_Android_2.2_7%22%c2%a0Tablet_PC_Silver/a904326516.html
            ">MID BRAND NEW!&nbsp;MID 70006 Google Android 2.2 7"&nbsp;Tablet PC Silver</a>
warvariuc
  • 2
    Worth adding that urljoin() does not simply concatenate the URLs; rather, URL parts such as the netloc or path are overwritten. Thus `urljoin('http://www.myeshop.com/category/subcategory', '/category/subcategory/item001.php')` does not return `http://www.myeshop.com/category/subcategory/category/subcategory/item001.php` but the more sensible `http://www.myeshop.com/category/subcategory/item001.php`. – sumid Mar 07 '14 at 15:24
  • 2
    WARNING: for Python 3, according to the [doc](https://docs.python.org/2/library/urlparse.html): "The urlparse module is renamed to urllib.parse in Python 3. The 2to3 tool will automatically adapt imports when converting your sources to Python 3." – Rodrigo Laguna Apr 05 '18 at 00:18
  • In Python 3 it becomes: `import urllib.parse` and to use it `urllib.parse.urljoin(response.url, extractedLink.strip())` – 8bitme May 12 '19 at 13:54
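The points made in the comments above can be checked directly with the stdlib (Python 3 names; the eshop URL is just the commenter's example):

```python
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

base = "http://www.myeshop.com/category/subcategory"

# An absolute path in the link replaces the base path outright:
assert urljoin(base, "/category/subcategory/item001.php") == \
    "http://www.myeshop.com/category/subcategory/item001.php"

# A relative link resolves against the base URL's directory:
assert urljoin(base, "item001.php") == "http://www.myeshop.com/category/item001.php"

# strip() guards against hrefs padded with whitespace or newlines:
href = "\n              item001.php\n            "
assert urljoin(base, href.strip()) == "http://www.myeshop.com/category/item001.php"
```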
7
from scrapy.utils.response import get_base_url

base_url           = get_base_url(response)
relative_url       = site.select('//*[@id="showImage"]/@src').extract()
item['image_urls'] = [urljoin_rfc(base_url, ru) for ru in relative_url]

or you could extract just one item

base_url           = get_base_url(response)
relative_url       = site.select('//*[@id="showImage"]/@src').extract()[0]
item['image_urls'] = urljoin_rfc(base_url, relative_url)

The error occurred because you were passing a list instead of a str to the urljoin_rfc function.
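The list-vs-string issue can be seen without Scrapy at all; `extract()` hands back a list even for a single match, so either join every element or index into it (a sketch with the stdlib `urljoin` standing in for `urljoin_rfc`):

```python
from urllib.parse import urljoin  # standing in for urljoin_rfc here

base_url = "http://www.namastefoods.com/products/cgi-bin/products.cgi"
# What extract() returns: a list, even when only one node matched.
relative_url = ["../../files/images/small/8270-BrowniesHiResClip.jpg"]

# Join each element of the list...
image_urls = [urljoin(base_url, ru) for ru in relative_url]
# ...or take just the first match:
first_image_url = urljoin(base_url, relative_url[0])
```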

user
  • Thanks @buffer. I tried your code above, & get the following errors:item['image_urls'] = urljoin_rfc(base_url, relative_url) File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/url.py", line 37, in urljoin_rfc unicode_to_str(ref, encoding)) File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/python.py", line 96, in unicode_to_str raise TypeError('unicode_to_str must receive a unicode or str object, got %s' % type(text).__name__) exceptions.TypeError: unicode_to_str must receive a unicode or str object, got list – user818190 Jun 28 '11 at 12:05
  • Can you post the code snippet that gave the error (update your question with the code)? You are passing an object that is neither string nor unicode, hence this error. Search for the error here http://dev.scrapy.org/browser/scrapy/utils/python.py?rev=1103 and you'll see what's causing it – user Jun 28 '11 at 13:17
  • Just updated my question, & have included the full error I am getting. Will also check out the link you have included above. Thanks. – user818190 Jun 28 '11 at 21:31
4

Several notes:

items = []
for site in sites:
    item = DmozItem()
    item['manufacturer'] = 'Namaste Foods'
    ...
    items.append(item)
return items

I do it differently:

for site in sites:
    item = DmozItem()
    item['manufacturer'] = 'Namaste Foods'
    ...
    yield item

Then:

relative_url = site.select('//*[@id="showImage"]/@src').extract()
item['image_urls'] = urljoin_rfc(base_url, relative_url)

extract() always returns a list, because an xpath query always returns a list of selected nodes.

Do this:

relative_url = site.select('//*[@id="showImage"]/@src').extract()[0]
item['image_urls'] = urljoin_rfc(base_url, relative_url)
warvariuc
  • 1
    Don't forget that as of 0.14 urljoin_rfc is deprecated, since Pablo Hoffman (Scrapy developer) noted that urljoin from urlparse was sufficient. – Sjaak Trekhaak Dec 15 '11 at 08:58
0

A more general approach to obtaining an absolute URL would be:

import urlparse

def abs_url(url, response):
    """Return absolute link."""
    base = response.xpath('//head/base/@href').extract()
    if base:
        base = base[0]
    else:
        base = response.url
    return urlparse.urljoin(base, url)

This also works when a `<base>` element is present.

In your case, you'd use it like this:

def parse(self, response):
    # ...
    for site in sites:
        # ...
        image_urls = site.select('//*[@id="showImage"]/@src').extract()
        if image_urls:
            item['image_urls'] = abs_url(image_urls[0], response)
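The `<base>`-element behaviour can be sanity-checked with the stdlib alone (hypothetical URLs; Python 3 spelling of the import):

```python
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

page_url = "http://example.com/products/page.html"   # hypothetical page URL
base_href = "http://cdn.example.com/assets/"         # from <head><base href="...">

# Without a <base> element, links resolve against the page URL:
assert urljoin(page_url, "img/logo.png") == "http://example.com/products/img/logo.png"

# With a <base> element, links resolve against its href instead:
assert urljoin(base_href, "img/logo.png") == "http://cdn.example.com/assets/img/logo.png"
```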
wvengen