
First of all, this is my code:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess, CrawlerRunner
import scrapy
#from scrapy import log, signals
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
import datetime
from multiprocessing import Process, Queue
import os
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import re

#query=raw_input("Enter a product to search for= ")
query='apple'
query1=query.replace(" ", "+")  


class DmozItem(scrapy.Item):

    productname = scrapy.Field()
    product_link = scrapy.Field()
    current_price = scrapy.Field()
    mrp = scrapy.Field()
    offer = scrapy.Field()
    imageurl = scrapy.Field()
    outofstock_status = scrapy.Field()
    add = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://www.bestmercato.com"]


    def start_requests(self):

        task_urls = [
        ]
        i=1
        for i in range(1,2):
            temp=("https://www.bestmercato.com/index.php?route=product/search&search="+query1+"&page="+str(i))
            task_urls.append(temp)
            i=i+1

        start_urls = (task_urls)
#       p=len(task_urls)
        return [ Request(url = start_url) for start_url in start_urls ]


    def parse(self, response):
        items = []

        for sel in response.xpath('//html/body/div/div/div[4]/div/div/div[5]/div'):

            item = DmozItem()

            item['productname'] = str(sel.xpath('div[@class="product-thumb"]/div[@class="small_detail"]/div[@class="name"]/a/text()').extract())[3:-2]

            item['product_link'] = str(sel.xpath('div[@class="product-thumb"]/div[@class="small_detail"]/div[@class="name"]/a/@href').extract())[3:-2]

            point1 = sel.xpath('div[@class="product-thumb"]/div[@class="small_detail"]/div[4]').extract()
            point = str(sel.xpath('div[@class="product-thumb"]/div[@class="small_detail"]/div[4]/@class').extract())[3:-2]
            checker = "options" in point
            item['current_price'] = ""
            if checker:
                i=1
                p=1
                while i==1:
                    t = str(sel.xpath('div[@class="product-thumb"]/div[@class="small_detail"]/div[4]/div/select/option['+str(p)+']/text()').extract())[3:-2]
                    #print t        
                    if 'Rs' not in t:
                        i = 2
                    elif 'Rs' in t:
                        i = 1
                    t= " ".join(t)
                    s = t.translate(None, '\ t')[:-2]
                    item['current_price'] = item['current_price'] + ' ; ' + s
                    p = p+1
                item['current_price'] = item['current_price'][3:-3]

            else:
                item['current_price'] = 'Rs. ' + str(sel.xpath('div[@class="product-thumb"]/div[@class="small_detail"]/div[not (@class="name") or not(@class="description") or not(@class="qty") or not(@class="box_btn_icon")]/text()').extract())[46:-169]
                re.findall(r"[-+]?\d*\.\d+|\d+", item["current_price"])

            try:
                test1 = str(sel.xpath('div/div[2]/div[3]/span[1]/text()').extract())[3:-2]
                _digits = re.compile('\d')
                if bool(_digits.search(test1)):
                    print 'hi'
                    test1=test1[:2]+'. '+test1[3:]
                    item['mrp'] = test1
                    #item['mrp'][2:2]='.'
                    test2 = str(sel.xpath('div/div[2]/div[3]/span[2]/text()').extract())[3:-2]
                    test2=test2[:2]+'. '+test2[3:]
                    item['current_price']=test2

                else:
                    item['mrp'] = item['current_price']                 
            except:
                item['mrp'] = item['current_price']

            item['offer'] = 'No additional offer available'

            item['imageurl'] = str(sel.xpath('div[@class="product-thumb"]/div[@class="image"]/a[not (@class="sft_quickshop_icon")]/img[@class="img-responsive"]/@src').extract())[3:-2]

            item['outofstock_status'] = str('In Stock')

            request = Request(str(item['product_link']),callback=self.parse2, dont_filter=True)
            request.meta['item'] = item
#           print item
            items.append(item)
            return request

        print (items)

    def parse2(self, response):

        item = response.meta['item']
        item['add'] = response.url
        return item

spider1 = DmozSpider()
settings = Settings()
settings.set("PROJECT", "dmoz")
settings.set("CONCURRENT_REQUESTS" , 100)
#)
#settings.set( "DEPTH_PRIORITY" , 1)
#settings.set("SCHEDULER_DISK_QUEUE" , "scrapy.squeues.PickleFifoDiskQueue")
#settings.set( "SCHEDULER_MEMORY_QUEUE" , "scrapy.squeues.FifoMemoryQueue")
crawler = CrawlerProcess(settings)
crawler.crawl(spider1)
crawler.start()

Now, these are the issues that I am facing.

1. There are numerous divs that match this xpath - '//html/body/div/div/div[4]/div/div/div[5]/div'. However, the above code scrapes the contents of only the first div, i.e., the one with the xpath 'html/body/div/div/div[4]/div/div/div[5]/div[1]', and not all of them.

The moment I comment out these three lines, the scraper scrapes everything, but then obviously I am not able to add the 'add' field to the item:

request = Request(str(item['product_link']),callback=self.parse2, dont_filter=True)
request.meta['item'] = item
return request

So, I want to scrape all the divs, and also have the 'add' field in my item class (notice the class DmozItem). How do I do that? Please give corrected code for my SPECIFIC case; that would be best!

2. Secondly, as I said, when I comment out the three lines mentioned above, the program scrapes everything in a time close to 5 seconds (around 4.9 seconds).

But as soon as I un-comment those 3 lines (again, the ones mentioned above), the program's run-time increases drastically, to close to 9 seconds (around 8.8 - 8.9 seconds). Why does this happen? Is it because of dont_filter=True? Please suggest ways to overcome this, as the run-time could prove to be a very big problem for me. Also, can I somehow decrease the initial time of 5 seconds (around 4.9)?


1 Answer


Use `html/body/div/div/div[4]/div/div/div[5]//div` to get all the divs after div[5].

EDIT: The correct xpath is `//html/body/div/div/div[4]/div/div/div[5]/div`, which gave all the divs after div[5]. The one mentioned previously gave multiple errors!

If you use a return statement inside the loop, you end execution of the whole method. So if you enable those three lines, you end the execution of your method (and the for loop) after the first element.

This means you should yield your request instead of returning it.
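For example, here is a minimal sketch of how the end of your parse loop could look with yield (the field-extraction lines are elided; it assumes the rest of your method stays exactly as it is):

def parse(self, response):
    for sel in response.xpath('//html/body/div/div/div[4]/div/div/div[5]/div'):
        item = DmozItem()
        # ... fill in the other item fields exactly as you already do ...
        item['product_link'] = str(sel.xpath('div[@class="product-thumb"]/div[@class="small_detail"]/div[@class="name"]/a/@href').extract())[3:-2]
        request = Request(item['product_link'], callback=self.parse2, dont_filter=True)
        request.meta['item'] = item
        yield request  # yield keeps the loop running; return would stop after the first div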

  • Actually, your answer did solve the problem for me. But what happens now is that my program first extracts the data for all the products in the main parse() function and prints them, and after that it goes one by one through the function parse2() and displays those results. But as it goes one by one, the program pauses a lot. :( That leads to a severe increase in the run-time. :( The run-time increased to a massive 39 seconds. How do I overcome that? That is a massive problem for me. :( So, please help me with that. :) – Ashutosh Saboo Jul 13 '15 at 05:28
  • Can I reduce the huge run-time that is occurring? Please, someone help. It's a matter of concern. @GHajba, or anyone else, could also help! – Ashutosh Saboo Jul 13 '15 at 06:05
  • Well, because you `yield` new requests, Scrapy has extra work to do. I do not know how many sites you are parsing, but if there are many it can impact the performance of the scraper even with caching enabled. And there are some other performance tuning methods you can use to speed up the overall performance of the app, not just that one part (see the settings sketch after these comments). – GHajba Jul 13 '15 at 06:18
  • Ohh... Is there some material available online about those performance tuning methods? Or could you briefly elaborate on them? This is one site that I scraped; I am planning to scrape around 40 of those and run them from a single script, so that would cause a huge problem, right? So maybe if you could elaborate a bit on the performance tuning methods for Scrapy, that would be best for me as of now! Thanks a lot! Please do reply! :) This is a very big problem for me, that is the reason! :( – Ashutosh Saboo Jul 13 '15 at 06:24
  • You can find almost every answer on StackOverflow ;) http://stackoverflow.com/a/17030686/3941341 – GHajba Jul 13 '15 at 14:14
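For reference, below is a rough sketch of the kind of Scrapy settings people typically tune for speed. The values are illustrative assumptions, not recommendations; whether each one actually helps depends on the site and on how politely you want to crawl:

from scrapy.settings import Settings

settings = Settings()
settings.set("CONCURRENT_REQUESTS", 100)            # overall request parallelism
settings.set("CONCURRENT_REQUESTS_PER_DOMAIN", 16)  # parallelism per domain
settings.set("DOWNLOAD_DELAY", 0)                   # no artificial throttling
settings.set("HTTPCACHE_ENABLED", True)             # reuse responses across runs
settings.set("RETRY_ENABLED", False)                # skip retries if losing an occasional page is acceptable
settings.set("COOKIES_ENABLED", False)              # skip cookie handling if the site works without it
settings.set("LOG_LEVEL", "INFO")                   # reduce logging overhead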