I'm scraping online shops with Scrapy and python-requests. After I collect all the item info, I make one more request with python-requests to get the available quantity. After several minutes the spider stops working, and I don't know what is causing the trouble. Any suggestions?

Scrapy Log:

2014-05-08 15:27:57+0300 [scrapy] DEBUG: Start adding sku1270594 to a cart.
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): www.sds.com.au
DEBUG:requests.packages.urllib3.connectionpool:"GET /product/trefoil-tee-by-adidas-in-black-camo-grey HTTP/1.1" 200 20223
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): www.sds.com.au
DEBUG:requests.packages.urllib3.connectionpool:"POST /common/ajaxResponse.jsp;jsessionid=34E95C7662D0F5084FF971CC5693E6E8.store-node1?_DARGS=/browse/product.jsp.addToCartForm HTTP/1.1" 200 146
2014-05-08 15:27:59+0300 [scrapy] DEBUG: End adding sku1270594 to a cart.
2014-05-08 15:27:59+0300 [scrapy] DEBUG: Success. quantity of sku1270594 is 16.
2014-05-08 15:28:00+0300 [sds] DEBUG: Updating  product info sku1270594
2014-05-08 15:28:00+0300 [sds] DEBUG: Added new price sku1270594
2014-05-08 15:28:00+0300 [sds] DEBUG: Scraped from <200 http://www.sds.com.au/product/trefoil-tee-by-adidas-in-black-camo-grey>
2014-05-08 15:28:00+0300 [sds] DEBUG: Updating  product info sku901159
2014-05-08 15:28:00+0300 [sds] DEBUG: Added new price sku901159
2014-05-08 15:28:00+0300 [sds] DEBUG: Scraped from <200 http://www.sds.com.au/product/two-palm-tee-by-folke-in-chalk>
2014-05-08 15:28:00+0300 [sds] DEBUG: Updating  product info sku901163
2014-05-08 15:28:00+0300 [sds] DEBUG: Added new price sku901163
2014-05-08 15:28:00+0300 [sds] DEBUG: Scraped from <200 http://www.sds.com.au/product/two-palm-tee-by-folke-in-chalk>
2014-05-08 15:28:00+0300 [scrapy] DEBUG: Start adding sku1270591 to a cart.
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): www.sds.com.au
DEBUG:requests.packages.urllib3.connectionpool:"GET /product/trefoil-tee-by-adidas-in-black-camo-grey HTTP/1.1" 200 20225
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): www.sds.com.au

And that's it. Nothing more appears in the console. Here's the function that gets the quantity:

def get_qty(self, item):
    r = requests.get(item['url'])
    cookie_cart_user = dict(r.cookies)
    sel = Selector(text=r.text, type="html")
    session = sel.xpath('//input[@name="_dynSessConf"]/@value').extract()[0]
    # print session
    # print cookie_cart_user
    add_to_cart_url = 'http://www.sds.com.au/common/ajaxResponse.jsp;jsessionid=%s?_DARGS=/browse/product.jsp.addToCartForm' % cookie_cart_user['JSESSIONID']
    # ok, so we're adding one item
    log.msg("Adding %s to a cart." % item['internal_id'], log.DEBUG)
    headers = {
        'User-Agent': USER_AGENT,
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Connection': 'close',
    }
    s = requests.session()
    s.keep_alive = False
    r = requests.post(add_to_cart_url,
                      data=self.generate_form_data(item, 10000, session),
                      cookies=cookie_cart_user,
                      headers=headers,
                      timeout=10)
    response = r.json()
    r.close()
    try:
        quantity = int(re.findall(u'\d+', response['formErrors'][0]['errorMessage'])[0])
        log.msg("Success. quantity of %s is %s." % (item['internal_id'], quantity), log.DEBUG)
        return quantity
    except Exception, e:
        log.msg('Error getting data-cart-item on product %s. Error: %s' % (item['internal_id'], str(e)), log.ERROR)
        with open("log/%s.html" % item['internal_id'], "w") as myfile:
            myfile.write('%s' % r.text.encode('utf-8'))
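
The parsing step at the end of `get_qty` can be pulled out into a small helper, which makes the failure path explicit (the function above implicitly returns `None` when parsing fails). A minimal sketch, assuming the response JSON has the `formErrors[0]['errorMessage']` shape used above; the helper name `parse_quantity` and the sample message text are hypothetical, since the thread never shows the real error message:

```python
import re


def parse_quantity(response_json):
    """Return the first integer in the first form error message, or None.

    Assumes the shape used in get_qty above:
    {"formErrors": [{"errorMessage": "..."}]}
    """
    try:
        message = response_json['formErrors'][0]['errorMessage']
        match = re.search(r'\d+', message)
    except (KeyError, IndexError, TypeError):
        # Unexpected payload, e.g. no formErrors or a non-string message.
        return None
    return int(match.group()) if match else None


# Hypothetical payload; the real message text is not shown in the thread.
sample = {'formErrors': [{'errorMessage': 'Only 16 items are available.'}]}
print(parse_quantity(sample))
```

Keeping the parsing separate from the HTTP calls also means it can be checked without touching the shop's server.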
Vladimir Tsyupko
  • When you rerun your script, does it work immediately (at least for the first few requests to the site), or does it fail for a while? It is possible that the site refuses to serve you because your request rate is too high. This might be true even if restarting your script works well, since the blocking might be tied to the established session id. – Jan Vlcinsky May 06 '14 at 17:55
  • Yeah, when I rerun it, it's the same. It works for a while (8 minutes max) and stops. Jan, do you have any suggestions on how I should try to fix this? Thanks in advance. – Vladimir Tsyupko May 06 '14 at 18:24
  • @user32223824 Check whether the site declares any request rate limits. If so, try to follow them, possibly by adding some `time.sleep()` between your requests. – Jan Vlcinsky May 06 '14 at 18:31
  • @JanVlcinsky Yeah, it helped a bit, but the spider stops anyway. I'm almost sure that it's an issue with the requests library. – Vladimir Tsyupko May 07 '14 at 10:05
  • `requests` is quite stable nowadays. Anyway, you should enable detailed logging from requests itself to see more. Instructions are here: http://stackoverflow.com/a/16337639/346478 – Jan Vlcinsky May 07 '14 at 10:37
  • @JanVlcinsky I've enabled logging and updated the log in my post; there are no errors. – Vladimir Tsyupko May 08 '14 at 12:37
  • Good. Now we can see that it gets stuck at an HTTP request. It starts the request but never completes it. Try adding `timeout` to your request, as described here http://docs.python-requests.org/en/latest/user/quickstart/?highlight=timeout#timeouts . See also http://stackoverflow.com/questions/17782142/why-requests-get-doesnt-return-what-is-the-default-timeout-that-get-uses – Jan Vlcinsky May 08 '14 at 13:09
  • @Jan Vlcinsky, big thank you! Seems like adding max_retries is the best option for me, since I need to get the data anyway. – Vladimir Tsyupko May 08 '14 at 13:15
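
The timeout suggestion from the comments could look roughly like the sketch below. `requests` has no default timeout, so a server that accepts the connection and then stalls can block the spider until the operating system gives up. The helper name `fetch_or_none` and the timeout values are assumptions for illustration, not taken from the thread:

```python
import requests

# (connect, read) timeouts in seconds; illustrative values, tune per site.
TIMEOUT = (5, 10)


def fetch_or_none(url, http=requests, timeout=TIMEOUT):
    """GET url; return the response, or None if the request timed out.

    `http` is anything with a requests-style get(), such as the requests
    module itself or a requests.Session.
    """
    try:
        return http.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Raised for both connect and read timeouts.
        return None
```

Returning `None` is only one possible policy; the caller could equally log the URL and re-queue the item.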

1 Answer

Well, Jan Vlcinsky recommended digging into the logging of requests, and after some digging I decided to reorganize my code a little, which gave me the right answer. Now everything works great.

def get_qty(self, item):
    log.msg("Start adding %s to a cart." % item['internal_id'], log.DEBUG)
    logging.basicConfig(level=logging.DEBUG)
    sess = requests.Session()
    sess.keep_alive = False
    adapter = HTTPAdapter(max_retries=50)
    sess.mount('http://', adapter)
    r = sess.get(item['url'])
    cookie_cart_user = dict(r.cookies)
    sel = Selector(text=r.text, type="html")
    session = sel.xpath('//input[@name="_dynSessConf"]/@value').extract()[0]
    add_to_cart_url = 'http://www.sds.com.au/common/ajaxResponse.jsp;jsessionid=%s?_DARGS=/browse/product.jsp.addToCartForm' % cookie_cart_user['JSESSIONID']
    headers = {
        'User-Agent': USER_AGENT,
        'Accept': 'application/json, text/javascript, */*; q=0.01',
    }
    r = sess.post(add_to_cart_url,
                  data=self.generate_form_data(item, 10000, session),
                  cookies=cookie_cart_user,
                  headers=headers)
    log.msg("End adding %s to a cart." % item['internal_id'], log.DEBUG)
    try:
        response = r.json()
        r.close()
        quantity = int(re.findall(u'\d+', response['formErrors'][0]['errorMessage'])[0])
        log.msg("Success. quantity of %s is %s." % (item['internal_id'], quantity), log.DEBUG)
        return quantity
    except Exception, e:
        log.msg('Error getting data-cart-item on product %s. Error: %s' % (item['internal_id'], str(e)), log.ERROR)
        with open("log/%s.html" % item['internal_id'], "w") as myfile:
            myfile.write('%s' % r.text.encode('utf-8'))

And now, if an error occurs, the log says:

2014-05-08 16:00:10+0300 [scrapy] DEBUG: Start adding sku1210352 to a cart.
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): www.sds.com.au
WARNING:requests.packages.urllib3.connectionpool:Retrying (50 attempts remain) after connection broken by 'error(60, 'Operation timed out')': /product/startlet-gilet-fleece-jacket-by-zoo-york-in-black
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (2): www.sds.com.au
DEBUG:requests.packages.urllib3.connectionpool:"GET /product/startlet-gilet-fleece-jacket-by-zoo-york-in-black HTTP/1.1" 200 20278
DEBUG:requests.packages.urllib3.connectionpool:"POST /common/ajaxResponse.jsp;jsessionid=EEA02CE768B288DD302896F6A8C4780F.store-node2?_DARGS=/browse/product.jsp.addToCartForm HTTP/1.1" 200 145
2014-05-08 16:01:14+0300 [scrapy] DEBUG: End adding sku1210352 to a cart.

After that it retries and continues as if nothing had happened.

Vladimir Tsyupko
  • Oh, it's stopped again, and it's not retrying... alas. – Vladimir Tsyupko May 08 '14 at 13:35
  • It seems like the problem is on the server side rather than in your client code. If you can, consult the admins of the web app; you might find there are rules you must follow to be allowed to make many requests in sequence. Or their service is simply shaky, and your house is being built on this sandy base. – Jan Vlcinsky May 08 '14 at 13:53
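
`max_retries` only controls how often a failed attempt is repeated, not how long a single attempt may hang, so one option is to combine retries with an explicit timeout on every call. The answer's code passes no `timeout`, which may be why it stalled again, though the thread does not confirm this. A sketch, assuming a reasonably recent `requests` and `urllib3`; `make_session` and all numeric values are illustrative:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=5, backoff_factor=0.5):
    """Build a Session whose adapters retry with exponential backoff."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,     # sleep grows between attempts
        status_forcelist=[502, 503, 504],  # also retry on these statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


# Usage (commented out so the sketch runs without network access):
# sess = make_session()
# r = sess.get(item['url'], timeout=(5, 10))
```

With its default settings, urllib3's `Retry` does not re-send non-idempotent methods such as POST after a read error or a listed status code, so the add-to-cart POST may still need its own handling. Retries also cannot help if the server is deliberately throttling the client, as the last comment suggests.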