
I referred to several similar questions before posting this one (I don't currently have links to all of the questions I had consulted).

I am able to run this code completely if I don't pass the arguments and instead ask for input from the user inside the BBSpider class (without the main function, just below the name="dmoz" line), or if I provide them as pre-defined (i.e., static) arguments.

My code is here; a simplified sketch of it follows below.

I am basically trying to execute a Scrapy spider from a Python script without requiring any additional files (not even a settings file). That is why I have specified the settings inside the code itself as well.
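Roughly, the script looks like this (a simplified sketch; the BBSpider class name, the name = "dmoz" line, and the start_url keyword argument are as in my actual code, while the constructor body, parse method, and main block here are approximations):

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

class BBSpider(scrapy.Spider):
    name = "dmoz"

    def __init__(self, *args, **kwargs):
        super(BBSpider, self).__init__(*args, **kwargs)
        # start_url is expected as a keyword argument
        self.start_urls = [kwargs.get('start_url')]
        print self.start_urls[0]  # the print statement discussed below

    def parse(self, response):
        pass  # scraping logic omitted

if __name__ == '__main__':
    url = 'http://bigbasket.com/ps/?q=apple'
    spider = BBSpider(start_url=url)    # spider created directly here
    crawler = CrawlerProcess(Settings())
    crawler.crawl(spider)               # a spider instance, not the class, is passed
    crawler.start()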

This is the output that I am getting on executing this script:

http://bigbasket.com/ps/?q=apple
2015-06-26 12:12:34 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:12:34 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:12:34 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:12:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:12:35 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:12:35 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:12:35 [scrapy] INFO: Enabled item pipelines: 
2015-06-26 12:12:35 [scrapy] INFO: Spider opened
2015-06-26 12:12:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:12:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:12:35 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType:
2015-06-26 12:12:35 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:12:35 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 342543),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 339158)}
2015-06-26 12:12:35 [scrapy] INFO: Spider closed (finished)

The problems that I am currently facing:

  • If you look carefully at Line 1 and Line 6 of my output, the start_url that I passed to my spider got printed twice, even though I have written the print statement only once, on Line 31 of my code (linked above). Why is that happening, and with different values too? The first print (Line 1 of my output) gives the correct result, but the second one (Line 6 of my output) prints None. Not only that: even if I just write print 'hi', it also gets printed twice. Why is this happening?
  • Next, see this line of my output: TypeError: Request url must be str or unicode, got NoneType: Why is that coming up (even though the questions I linked above do the same thing)? I have no idea how to resolve it. I even tried `self.start_urls=[str(kwargs.get('start_url'))]`, and then it gives the following output (see my note after this log):
http://bigbasket.com/ps/?q=apple
2015-06-26 12:28:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:28:00 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:28:00 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:28:00 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:28:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:28:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:28:01 [scrapy] INFO: Enabled item pipelines: 
2015-06-26 12:28:01 [scrapy] INFO: Spider opened
2015-06-26 12:28:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:28:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:28:01 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: None
2015-06-26 12:28:01 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:28:01 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 248350),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 236056)}
2015-06-26 12:28:01 [scrapy] INFO: Spider closed (finished)
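
My guess at why the error changed: if start_url never actually reaches __init__, then kwargs.get('start_url') returns None, and str(None) is the literal string 'None', which passes the str/unicode type check but contains no http:// scheme:

str(None)  # -> 'None': a valid str, but with no URL scheme

That would explain why the TypeError turns into a ValueError.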

Please help me resolve the above two errors.

  • have you checked this answer? [How to run Scrapy from within a Python script](http://stackoverflow.com/questions/13437402/how-to-run-scrapy-from-within-a-python-script) – eLRuLL Jun 26 '15 at 11:37
  • @eLRuLL: Yes, I have checked them. First, it isn't mentioned there what changes need to be made inside the spider class, which is the core of my problem (both of the issues I listed above lie in that part of the code). Second, the way they call the spider to crawl is exactly what I have already done (if you see my code). Please do let me know how to resolve this! Thanks! – Ashutosh Saboo Jun 26 '15 at 11:47

1 Answer


You need to pass your parameters to the crawl method of the CrawlerProcess, so you need to run it like this:

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

crawler = CrawlerProcess(Settings())
crawler.crawl(BBSpider, start_url=url)  # keyword arguments are forwarded to the spider
crawler.start()
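
This works because crawl() forwards its keyword arguments to the spider's constructor (via Spider.from_crawler), so kwargs.get('start_url') inside BBSpider.__init__ now receives the URL. Passing an already-created BBSpider instance, by contrast, does not carry the arguments over: Scrapy builds a fresh spider from the class without them, which is also why the second print showed None.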
eLRuLL
  • Thanks, it worked perfectly. Just one clarification and a doubt. Why was issue 1 happening (why did it get printed twice)? And the doubt: if I want to execute 2 spiders in parallel using the multiprocessing library, can I pass a queue like this, then use queue.put(items), and finally access the spider's output from the main function of the script using queue.get()? Is it possible to do that? Could you give me sample code that does that? I would be really grateful if you could. – Ashutosh Saboo Jun 26 '15 at 12:29
  • Well, the duplicated print happened because you instantiated a Spider object before calling the crawler (that's the first print), and then you passed a Spider instance to the crawler, which didn't get any parameters (that's the second print). About the second one, I think it could be possible, but I don't have an example right now, sorry. – eLRuLL Jun 26 '15 at 12:45
  • Thanks a lot for your response; you have cleared my doubt. For the second part, could you help me by providing code for multiprocessing (using Python's multiprocessing library) 2 spiders of the same BBSpider class with 2 different start_urls? I tried it, but it gives me a weird error. It would be great if you could provide the code for it! – Ashutosh Saboo Jun 26 '15 at 13:02
  • Also, I checked several similar questions related to the doubt I asked above, but none of them seem to work. I only learnt Scrapy recently (a couple of days back), which is why I have this doubt. Please do try and help. Thanks! – Ashutosh Saboo Jun 26 '15 at 13:12
  • maybe create a different question and I (and more people) can help with that. – eLRuLL Jun 26 '15 at 13:40
  • Yes, sure. I will do it soon and post the link here. Thanks btw! It would be best if you keep your code ready, so that you can answer as soon as I post the question. Thanks! :) – Ashutosh Saboo Jun 26 '15 at 13:42
  • Please answer this question too - http://stackoverflow.com/questions/31087268/multiprocessing-of-scrapy-spiders-in-parallel-processes ! Thanks! – Ashutosh Saboo Jun 27 '15 at 09:49
  • Please answer that question too (the link I posted above). I still haven't found a constructive solution for that problem, so please do help me out on it. I would be really grateful. Thank you! :) – Ashutosh Saboo Jun 30 '15 at 04:32
  • How to access the arguments from within the spider? – PlsWork May 17 '19 at 15:51
  • @AnnaVopureta as direct class arguments, check [this answer](https://stackoverflow.com/a/41123138/858913) (sketched below) – eLRuLL May 17 '19 at 15:59
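
For reference, a minimal sketch of what the linked answer describes: keyword arguments passed to crawl() become attributes on the spider instance, so they can be read as self.<name> (start_url here is just the argument name used in this question):

import scrapy

class BBSpider(scrapy.Spider):
    name = "dmoz"

    def start_requests(self):
        # start_url was passed as crawler.crawl(BBSpider, start_url=url)
        yield scrapy.Request(self.start_url)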