
I want to use the output from a spider inside a Python script. To accomplish this, I wrote the following code based on another thread.

The issue I'm facing is that the function spider_results() only returns a list containing the last item over and over again instead of a list of all the found items. When I run the same spider manually with the scrapy crawl command, I get the desired output. The output of the script, the JSON output from the manual run, and the spider itself are below.

What's wrong with my code?

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from circus.spiders.circus import MySpider

from scrapy.signalmanager import dispatcher


def spider_results():
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)


    dispatcher.connect(crawler_results, signal=signals.item_passed)

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
    return results


if __name__ == '__main__':
    print(spider_results())

Script output:

[{'away_odds': 1.44,
 'away_team': 'Los Angeles Dodgers',
 'event_time': datetime.datetime(2019, 6, 8, 2, 15),
 'home_odds': 2.85,
 'home_team': 'San Francisco Giants',
 'last_update': datetime.datetime(2019, 6, 6, 20, 58, 41, 655497),
 'league': 'MLB'}, {'away_odds': 1.44,
 'away_team': 'Los Angeles Dodgers',
 'event_time': datetime.datetime(2019, 6, 8, 2, 15),
 'home_odds': 2.85,
 'home_team': 'San Francisco Giants',
 'last_update': datetime.datetime(2019, 6, 6, 20, 58, 41, 655497),
 'league': 'MLB'}, {'away_odds': 1.44,
 'away_team': 'Los Angeles Dodgers',
 'event_time': datetime.datetime(2019, 6, 8, 2, 15),
 'home_odds': 2.85,
 'home_team': 'San Francisco Giants',
 'last_update': datetime.datetime(2019, 6, 6, 20, 58, 41, 655497),
 'league': 'MLB'}]

Json output with scrapy crawl:

[
{"home_team": "Los Angeles Angels", "away_team": "Seattle Mariners", "event_time": "2019-06-08 02:07:00", "home_odds": 1.58, "away_odds": 2.4, "last_update": "2019-06-06 20:48:16", "league": "MLB"},
{"home_team": "San Diego Padres", "away_team": "Washington Nationals", "event_time": "2019-06-08 02:10:00", "home_odds": 1.87, "away_odds": 1.97, "last_update": "2019-06-06 20:48:16", "league": "MLB"},
{"home_team": "San Francisco Giants", "away_team": "Los Angeles Dodgers", "event_time": "2019-06-08 02:15:00", "home_odds": 2.85, "away_odds": 1.44, "last_update": "2019-06-06 20:48:16", "league": "MLB"}
]

MySpider:

from scrapy.spiders import Spider
from ..items import MatchItem
import json
import datetime
import dateutil.parser

class MySpider(Spider):
    name = 'first_spider'

    start_urls = ["https://websiteXYZ.com"]

    def parse(self, response):
        item = MatchItem()

        timestamp = datetime.datetime.utcnow()

        response_json = json.loads(response.body)

        for event in response_json["el"]:
            for team in event["epl"]:
                if team["so"] == 1: item["home_team"] = team["pn"]
                if team["so"] == 2: item["away_team"] = team["pn"]

            for market in event["ml"]:
                if market["mn"] == "Match result":
                    item["event_time"] = dateutil.parser.parse(market["dd"]).replace(tzinfo=None)
                    for outcome in market["msl"]:
                        if outcome["mst"] == "1": item["home_odds"] = outcome["msp"]
                        if outcome["mst"] == "X": item["draw_odds"] = outcome["msp"]
                        if outcome["mst"] == "2": item["away_odds"] = outcome["msp"]

                if market["mn"] == 'Moneyline':
                    item["event_time"] = dateutil.parser.parse(market["dd"]).replace(tzinfo=None)
                    for outcome in market["msl"]:
                        if outcome["mst"] == "1": item["home_odds"] = outcome["msp"]
                        #if outcome["mst"] == "X": item["draw_odds"] = outcome["msp"]
                        if outcome["mst"] == "2": item["away_odds"] = outcome["msp"]


            item["last_update"] = timestamp
            item["league"] = event["scn"]

            yield item

Edit:

Based on the answer below, I tried the following two scripts:

controller.py

import json
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor, defer
from betsson_controlled.spiders.betsson import Betsson_Spider
from scrapy.utils.project import get_project_settings


class MyCrawlerRunner(CrawlerRunner):
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        # create crawler (Same as in base CrawlerProcess)
        crawler = self.create_crawler(crawler_or_spidercls)

        # handle each item scraped
        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        # create Twisted.Deferred launching crawl
        dfd = self._crawl(crawler, *args, **kwargs)

        # add callback - when crawl is done cal return_items
        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items

def return_spider_output(output):
    return json.dumps([dict(item) for item in output])

settings = get_project_settings()
runner = MyCrawlerRunner(settings)
spider = Betsson_Spider()
deferred = runner.crawl(spider)
deferred.addCallback(return_spider_output)


reactor.run()
print(deferred)

When I execute controller.py, I get:

<Deferred at 0x7fb046e652b0 current result: '[{"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}]'>
  • This is a shot in the dark, but they've refactored how the crawler runner works in the newly released Scrapy. See the changes made here in the docs and decide if it may help your cause. Your result indicates that your deferred is working, but somehow the spider is either not finishing or not closing. https://docs.scrapy.org/en/1.7/news.html – ThePyGuy Jul 18 '19 at 23:34
  • Thanks for thinking of me. I'll look into it. Not sure if I'll keep using Scrapy for this project at all, if it's that complicated to implement such a simple functionality. – Chris1309 Jul 20 '19 at 11:46
  • I know that my answer is the correct answer we are just missing something. I have this code running production on an API endpoint. but i know the feeling when trying to figure something like this out. Making a requests implementation with all the items and features of scrapy to run concurrently though would probably be as difficult as resolving this issue. We at least know that deferred is working as a callback so you should be able to troubleshoot the problem from here. – ThePyGuy Jul 20 '19 at 18:06
  • Try to run your code in a crawl function like I did in the last piece of code with the defer callbacks decorator and see if that does anything (a sketch of that pattern is right after these comments). I think you may have to stop the reactor for the code to finish executing. reactor.run() is supposed to block until the script is done, but it's never finishing. Once it's done, all your items should be in the deferred variable.... – ThePyGuy Jul 20 '19 at 18:08
  • Updated answer with another stab at it... try CrawlerProcess instead of Runner; it seems more like what you need, whereas I needed Runner. – ThePyGuy Jul 24 '19 at 09:30
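
For reference, the "crawl function with the defer callbacks decorator" mentioned in the comments follows the pattern from the Scrapy docs for running a crawl from a script with CrawlerRunner and stopping the reactor when the crawl finishes. A minimal sketch, assuming a project spider such as MySpider (the import path is a placeholder, not part of the thread):

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from circus.spiders.circus import MySpider  # placeholder: your own spider

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    # yield the Deferred returned by crawl(); execution resumes here
    # once the spider has finished
    yield runner.crawl(MySpider)
    reactor.stop()  # stop the reactor so reactor.run() can return

crawl()
reactor.run()  # blocks here until reactor.stop() is called above

The same idea applies to the subclassed runner discussed below: without a reactor.stop() call, reactor.run() never returns, which matches the "endless loop" behaviour described later in the comments.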

1 Answer

RECENT EDITS: After reading CrawlerProcess vs CrawlerRunner, I realized that you probably want CrawlerProcess. I had to use the runner since I needed Klein to be able to use the deferred object. CrawlerProcess expects only Scrapy to be running, whereas CrawlerRunner expects other scripts/programs to interact with it. Hope this helps.

You need to modify CrawlerRunner/Process and use signals and/or callbacks to pass the items from the CrawlerRunner back into your script.

How to integrate Flask & Scrapy? If you look at the options in the top answer there, the one with Twisted, Klein and Scrapy is an example of what you are looking for, since it does the same thing except it sends the results to a Klein HTTP server after the crawl. You can set up a similar method with the CrawlerRunner to send each item to your script as it is crawling. NOTE: that particular question sends the results to the Klein web server after the items are collected. The answer is for making an API which collects the results, waits until crawling is done, and dumps them to JSON, but you can apply the same method to your situation. The main thing to look at is how CrawlerRunner was sub-classed and extended to add the extra functionality.

What you want to do is have a separate script, which you execute, that imports your Spider and extends CrawlerRunner. When you execute this script, it will start your Twisted reactor and start the crawl process using your customized runner.

That said, this problem could probably be solved with an item pipeline: create a custom item pipeline that passes each item into your script before returning it.
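
As an illustration of that idea, here is a minimal sketch of such a collector pipeline. The module path circus.pipelines and the class name ItemCollectorPipeline are made-up placeholders, not part of the original project; the pipeline simply appends every item to a list that the calling script can read once process.start() returns.

# circus/pipelines.py  (hypothetical module path)
class ItemCollectorPipeline:
    # class-level list shared by every instance of the pipeline
    items = []

    def process_item(self, item, spider):
        ItemCollectorPipeline.items.append(item)
        return item  # return the item so later pipelines still see it


# calling script
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from circus.spiders.circus import MySpider
from circus.pipelines import ItemCollectorPipeline  # hypothetical import

settings = get_project_settings()
# enable the collector pipeline just for this run
settings.set("ITEM_PIPELINES", {"circus.pipelines.ItemCollectorPipeline": 100})

process = CrawlerProcess(settings)
process.crawl(MySpider)
process.start()  # blocks until the crawl is finished

print(ItemCollectorPipeline.items)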

# main.py

import json
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from twisted.internet import reactor, defer # import we missed
from myproject.spiders.mymodule import MySpiderName
from scrapy.utils.project import get_project_settings


class MyCrawlerProcess(CrawlerProcess):
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        crawler = self.create_crawler(crawler_or_spidercls)

        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        dfd = self._crawl(crawler, *args, **kwargs)

        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items


def return_spider_output(output):
    return json.dumps([dict(item) for item in output])


process = MyCrawlerProcess(get_project_settings())
deferred = process.crawl(MySpiderName)
deferred.addCallback(return_spider_output)


process.start()  # the script should block here again, but I'm not sure it will work right without using reactor.run()
print(deferred)

Again, this code is a guess I haven't tested; I hope it sets you in a better direction.

  • I think using the item pipeline wouldn't help, since I need all items in one list inside the other script, and I don't want to run a particular script on each item. I would think there must be a simpler solution than the one proposed in the mentioned thread. Wanting to use the scraped data inside a script without first writing it to a database shouldn't be that uncommon. – Chris1309 Jul 04 '19 at 18:36
  • So you want to pass them all at the same time when it's done? If that's the case, the solution I showed you with the Klein API is what you need. You could also chain the two commands together: scrapy crawl foo -o bar.csv && python foobar bar.csv. Scrapy is async, so it makes things different. My answer about creating your own CrawlerRunner to gather up the items I believe is correct. If you want to pass them one at a time, the item pipeline would work fine. – ThePyGuy Jul 04 '19 at 21:33
  • I want to have one main_script from which I can: 1) run different crawlers, which return a list with all the found items back to main_script when they are done 2) process the data 3) repeat – Chris1309 Jul 04 '19 at 22:31
  • I edited my post above. It would be nice if you could take a look. – Chris1309 Jul 04 '19 at 23:06
  • See changes I made. This should work okay. It should run, collect all items, and then they should be available in deferred. – ThePyGuy Jul 10 '19 at 20:44
  • Also please note it's kind of pseudo code. I added in support for a 2nd script using the same runner. You can then combine the two lists together. If you need this to happen as each item is scraped, and not after all items are collected, then you really should be doing this in pipelines.py – ThePyGuy Jul 10 '19 at 21:02
  • I implemented your code (see edit in my post). So how do I get an actual list of the items from the deferred object? – Chris1309 Jul 11 '19 at 22:25
  • Try str(deferred) or dict(deferred). Maybe someone else can help from here; I implemented the solution, and returning the deferred from my API endpoint returned the full item list in JSON without adding any decorations, so I knew this was a component of your solution. I'm sure https://twistedmatrix.com/documents/16.2.0/core/howto/defer.html has something to do with it. From here you must use the force. – ThePyGuy Jul 12 '19 at 03:15
  • possibly [dict(x) for x in deferred] – ThePyGuy Jul 12 '19 at 03:19
  • I tried the methods you suggested without success. If I use [dict(x) for x in deferred], the script never finishes. – Chris1309 Jul 12 '19 at 11:07
  • I'm really not sure what you're trying to achieve still. Is there a reason an item pipeline won't do the job? If you need all the items at once, the crawler runner is the way to do it and the way I did it in the past. I don't know, but perhaps the way you are executing the bot inside of that function could be a reason; it may be feeding you the deferred item before the bot runs? I don't know how to make it work for your specific use case since it just worked for me, but maybe someone else does. At least I'm fairly certain I've given you the correct path if you require all items returned at once. – ThePyGuy Jul 12 '19 at 19:56
  • I want to run multiple spiders to scrape betting odds from a variety of sites from the main script every minute (the data is not static) and compare the results of the sites in the main script. All of this should be done in memory. So ideally I want to call a function x inside the main script that executes a crawler and returns a list of all the found items from one Spider. – Chris1309 Jul 12 '19 at 20:16
  • Throw out all of my code suggestions and go back to your original code. I think the reason that you are getting each item individually is because you are connected to the item_passed signal, so the callback is issued each time it receives that signal. I'm not sure if you could store the items in a list somehow and link up to the spider finished or spider closed signal; then you could return all the items at once. Other than that, the only way I know to get all items is the way that I showed you. I'm sure that you understand the code of overriding the CrawlerRunner. I just wonder why deferred is not working right – ThePyGuy Jul 13 '19 at 01:45
  • It appears your original spider is working as intended, correct? When you run with scrapy crawl you're bypassing the modifications you made with the signals. Your original spider is returning the full list of items, which is what you wanted, right? – ThePyGuy Jul 13 '19 at 01:48
  • Yeah! I was wondering if I could collect the items in a list in a custom item pipeline and return the list when the spider is finished, but couldn't get it to work. If I execute the first script, it returns the script output mentioned in my post, which is not what I want because it's a list with the same item over and over again and not a list of all found items. The spider itself runs fine if I run it with scrapy crawl and returns all items if I choose an output like json. I'm not sure if that's what you are referring to? – Chris1309 Jul 13 '19 at 15:23
  • I also asked the question on Github, but I'm still confused about how to implement the suggested solution. https://github.com/scrapy/scrapy/issues/3856 – Chris1309 Jul 13 '19 at 15:26
  • Try changing the signal you are connected to, to spider closed or equivalent -- I think you're still going to end up with a modified runner though. It's confusing, but if you've done JavaScript before, what deferred means is that it's adding an asynchronous callback to your signal or function. So whenever this happens there's automatically a callback. That's how JS works, and Scrapy uses Twisted to make it asynchronous. – ThePyGuy Jul 13 '19 at 19:04
  • When I get some time I'll try to get this to work but right now I'm inundated with work :( – ThePyGuy Jul 13 '19 at 19:44
  • If you want give me the URL or I will just make the runner work on another URL to demonstrate – ThePyGuy Jul 13 '19 at 20:48
  • I really appreciate your effort. I think the spider doesn't matter, so you can use any URL. :) I'll look into changing the signal. – Chris1309 Jul 14 '19 at 20:09
  • Yeah, I have tried and the path is laid; there is just something simple standing in the way of it returning the JSON as expected, since I had an application like this in production for some time and the script was blocking, so: 1. make API request, 2. wait until the script is done, 3. JSON is delivered. You can see something closer to my actual code in the link of my answer where I used it in conjunction with a Klein endpoint. – ThePyGuy Jul 14 '19 at 23:47
  • Can you share the exact code that works? I don't really get where I made a mistake. – Chris1309 Jul 15 '19 at 08:07
  • This code above that I just edited works. You have to start the Twisted reactor when you're using the crawler runner... you probably want to stop the reactor too. So maybe make a crawl function that starts your multiple spiders and stops the reactor at the end (there is a sketch of that after this comment thread). Running reactor.run() at the end, the callback is executed and I'm given a list of JSON objects. Just be sure to update all the myproject and myspider references to your code. When it works please accept the answer :) – ThePyGuy Jul 15 '19 at 08:26
  • Code above is tested and working now. https://imgur.com/a/k6Wqgmd Note the import section and also reactor.run(). The second method is from the documentation for running multiple spiders, and it shuts the reactor down. – ThePyGuy Jul 15 '19 at 08:45
  • Thanks, I updated controller.py, but it's still not working. Do you spot any errors? The script seems to get into an endless loop without returning anything. – Chris1309 Jul 15 '19 at 08:45
  • Check your answer, I've edited it. There is still an issue with reactor.run not finishing, but if you hit Ctrl-C you will still get the deferred object, though you will only get a partial list of items... – ThePyGuy Jul 15 '19 at 08:56
  • If I execute controller.py like in my first post, it doesn't return anything if I hit control C. – Chris1309 Jul 15 '19 at 09:14
  • If I get some more time I will try to get it working properly. Even if you try printing deferred after reactor.run()? – ThePyGuy Jul 15 '19 at 09:49
  • Okay, tried it again. I had some datetime objects that couldn't be serialized. Now I get results if I print deferred after reactor.run() and hit Ctrl-C. The only issue is that it's the same item over and over again, as described in my opening post. – Chris1309 Jul 15 '19 at 09:57
  • Yeah, so it's working, just not sure why yours is working the way it is. I would throw out that 3rd script you're using called main; it just adds confusion. Do all your logic in the runner script. reactor.run() is supposed to block, and deferred is supposed to hook into the item signal and add the item to self.items=[]; then the callback at the end returns self.items as JSON all at once. It works 100% on my production server with Klein. Why it's working differently here I'm not sure. You have to make sure your script terminates and completes or it will keep adding items to self.items. – ThePyGuy Jul 16 '19 at 02:10
  • Hey! I'm having the exact same issue and haven't been able to figure out why I'm getting the same item repeatedly appended to the results list. Right now, my script is returning the first item scraped per page over and over. Any additional insight to share? – Erin Jun 07 '21 at 19:08
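
One way to get an actual Python list out of the deferred discussed in these comments, rather than printing the Deferred's repr, is to attach a callback that copies the result into a plain variable and then stop the reactor so reactor.run() returns on its own. The following is only a rough, untested sketch (not the accepted fix), reusing the MyCrawlerRunner subclass and Betsson_Spider from the controller.py in the question:

from twisted.internet import reactor
from scrapy.utils.project import get_project_settings
from betsson_controlled.spiders.betsson import Betsson_Spider
# assume MyCrawlerRunner from controller.py above is defined or imported here

collected = []  # plain list readable after the reactor has stopped

runner = MyCrawlerRunner(get_project_settings())
deferred = runner.crawl(Betsson_Spider)
deferred.addCallback(collected.extend)      # the crawl's result is runner.items
deferred.addBoth(lambda _: reactor.stop())  # stop the reactor on success or failure

reactor.run()  # blocks until reactor.stop() fires
print(collected)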