Scrapy: upgrade the pipeline to send items

Question

I have a class in pipelines.py that sends items to my server's API:

import json

import requests


class MyPipeline:
    def process_item(self, item, spider):
        data = {
            "source_id": 'name_of_the_running_spider',
            "token": "token",
            "products": [dict(item)],
        }
        headers = {'Content-Type': 'application/json'}
        url = 'http://for.example.com/my-api/'
        requests.post(url=url, headers=headers, data=json.dumps(data))
        return item

The problem is that the pipeline sends a separate request for every single item ("products": [dict(item)]). Is it possible to pass a list of several items to "products" instead (for example, 10 item dicts per request)? In the spider itself this could be organized with a loop and a counter, but how can it be implemented in pipelines.py?

For the name of the spider, just include it as a field when you yield the items from the parsing method. — Alexander, Aug 25 '22 at 23:05
@Alexander Yes, thank you, it works that way. I would just like to know whether there are alternative methods. — m_sasha, Aug 26 '22 at 00:04

Alexander · Accepted Answer · 2022-08-26T21:53:07.873

After some testing I came up with a possible solution: the pipeline stores each item in a list, sends the whole list in a single request once its length reaches a certain threshold, and then resets the list to empty. In the pipeline's close_spider method you can then check whether any items are still waiting in the list and send those as well.

For the spider name, the pipeline's process_item method receives the spider instance, so to get the spider's name attribute all you need is spider.name. If you want the name of the spider class instead, you can read it from type(spider).__name__ or simply add the class name as an attribute on the spider and get it through spider.classname.
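A small sketch of the two lookups described above. It uses a stand-in class rather than a real scrapy.Spider so it runs on its own; QuotesSpider and the name "quotes" are made up for illustration. The class name is read with type(spider).__name__, which avoids parsing the string form of the type:

```python
class QuotesSpider:
    """Stand-in for a scrapy.Spider subclass; only `name` matters here."""
    name = "quotes"  # hypothetical spider name


spider = QuotesSpider()

# What the pipeline would get from the `spider` argument of process_item.
spider_name = spider.name

# Name of the spider class itself, no regex needed.
class_name = type(spider).__name__

print(spider_name, class_name)
```

Inside a real pipeline the same two expressions can be used directly on the spider argument.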

For example:

pipelines.py

import json

import requests


class MyPipeline:

    def __init__(self):
        self._request_data = []  # buffer of item dicts waiting to be sent
        self._url = 'http://for.example.com/my-api/'
        self._headers = {'Content-Type': 'application/json'}
        self._max_number_of_requests = 10  # number of items per POST request

    def process_item(self, item, spider):
        # Convert to a plain dict so json.dumps can serialize it.
        self._request_data.append(dict(item))
        # Send the batch as soon as the buffer is full.
        if len(self._request_data) >= self._max_number_of_requests:
            self.send_post_request(spider.name)
        return item

    def send_post_request(self, spidername):
        data = {"source_id": spidername,
                "token": "token",
                "products": self._request_data}
        response = requests.post(url=self._url,
                                 headers=self._headers,
                                 data=json.dumps(data))
        if response.status_code != 200:
            print(f"REQUEST FAILED: status code {response.status_code}")
        # Clear the buffer whether or not the request succeeded.
        self._request_data = []

    def close_spider(self, spider):
        # Send any items left over when the spider finishes.
        if len(self._request_data) > 0:
            self.send_post_request(spider.name)

Cool!!! I additionally moved the buffer size into the settings file. — m_sasha, Aug 27 '22 at 01:53
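Moving the buffer size into settings, as the comment above mentions, could look roughly like the sketch below. It relies on Scrapy's from_crawler hook and crawler.settings.getint. The setting name ITEM_BATCH_SIZE, the class name BatchingPipeline and the flush method are assumptions made for this example, and flush only stands in for the POST request from the answer:

```python
class BatchingPipeline:
    """Sketch: the batch size comes from settings instead of being hard-coded."""

    def __init__(self, batch_size):
        self._batch_size = batch_size
        self._request_data = []

    @classmethod
    def from_crawler(cls, crawler):
        # ITEM_BATCH_SIZE is a hypothetical setting name; define it in settings.py.
        # getint() falls back to the given default (10) when the setting is absent.
        return cls(batch_size=crawler.settings.getint("ITEM_BATCH_SIZE", 10))

    def process_item(self, item, spider):
        self._request_data.append(dict(item))
        if len(self._request_data) >= self._batch_size:
            self.flush(spider.name)
        return item

    def close_spider(self, spider):
        # Handle whatever is left when the spider finishes.
        if self._request_data:
            self.flush(spider.name)

    def flush(self, spidername):
        # Placeholder for the POST request: hands back the current batch
        # and empties the buffer.
        batch, self._request_data = self._request_data, []
        return batch
```

With ITEM_BATCH_SIZE = 50 in settings.py, Scrapy would build the pipeline through from_crawler and the buffer would be flushed every 50 items.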
