After some testing, I came up with a possible solution: have the pipeline store each item in a list, and use a separate method to send the request. Once the list reaches a certain length, the pipeline sends the collected items in a single request and resets the list to empty. Then, in the pipeline's close_spider method, you can check whether any items remain unsent and send those.
For the spider name, the pipeline's process_item method receives the spider instance, so to get the spider's name attribute all you need is spider.name. If you want the name of the spider class instead, you can use type(spider).__name__, or simply add the class name as an attribute on the spider and read it through spider.classname.
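
To illustrate the difference between the two names, here is a minimal sketch. ProductSpider is a hypothetical stand-in for a scrapy.Spider subclass, so the snippet runs without Scrapy installed; a real spider exposes its name attribute the same way.

```python
class ProductSpider:
    # Hypothetical stand-in for a scrapy.Spider subclass.
    name = "products"


spider = ProductSpider()

spider_name = spider.name          # the spider's `name` attribute
class_name = type(spider).__name__  # the name of the spider's class

print(spider_name, class_name)
```
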
For example:
pipelines.py
import json

import requests
from itemadapter import ItemAdapter


class MyPipeline:
    def __init__(self):
        self._request_data = []  # items buffered since the last POST
        self._url = 'http://for.example.com/my-api/'
        self._headers = {'Content-Type': 'application/json'}
        self._max_number_of_requests = 10  # number of items per request

    def process_item(self, item, spider):
        # Convert to a plain dict so json.dumps can also handle scrapy.Item objects.
        self._request_data.append(ItemAdapter(item).asdict())
        # Send the batch as soon as the list reaches the threshold.
        if len(self._request_data) >= self._max_number_of_requests:
            self.send_post_request(spider.name)
        return item

    def send_post_request(self, spidername):
        data = {"source_id": spidername,
                "token": "token",
                "products": self._request_data}
        response = requests.post(url=self._url,
                                 headers=self._headers,
                                 data=json.dumps(data))
        if response.status_code != 200:
            print(f"REQUEST FAILED: status code {response.status_code}")
        # Note: the list is reset even if the request failed.
        self._request_data = []

    def close_spider(self, spider):
        # Send whatever is left over when the spider finishes.
        if self._request_data:
            self.send_post_request(spider.name)
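
If you want to check the batching behavior without hitting the API, the same idea can be sketched in isolation. This is only an illustration under my own assumptions: BatchBuffer, add, and flush are hypothetical names, and the send callable stands in for the POST request (here it just records each batch in a list).

```python
class BatchBuffer:
    """Collects items and hands them to `send` in fixed-size batches."""

    def __init__(self, send, batch_size=10):
        self._send = send              # callable that receives one batch (a list)
        self._batch_size = batch_size
        self._items = []

    def add(self, item):
        self._items.append(item)
        # Flush as soon as the buffer reaches the threshold.
        if len(self._items) >= self._batch_size:
            self.flush()

    def flush(self):
        # Send only if there is something to send, then start a new list.
        if self._items:
            self._send(self._items)
            self._items = []


sent_batches = []  # stands in for the remote API
buffer = BatchBuffer(send=sent_batches.append, batch_size=10)

for i in range(25):
    buffer.add({"id": i})
buffer.flush()  # the equivalent of what close_spider does

batch_sizes = [len(batch) for batch in sent_batches]
print(batch_sizes)  # [10, 10, 5]
```

With 25 items and a batch size of 10, two full batches go out during the crawl and the remaining 5 items are sent by the final flush, which is the role close_spider plays in the pipeline.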