I have the following custom pipeline for downloading JSON files. It was working fine until I needed to add an __init__ method to my FilesPipeline subclass in order to set a few new properties. The pipeline takes URLs that point to API endpoints and downloads their responses. The folders are created properly when running the spider via scrapy crawl myspider, and the two print statements in the file_path method show the correct values (filename and file path). However, the files are never actually downloaded.
I did find a few similar questions about custom file pipelines where files were not downloading (here, where the solution was to yield the items instead of returning them, and here, where the solution was to adjust the ROBOTSTXT_OBEY setting), but neither solution worked for me.
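To illustrate the first of those, the yield-instead-of-return pattern from that answer looks roughly like this in a spider callback (a hypothetical, stripped-down spider; the field names match what my pipeline expects):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/store/c/some-product/ID=prod123-product']

    def parse(self, response):
        # yield (rather than return) so the item reaches the item pipelines
        yield {
            'url': response.url,
            'json_url': 'https://example.com/vpd/v1/products/prod123/ProductInfo+ProductDetails',
        }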
What am I doing wrong (or forgetting to do) when subclassing FilesPipeline? I've been racking my brain over this for a good three hours, and my google-fu has not turned up a resolution for my case.
import os
import re
from urllib.parse import urlparse

from scrapy import Request
from scrapy.pipelines.files import FilesPipeline


class LocalJsonFilesPipeline(FilesPipeline):

    FILES_STORE = "json_src"
    FILES_URLS_FIELD = "json_url"
    FILES_RESULT_FIELD = "local_json"

    def __init__(self, store_uri, use_response_url=False, filename_regex=None, settings=None):
        # super(LocalJsonFilesPipeline, self).__init__(store_uri)
        self.store_uri = store_uri
        self.use_response_url = use_response_url
        if filename_regex:
            self.filename_regex = re.compile(filename_regex)
        else:
            self.filename_regex = filename_regex
        super(LocalJsonFilesPipeline, self).__init__(store_uri, settings=settings)

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.spider:
            return BasePipeline()
        store_uri = f'{cls.FILES_STORE}/{crawler.spider.name}'
        settings = crawler.spider.settings
        use_response_url = settings.get('JSON_FILENAME_USE_RESPONSE_URL', False)
        filename_regex = settings.get('JSON_FILENAME_REGEX')
        return cls(store_uri, use_response_url, filename_regex, settings)

    def parse_path(self, value):
        if self.filename_regex:
            try:
                return self.filename_regex.findall(value)[0]
            except IndexError:
                pass
        # fallback method in the event no regex is provided by the spider
        # example: /p/russet-potatoes-5lb-bag-good-38-gather-8482/-/A-77775602
        link_path = os.path.splitext(urlparse(value).path)[0]  # omit extension if there is one
        link_params = link_path.rsplit('/', 1)[1]  # preserve the last portion separated by forward-slash (A-77775602)
        return link_params if '=' not in link_params else link_params.split('=', 1)[1]

    def get_media_requests(self, item, info):
        json_url = item.get(self.FILES_URLS_FIELD)
        if json_url:
            filename_url = json_url if not self.use_response_url else item.get('url', '')
            return [Request(json_url, meta={'filename': self.parse_path(filename_url), 'spider': info.spider.name})]

    def file_path(self, request, response=None, info=None):
        final_path = f'{self.FILES_STORE}/{request.meta["spider"]}/{request.meta["filename"]}.json'
        print('url', request.url)
        print('downloading to', final_path)
        return final_path
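For reference, the documented baseline for a FilesPipeline subclass is just to override file_path and return a path relative to the FILES_STORE setting, which the stock from_settings classmethod reads to build store_uri. A minimal sketch of that shape (hypothetical class name; it assumes the same get_media_requests as above so that request.meta carries the filename and spider name):

from scrapy.pipelines.files import FilesPipeline


class MinimalJsonFilesPipeline(FilesPipeline):

    def file_path(self, request, response=None, info=None):
        # the returned path is interpreted relative to the FILES_STORE
        # setting, not as an absolute path
        return f'{request.meta["spider"]}/{request.meta["filename"]}.json'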
And here are the custom settings of my spider:
class MockSpider(scrapy.Spider):

    name = 'mock'

    custom_settings = {
        'ITEM_PIPELINES': {
            'mock.pipelines.LocalJsonFilesPipeline': 200
        },
        'JSON_FILENAME_REGEX': r'products\/(.+?)\/ProductInfo\+ProductDetails'
    }
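(For comparison: because the subclass hardcodes FILES_STORE as a class attribute, I never set the FILES_STORE setting itself. If I were relying on the stock settings plumbing, I believe the block would instead look something like this sketch:)

custom_settings = {
    'ITEM_PIPELINES': {
        'mock.pipelines.LocalJsonFilesPipeline': 200
    },
    'FILES_STORE': 'json_src',  # read by the stock FilesPipeline.from_settings()
    'JSON_FILENAME_REGEX': r'products\/(.+?)\/ProductInfo\+ProductDetails'
}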
Log with the level set to DEBUG:
C:\Users\Mike\Desktop\scrapy_test\pipeline_test>scrapy crawl testsite
2020-07-19 11:23:08 [scrapy.utils.log] INFO: Scrapy 2.2.1 started (bot: pipeline_test)
2020-07-19 11:23:08 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.6 (tags/v3.7.6:43364a7ae0, Dec 19 2019, 00:42:30) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Windows-7-6.1.7601-SP1
2020-07-19 11:23:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-07-19 11:23:08 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'pipeline_test',
 'LOG_STDOUT': True,
 'NEWSPIDER_MODULE': 'pipeline_test.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['pipeline_test.spiders']}
2020-07-19 11:23:08 [scrapy.extensions.telnet] INFO: Telnet Password: 0454b083dfd2028a
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled item pipelines:
['pipeline_test.pipelines.LocalJsonFilesPipeline']
2020-07-19 11:23:08 [scrapy.core.engine] INFO: Spider opened
2020-07-19 11:23:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-19 11:23:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-19 11:23:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.[testsite].com/robots.txt> (referer: None)
2020-07-19 11:23:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://[testsite]/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails> (referer: None)
2020-07-19 11:23:08 [stdout] INFO: url
2020-07-19 11:23:08 [stdout] INFO: https://[testsite]/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails
2020-07-19 11:23:08 [stdout] INFO: downloading to
2020-07-19 11:23:08 [stdout] INFO: json_src/[testsite]/prod6149174-product.json
2020-07-19 11:23:09 [scrapy.core.scraper] DEBUG: Scraped from <200 https://[testsite]/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails>
{'json_url': 'https://[testsite].com/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails',
 'local_json': [],
 'url': 'https://[testsite].com/store/c/nature-made-super-b-complex,-tablets/ID=prod6149174-product'}
2020-07-19 11:23:09 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-19 11:23:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 506,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 5515,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 0.468001,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 7, 19, 15, 23, 9, 96399),
 'item_scraped_count': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 14,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 7, 19, 15, 23, 8, 628398)}
2020-07-19 11:23:09 [scrapy.core.engine] INFO: Spider closed (finished)
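One detail that stands out in the log: local_json comes back as an empty list, which I take to mean item_completed received no successful download results. A method one could add to the LocalJsonFilesPipeline above to inspect the raw results (just a debugging sketch; the prints are illustrative):

    def item_completed(self, results, item, info):
        # results is a list of (success, file_info_or_failure) tuples;
        # an empty list means no media requests made it through the download stage
        for success, value in results:
            print('file result -> success:', success, 'value:', value)
        return super().item_completed(results, item, info)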