I want Scrapy to crawl local HTML files, but I am stuck because the response headers lack the Content-Type field. I followed the approach described in "Use Scrapy to crawl local XML file - Start URL local file address", so I am pointing Scrapy at local file URLs (for example, file:///Users/felix/myfile.html).
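For illustration, this is roughly how such file URLs can be built from local paths. It is only a sketch: the helper name `local_start_urls` is hypothetical, and the result would be assigned to a spider's `start_urls`.

```python
from pathlib import Path


def local_start_urls(paths):
    """Turn local file paths into file:// URLs (hypothetical helper)."""
    # Path.as_uri() needs an absolute path, so make each path absolute first.
    return [Path(p).absolute().as_uri() for p in paths]


# Example path taken from the question; the file does not need to exist here.
urls = local_start_urls(['/Users/felix/myfile.html'])
print(urls)
```

On macOS or Linux this should yield the same kind of URL as the one quoted above.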
However, the spider then fails, because (at least on macOS) the resulting response object does not seem to contain the Content-Type header that our code expects:
/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/felix/IdeaProjects/news-please/newsplease/__init__.py
[scrapy.core.scraper:158|ERROR] Spider error processing <GET file:///Users/felix/IdeaProjects/news-please/newsplease/0a2199bdcef84d2bb2f920cf042c5919> (referer: None)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/felix/IdeaProjects/news-please/newsplease/crawler/spiders/download_crawler.py", line 33, in parse
    if not self.helper.parse_crawler.content_type(response):
  File "/Users/felix/IdeaProjects/news-please/newsplease/helper_classes/parse_crawler.py", line 116, in content_type
    if not re.match('text/html', response.headers.get('Content-Type').decode('utf-8')):
AttributeError: 'NoneType' object has no attribute 'decode'
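The last frame shows that the failure comes from calling `.decode()` on a header that is `None`. One possible guard is sketched below. It assumes that a response without a Content-Type header may be treated as HTML, which is a judgment call and not something Scrapy prescribes. The function name `content_type_is_html` and the parameter `assume_html_if_missing` are hypothetical, and the `SimpleNamespace` objects are stand-ins for real Scrapy responses so that the snippet runs without Scrapy installed.

```python
import re
from types import SimpleNamespace


def content_type_is_html(response, assume_html_if_missing=True):
    """Return True when the Content-Type header starts with text/html.

    If the header is absent (as with file:// responses), return
    `assume_html_if_missing` instead of raising AttributeError.
    """
    raw = response.headers.get('Content-Type')
    if raw is None:
        # Assumption: no header means "trust the caller's default".
        return assume_html_if_missing
    # Scrapy returns header values as bytes; accept str as well.
    value = raw.decode('utf-8') if isinstance(raw, bytes) else raw
    return re.match('text/html', value) is not None


# Stand-ins for Scrapy responses: only `headers.get` is exercised.
local_file = SimpleNamespace(headers={})
web_page = SimpleNamespace(headers={'Content-Type': b'text/html; charset=utf-8'})
pdf = SimpleNamespace(headers={'Content-Type': b'application/pdf'})

print(content_type_is_html(local_file))  # header missing, default applies
print(content_type_is_html(web_page))
print(content_type_is_html(pdf))
```

This only avoids the exception. Whether a headerless local file really is HTML would still have to be decided some other way, for example from its content.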
Someone suggested running a simple HTTP server (see "Python Scrapy on offline (local) data"), but that is not an option for us, mainly because of the overhead of running another server.
I need to use Scrapy in the first place because we have a larger framework built on it, and we plan to add the ability to crawl local files to that framework. Since there are already several questions on SO about crawling local files (see the links above), I assume this problem is of general interest.