
I want to let Scrapy crawl local HTML files but am stuck because the headers lack the Content-Type field. I've followed the tutorial here: Use Scrapy to crawl local XML file - Start URL local file address. So basically, I am pointing Scrapy at local URLs such as file:///Users/felix/myfile.html
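
For context, the setup is roughly the following (a minimal sketch; the spider name and parse logic are made up for illustration):

import scrapy

class LocalFileSpider(scrapy.Spider):
    name = 'local_files'
    # Point Scrapy at files on disk instead of HTTP URLs
    start_urls = ['file:///Users/felix/myfile.html']

    def parse(self, response):
        # Plain Scrapy handles file:// URLs fine; the crash below comes from
        # news-please's own Content-Type check on the response headers
        yield {'title': response.css('title::text').extract_first()}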

However, Scrapy then crashes, because (on macOS) the resulting response object does not contain the required Content-Type field.

/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/felix/IdeaProjects/news-please/newsplease/__init__.py
[scrapy.core.scraper:158|ERROR] Spider error processing <GET file:///Users/felix/IdeaProjects/news-please/newsplease/0a2199bdcef84d2bb2f920cf042c5919> (referer: None)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/felix/IdeaProjects/news-please/newsplease/crawler/spiders/download_crawler.py", line 33, in parse
    if not self.helper.parse_crawler.content_type(response):
  File "/Users/felix/IdeaProjects/news-please/newsplease/helper_classes/parse_crawler.py", line 116, in content_type
    if not re.match('text/html', response.headers.get('Content-Type').decode('utf-8')):
AttributeError: 'NoneType' object has no attribute 'decode'

Someone suggested running a simple HTTP server (see Python Scrapy on offline (local) data), but that is not an option, mainly because of the overhead caused by running another server.
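
For reference, that workaround amounts to something like this (a sketch only; the port and paths are placeholders):

# Serve the directory over HTTP first, e.g. from a shell in /Users/felix:
#   python3 -m http.server 8000
# Then crawl via HTTP, which gives the response a proper Content-Type header:
start_urls = ['http://localhost:8000/myfile.html']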

I need to use Scrapy in the first place, as we have a larger framework that uses Scrapy. We plan to add the functionality to crawl local files to that framework. However, since there are several questions on SO about crawling local files (see the links above), I assume this problem is of general interest.


1 Answer


You can fork news-please (or patch your local copy) so that the function def content_type(self, response) in newsplease/helper_classes/parse_crawler.py always returns True when the response comes from local storage.

The patched function looks like this:

def content_type(self, response):
    """
    Ensures the response is of type text/html.

    :param obj response: The scrapy response
    :return bool: Determines whether the response is of the correct type
    """
    # Responses for file:// URLs carry no Content-Type header, so accept
    # them unconditionally instead of inspecting the header.
    if response.url.startswith('file:///'):
        return True
    if not re.match('text/html', response.headers.get('Content-Type').decode('utf-8')):
        self.log.warn("Dropped: %s's content is not of type "
                      "text/html but %s", response.url,
                      response.headers.get('Content-Type'))
        return False
    else:
        return True
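
If you would rather not touch news-please's helper, an alternative is a small Scrapy downloader middleware that fills in the missing header for file:// responses, so the existing Content-Type check passes unchanged. This is only a sketch; the class name and module path are made up for illustration:

class LocalFileContentTypeMiddleware:
    """Injects a Content-Type header for responses loaded from local files."""

    def process_response(self, request, response, spider):
        # file:// responses normally have no headers at all, so there is
        # nothing to preserve when we replace them
        if response.url.startswith('file://') and not response.headers.get('Content-Type'):
            # Response.replace() returns a copy of the response with the
            # given attributes overridden
            return response.replace(headers={'Content-Type': 'text/html'})
        return response

Enable it in settings.py (the module path is an assumption; 543 is an arbitrary mid-range priority):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.LocalFileContentTypeMiddleware': 543,
}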