
I have a Scrapy spider that looks for static HTML files on disk, using a file:/// address as the start URL, but I'm unable to load the gzipped files and loop through my directory of 150,000 files, which all have the .html.gz suffix. I've tried several different approaches (left commented out below), but nothing works so far. My code currently looks like this:

    from scrapy.spiders import CrawlSpider
    from Scrapy_new.items import Scrapy_newTestItem
    import gzip
    import glob
    import os.path

    class Scrapy_newSpider(CrawlSpider):
        name = "info_extract"
        source_dir = '/path/to/file/'
        allowed_domains = []
        start_urls = ['file://///path/to/files/.*html.gz']

        def parse_item(self, response):
            item = Scrapy_newTestItem()
            item['user'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div[2]/div/div[2]/div[1]/h1/span[2]/text()').extract()
            item['list_of_links'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div[2]/div/div[3]/div[3]/a/@href').extract()
            item['list_of_text'] = response.xpath('//*[@id="page-user"]/div[1]/div/div/div/div/div/div/a/text()').extract()

Running this gives the following error:

    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
        result = f(*args, **kw)
      File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/handlers/file.py", line 13, in download_request
        with open(filepath, 'rb') as fo:
    IOError: [Errno 2] No such file or directory: 'path/to/files/*.html'

Changing my code so that the files are first unzipped and then passed through as follows:

    source_dir = 'path/to/files/'
    for src_name in glob.glob(os.path.join(source_dir, '*.gz')):
        base = os.path.basename(src_name)
        with gzip.open(src_name, 'rb') as infile:
            #start_urls = ['/path/to/files*.html']#
            file_cont = infile.read()
            start_urls = file_cont#['file:////file_cont']

Gives the following error:

    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
        request = next(slot.start_requests)
      File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 70, in start_requests
        yield self.make_requests_from_url(url)
      File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
        return Request(url, dont_filter=True)
      File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
        self._set_url(url)
      File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 57, in _set_url
        raise ValueError('Missing scheme in request url: %s' % self._url)
    ValueError: Missing scheme in request url: %3C
user3191569
  • What is the error? Post your Scrapy logs. – Umair Ayub Mar 14 '17 at 03:28
  • @user3191569 what are you doing later with the content of those files? Remember that `scrapy` is about requests, so opening local files isn't really something `scrapy` should be used for. Can you share the whole (or an idea) code of the spider? – eLRuLL Mar 14 '17 at 15:30
  • @eLRuLL I've completed the items section, is this what you meant? – user3191569 Mar 19 '17 at 08:36

2 Answers


You don't always have to use start_urls in a Scrapy spider. Also, CrawlSpider is commonly used in conjunction with rules that specify which routes to follow and what to extract when crawling big sites, so you'll probably want to use scrapy.Spider directly instead of CrawlSpider.

Now, the solution relies on using the start_requests method that a scrapy spider offers, which handles the first requests of the spider. If this method is implemented in your spider, start_urls won't be used:

    from scrapy import Spider

    import gzip
    import glob
    import os

    class ExampleSpider(Spider):
        name = 'info_extract'

        def start_requests(self):
            os.chdir("/path/to/files")
            for file_name in glob.glob("*.html.gz"):
                f = gzip.open(file_name, 'rb')
                file_content = f.read()
                print file_content # now you are reading the file content of your local files

Now, remember that start_requests must return an iterable of requests, which isn't the case here, because you are only reading files (I assume you are going to create requests later with the content of those files), so my code will fail with something like:

    CRITICAL:
    Traceback (most recent call last):
      ...
    /.../scrapy/crawler.py", line 73, in crawl
        start_requests = iter(self.spider.start_requests())
    TypeError: 'NoneType' object is not iterable

This points out that I am not returning anything from my start_requests method (it returns None), which isn't iterable.
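
If the goal is to actually feed those local pages through Scrapy, one possibility is to decompress each archive to a temporary .html file and yield a file:// request for it, with the question's parse_item as the callback. This is only a rough sketch under that assumption; the temporary-file approach, the directory path, and the spider class name are placeholders, not something taken from the asker's project:

    import glob
    import gzip
    import os
    import tempfile

    from scrapy import Spider, Request

    class InfoExtractSpider(Spider):
        name = 'info_extract'
        source_dir = '/path/to/files'  # placeholder: directory holding the .html.gz archives

        def start_requests(self):
            for gz_path in glob.glob(os.path.join(self.source_dir, '*.html.gz')):
                # Decompress each archive into a temporary .html file so that
                # Scrapy's file:// handler can serve it like any other response.
                with gzip.open(gz_path, 'rb') as infile:
                    html = infile.read()
                tmp = tempfile.NamedTemporaryFile(suffix='.html', delete=False)
                tmp.write(html)
                tmp.close()
                yield Request('file://' + tmp.name, callback=self.parse_item)

        def parse_item(self, response):
            # The question's XPath extraction would go here.
            pass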

eLRuLL
  • Thank you for the answer, but I'm not quite sure how I would make `start_requests` iterable if I want to pass it on to a `Parse_items` method without something like `return (file_content, call_back=self.parse_item)`, which itself fails with `return (file_content, callback=self.parse_item) # ^ SyntaxError: invalid syntax` – user3191569 Mar 15 '17 at 10:09
  • that's exactly why I commented on your question to share your complete spider code, or to explain better what you are trying to do with the content of those files. Please update your question if necessary. – eLRuLL Mar 15 '17 at 15:06
  • I've completed the items section, is this what you meant? – user3191569 Mar 23 '17 at 13:51
  • If you are not doing external requests you don't need Scrapy; it looks like you are only checking local files. – eLRuLL Mar 23 '17 at 13:55
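
Following up on that last comment: if no network requests are involved, the same fields can be pulled out with plain Python plus Scrapy's Selector, with no crawl at all. A rough sketch under that assumption; the directory path and UTF-8 encoding are guesses, and the XPath is the first one from the question:

    import glob
    import gzip
    import os

    from scrapy.selector import Selector

    source_dir = '/path/to/files'  # placeholder: directory with the .html.gz archives

    for gz_path in glob.glob(os.path.join(source_dir, '*.html.gz')):
        with gzip.open(gz_path, 'rb') as infile:
            # Assumes the pages are UTF-8 encoded; adjust if they are not.
            sel = Selector(text=infile.read().decode('utf-8'))
        item = {
            'user': sel.xpath('//*[@id="page-user"]/div[1]/div/div/div[2]/div/div[2]/div[1]/h1/span[2]/text()').extract(),
        }
        print(item)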

Scrapy will not be able to deal with the compressed HTML files; you have to extract them first. This can be done on the fly in Python, or you can simply extract them at the operating-system level.
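
For instance, a minimal sketch of the up-front extraction in Python (the directory path is a placeholder; at the operating-system level, running gunzip over the directory achieves the same result):

    import glob
    import gzip
    import os

    source_dir = '/path/to/files'  # placeholder: directory with the .html.gz archives

    # Write a plain .html file next to each .html.gz so the spider can
    # point file:// start URLs at the decompressed copies.
    for gz_path in glob.glob(os.path.join(source_dir, '*.html.gz')):
        html_path = gz_path[:-3]  # strip the trailing '.gz'
        with gzip.open(gz_path, 'rb') as infile, open(html_path, 'wb') as outfile:
            outfile.write(infile.read())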

Related: Python Scrapy on offline (local) data

rfelten