
I started a CrawlSpider to crawl a category of an online shopping web page. There were about 760k items. After 11 hours I looked at the logs and realized that the spider had somehow been closed. It failed when the close_spider() function from the pipeline was called. Basically, my own implementation of close_spider() opens a connection to BigQuery and transfers the locally saved JSON Lines file to a BigQuery table. However, as I mentioned, it fails at this step.

I ran the close_spider() function manually and it successfully transferred the same saved JSON Lines file to BigQuery. By the way, there are about 466k lines in the file. I have also tried the same spider on a different category with 8k items, and it successfully transferred the feed file to BigQuery with no error message. I have come across this error twice; the first time I received it, the spider had scraped 700k items.

Here is the log file:

2019-06-11 23:18:12 [scrapy.extensions.logstats] INFO: Crawled 480107 pages (at 787 pages/min), scraped 466560 items (at 772 items/min)
2019-06-11 23:18:33 [scrapy.core.engine] INFO: Closing spider (finished)
2019-06-11 23:18:33 [scrapy.core.engine] ERROR: Scraper close failure
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/togayyazar/etsy/etsy/pipelines.py", line 20, in close_spider
    self.write_to_bq()
  File "/home/togayyazar/etsy/etsy/pipelines.py", line 30, in write_to_bq
    print("-----BIGQUERY-----")
OSError: [Errno 5] Input/output error
2019-06-11 23:18:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 217195256,
 'downloader/request_count': 480652,
 'downloader/request_method_count/GET': 480652,
 'downloader/response_bytes': 29983627714,
 'downloader/response_count': 480652,
 'downloader/response_status_count/200': 480373,
 'downloader/response_status_count/301': 254,
 'downloader/response_status_count/400': 6,
 'downloader/response_status_count/503': 19,
 'dupefilter/filtered': 358230,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 6, 11, 23, 18, 33, 739888),
 'httperror/response_ignored_count': 6,
 'httperror/response_ignored_status_count/400': 6,
 'item_scraped_count': 466833,
 'log_count/ERROR': 1,
 'log_count/INFO': 663,
 'memusage/max': 456044544,
 'memusage/startup': 61976576,
 'request_depth_max': 88,
 'response_received_count': 480379,
 'retry/count': 19,
 'retry/reason_count/503 Service Unavailable': 19,
 'scheduler/dequeued': 480652,
 'scheduler/dequeued/memory': 480652,
 'scheduler/enqueued': 480652,
 'scheduler/enqueued/memory': 480652,
 'start_time': datetime.datetime(2019, 6, 11, 12, 30, 12, 400853)}
2019-06-11 23:18:33 [scrapy.core.engine] INFO: Spider closed (finished)

And close_spider() function :

def close_spider(self, spider):
    self.file.close()
    self.write_to_bq()

def write_to_bq(self):
    print("-----BIGQUERY-----")
    bq=BigQuery()
    dataset_name=self.category

    if not bq.dataset_exists(dataset_name):
        bq.create_dataset(dataset_name) 

    path="/home/togayyazar/etsy/"+self.file_path
    table_name=self.date_time
    bq.load_table(
        path,
        table_name,
        dataset_name,
        'NEWLINE_DELIMITED_JSON',
    )

Any help will be appreciated.

1 Answer


If you look at the error trace, you will see that the exception was raised inside the print() call:

File "/home/togayyazar/etsy/etsy/pipelines.py", line 30, in write_to_bq
    print("-----BIGQUERY-----") OSError: [Errno 5] Input/output error

Check this thread to understand the problem: [Errno 5] Input/output error on print() typically means the process's standard output is no longer writable, for example because the terminal or SSH session the crawl was started from has gone away during the long run.

I suggest you simply remove the print or replace it with the logging module. The spider has a logger attribute you can use, but if you want a logger named after your pipeline you can do this:

import logging

class YourPipeline(object):

    def __init__(self):
        # Create a logger with the pipeline name
        self.logger = logging.getLogger(self.__class__.__name__) 

    def close_spider(self, spider):
        self.file.close()
        self.write_to_bq()

    def write_to_bq(self):
        self.logger.debug("-----BIGQUERY-----")
        # rest of your code
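
If you prefer not to create a separate logger object, the spider's built-in logger attribute works too. Here is a minimal sketch of the same pipeline methods using it (passing the spider into write_to_bq() is my assumption, not part of the original code):

def close_spider(self, spider):
    self.file.close()
    self.write_to_bq(spider)

def write_to_bq(self, spider):
    # spider.logger writes through Scrapy's logging system rather than
    # stdout, so it keeps working even if the terminal that launched
    # the crawl has gone away.
    spider.logger.info("-----BIGQUERY-----")
    # rest of your code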