
How can I configure Scrapy to write to the CSV file without delay?

If I run `scrapy crawl spider_1`, and say spider_1 is expected to yield 200 items, Scrapy writes them to the CSV file, but in batches. I don't know where to configure this.

I tried setting

CONCURRENT_REQUESTS = 1

CONCURRENT_ITEMS = 1

but it still writes to the CSV file in batches of 15+ items.

Here is the way I tested it:

while sleep 2; do cat results/price.csv | wc -l; done;

The result was:

   0
  35
  35
  35
  35
  52
  52
  52
  52
  70
  70

Notice that it writes the first 35 lines in one batch, then 17 more, then 18 more.

What I want is for each item to be written right after its data is scraped. How can I do that?

ji-ruh
  • Scrapy doesn't make synchronous requests. It sends lots of requests and waits for the responses, so you don't get sequential output. – Anurag Misra Sep 01 '17 at 11:46
  • You may want to have a look at how [`CsvItemExporter`](https://github.com/scrapy/scrapy/blob/dfe6d3d59aa3de7a96c1883d0f3f576ba5994aa9/scrapy/exporters.py#L206) is implemented, esp. `.export_item()`. – paul trmbrth Sep 01 '17 at 11:48
  • I had a similar problem. What I did was write all the data into MongoDB and then export it from there. – Anurag Misra Sep 01 '17 at 11:49
  • @AnuragMisra your suggestion actually makes sense. My current implementation is Scrapy + Django. The reason I don't want the delay is that I want to show an incrementing number (the CSV row count) in Angular. That way I can create a DRF endpoint pointing to a Django model. Thanks – ji-ruh Sep 01 '17 at 16:25
  • How is your pipeline implemented? Scrapy sends HTTP requests in order but processes responses out of order. According to the framework diagram, saving items should be handled in a pipeline. Make sure that your pipeline writes each item to the file immediately, and remember to flush the buffer. – rojeeer Sep 01 '17 at 16:44
  • 1
  • @rojeeer yes, I implemented it in a pipeline accordingly (roughly like the sketch after these comments). I used `CsvItemExporter`: in `spider_opened` I set up the exporter, in `spider_closed` I call `self.exporter.finish_exporting()`, and in `process_item` I call `self.exporter.export_item(item)` and also save the item to the Django database. When I run the crawler and check the Django database, it updates item by item, unlike the CSV file. – ji-ruh Sep 01 '17 at 16:56
  • @ji-ruh you can use `elasticsearch`; it is a real-time database, so it will help you. – Anurag Misra Sep 02 '17 at 05:52
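
For reference, a pipeline along the lines ji-ruh describes usually looks roughly like the sketch below (untested; the class name and output path are assumptions based on the question):

from scrapy import signals
from scrapy.exporters import CsvItemExporter


class PriceCsvPipeline(object):
    """Sketch of a pipeline that exports every item through CsvItemExporter."""

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signal=signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # CsvItemExporter expects a binary file object
        self.file = open('results/price.csv', 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        # The row lands in the exporter's write buffer here; it only reaches
        # disk once the buffer fills up or the file is closed, which is why
        # the CSV grows in batches.
        self.exporter.export_item(item)
        return item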

1 Answer


As I commented, when you write an item to the file it is not written to disk immediately; it stays in a buffer until the buffer is full or you flush it explicitly. Since you use `CsvItemExporter`, which does not flush the buffer after each item (see "csvwriter does not save to file, why"), you need to call `flush()` yourself if you do need this behaviour.
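
To see this buffering behaviour in plain Python, independent of Scrapy, consider this minimal sketch (the file name is just an example):

import csv

# Rows handed to csv.writer sit in the file object's buffer; they reach disk
# only when the buffer fills up, the file is closed, or flush() is called.
f = open('example.csv', 'w', newline='')
writer = csv.writer(f)
writer.writerow(['price', 'item'])  # most likely not on disk yet
f.flush()                           # forces the buffered row out to disk
f.close()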

One option is to extend `CsvItemExporter` and override its `export_item` method, e.g.:

from scrapy.exporters import CsvItemExporter


class MyCsvItemExporter(CsvItemExporter):
    def export_item(self, item):
        # Same logic as the stock CsvItemExporter.export_item() ...
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))
        self.csv_writer.writerow(values)
        # ... plus an explicit flush so each row hits the file right away
        self.stream.flush()

I haven't tested the code yet. There is also a topic about how Python flushes writes to a file that is worth reading.
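
If the CSV is produced by the feed export (`scrapy crawl spider_1 -o results/price.csv`) rather than by your own pipeline, you could register the subclass for the csv format instead. A sketch, assuming the class lives in a module `myproject/exporters.py` (the module path is a placeholder):

# settings.py
FEED_EXPORTERS = {
    'csv': 'myproject.exporters.MyCsvItemExporter',
}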

Hope it helps. Thanks.

rojeeer