
Hi, I am using Scrapy to fetch some HTML pages.

I have written my spider, and it fetches the required data from the pages in spider.py. In my pipeline.py file I want to write all the data to a CSV file that is created dynamically and named after the spider. Below is my pipeline.py code:

pipeline.py:

import csv
from datetime import datetime

from scrapy import log, signals
from scrapy.xlib.pydispatch import dispatcher


class examplepipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)

    def spider_opened(self, spider):
        log.msg("opened spider  %s at time %s" % (spider.name,datetime.now().strftime('%H-%M-%S')))
        self.exampleCsv = csv.writer(open("%s(%s).csv"% (spider.name,datetime.now().strftime("%d/%m/%Y,%H-%M-%S")), "wb"),
                   delimiter=',', quoting=csv.QUOTE_MINIMAL)
        self.exampleCsv.writerow(['Listing Name', 'Address','Pincode','Phone','Website'])           

    def process_item(self, item, spider):
        log.msg("Processing item " + item['title'], level=log.DEBUG)
        self.exampleCsv.writerow([item['listing_name'].encode('utf-8'),
                                    item['address_1'].encode('utf-8'),
                                    [i.encode('utf-8') for i in item['pincode']],
                                    item['phone'].encode('utf-8'),
                                    [i.encode('utf-8') for i in item['web_site']]
                                    ])
        return item 


    def spider_closed(self, spider):
        log.msg("closed spider %s at %s" % (spider.name,datetime.now().strftime('%H-%M-%S')))

Result:

--- <exception caught here> ---
  File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 133, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/xlib/pydispatch/robustapply.py", line 47, in robustApply
    return receiver(*arguments, **named)
  File "/home/local/user/example/example/pipelines.py", line 19, in spider_opened
    self.examplecsv = csv.writer(open("%s(%s).csv"% (spider.name,datetime.now().strftime("%d/%m/%Y,%H-%M-%S")), "wb"),
exceptions.IOError: [Errno 2] No such file or directory: 'example(27/07/2012,10-30-40).csv'

Here the spider name is example.

I don't understand what is wrong with the above code. It should create the CSV file dynamically with the spider name, but it raises the error shown above. Can anyone please tell me what is happening?

Shiva Krishna Bavandla

2 Answers

1

The problem is the forward slash (the directory separator) in your filename. A slash cannot be part of a file name, so the path is read as nested directories that do not exist. Try using some other character in the date.

More info here http://www.linuxquestions.org/questions/linux-software-2/forward-slash-in-filenames-665010/
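To see the effect in isolation, here is a minimal sketch. It assumes Python 3 (the question uses Python 2.7, where the same failure surfaces as IOError), and the helper name `try_create` and the temporary directory are made up for illustration.

```python
import os
import tempfile


def try_create(directory, filename):
    """Try to create `filename` inside `directory`.

    Returns None on success, or the OSError that open() raised.
    """
    path = os.path.join(directory, filename)
    try:
        with open(path, "w"):
            pass
        return None
    except OSError as exc:  # IOError is an alias of OSError in Python 3
        return exc


workdir = tempfile.mkdtemp()

# Each "/" is read as a directory separator, so open() looks for
# directories named "example(27" and "07" that do not exist.
slash_error = try_create(workdir, "example(27/07/2012,10-30-40).csv")

# Dashes are ordinary filename characters, so this file is created.
dash_error = try_create(workdir, "example(27-07-2012,10-30-40).csv")

print("with slashes:", slash_error)
print("with dashes:", dash_error)
```

The first call fails with "No such file or directory", the same Errno 2 as in the traceback, while the second one creates the file.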

This link is helpful for getting the format you want: How to print date in a regular format in Python?

>>> import datetime
>>> datetime.date.today()
datetime.date(2012, 7, 27)
>>> str(datetime.date.today())
'2012-07-27'

Use this in your code:

open("%s(%s).csv" % (spider.name, datetime.now().strftime("%d-%m-%Y:%H-%M-%S")), "wb")
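If it helps, the name building can be pulled into a small helper so it is easy to check on its own. This is only a sketch: `csv_filename` is a hypothetical name, and it keeps the comma between date and time from the question rather than the colon used above.

```python
from datetime import datetime


def csv_filename(spider_name, now=None):
    """Build '<spider>(<dd-mm-YYYY,HH-MM-SS>).csv' with no path separators."""
    if now is None:
        now = datetime.now()
    return "%s(%s).csv" % (spider_name, now.strftime("%d-%m-%Y,%H-%M-%S"))


# A fixed timestamp makes the result reproducible.
print(csv_filename("example", datetime(2012, 7, 27, 10, 30, 40)))
# example(27-07-2012,10-30-40).csv
```

In the pipeline this would be used as open(csv_filename(spider.name), "wb") on Python 2.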
Kamal
  • OK, then how can we create a file with the spider name and the date? – Shiva Krishna Bavandla Jul 27 '12 at 05:36
  • Give it a different date format string. http://docs.python.org/library/datetime.html#strftime-and-strptime-behavior – Babu Jul 27 '12 at 05:38
  • @kamal: I think the problem is not the slash, because when I gave just csv/%s, a CSV file with the spider name was created in the csv folder. The problem is with the date and time; I think we cannot create a CSV file with a date in its name. If it is possible, please let me know. – Shiva Krishna Bavandla Jul 27 '12 at 05:54
  • I replaced the forward slashes in the format string with dashes and it worked fine. – Lenna Jul 27 '12 at 06:03
  • @shiva It worked because you must already have a csv directory. open won't create directories for you. – Kamal Jul 27 '12 at 06:05
  • @kamal: Yes, of course. I found the solution: file names cannot contain slashes, which is why the error was displayed. I changed the date format to d-m-y and it worked. Thanks very much for your support. – Shiva Krishna Bavandla Jul 27 '12 at 06:09
0

As Kamal pointed out, the immediate issue is the forward slashes in the file name you create. Kamal's solution works, but I would fix it differently, with:

open("%s(%s).csv" % (spider.name, datetime.now().replace(microsecond=0).isoformat()), "wb")

The main thing here is the use of .isoformat() to put it in the ISO 8601 format:

YYYY-MM-DDTHH:MM:SS.mmmmmm

which has the advantage of being trivially sortable in increasing chronological order. The .replace(microsecond=0) call removes the microsecond information, in which case the trailing .mmmmmm is absent from the output of .isoformat(). You can drop the call to .replace() if you want to keep the microseconds. When I drop the microseconds, I design the rest of the application so that two invocations cannot create the same file.
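A quick sketch of what the two variants produce, and of the sorting property, using a fixed timestamp so the output is reproducible (the variable names are illustrative). Note that the ISO form contains colons, which is fine on Linux as in the question but is not a valid filename character on Windows.

```python
from datetime import datetime

stamp = datetime(2012, 7, 27, 10, 30, 40, 123456)

with_micro = stamp.isoformat()
without_micro = stamp.replace(microsecond=0).isoformat()

print(with_micro)     # 2012-07-27T10:30:40.123456
print(without_micro)  # 2012-07-27T10:30:40

# Plain string sorting of ISO 8601 names is also chronological sorting.
names = ["example(%s).csv" % datetime(2012, 7, day).isoformat()
         for day in (28, 3, 15)]
print(sorted(names))
```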

Also, you could drop your custom __init__ and rename spider_opened to open_spider, and spider_closed to close_spider. Scrapy will automatically call open_spider when a spider is opened and close_spider when a spider is closed. You do not have to hook onto the signals. The documentation mentions these methods as far back as Scrapy 0.7.
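Put together, the pipeline could look roughly like the sketch below. It rests on several assumptions: it targets Python 3 (text mode with newline="" instead of "wb" and .encode()), the class name `ExampleCsvPipeline` is made up, list-valued fields are joined with "; " as one possible choice, and the timestamp uses dashes in the time part so the name is also valid on systems that reject colons. The item field names are taken from the question.

```python
import csv
from datetime import datetime


class ExampleCsvPipeline(object):
    """Sketch of a CSV pipeline that uses open_spider/close_spider hooks."""

    def open_spider(self, spider):
        # Called by Scrapy when the spider is opened; no signal wiring needed.
        stamp = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
        self.filename = "%s(%s).csv" % (spider.name, stamp)
        self.file = open(self.filename, "w", newline="")
        self.writer = csv.writer(self.file, delimiter=",",
                                 quoting=csv.QUOTE_MINIMAL)
        self.writer.writerow(
            ["Listing Name", "Address", "Pincode", "Phone", "Website"])

    def process_item(self, item, spider):
        # 'pincode' and 'web_site' are lists in the question's items;
        # joining them into one cell is an assumption made for this sketch.
        self.writer.writerow([
            item["listing_name"],
            item["address_1"],
            "; ".join(item["pincode"]),
            item["phone"],
            "; ".join(item["web_site"]),
        ])
        return item

    def close_spider(self, spider):
        # Called by Scrapy when the spider is closed; flushes the file to disk.
        self.file.close()
```

Keeping a reference to the file object, rather than passing open(...) straight into csv.writer, is what makes it possible to close the file in close_spider.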

Louis