I am working with Scrapy; I have scraped a site and fetched all the information.

Actually I have 3 spiders with different data. I created these 3 spiders in the same project with the following structure:

scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        spider3.py

Now when I run a particular spider, I need to create a CSV file named after that spider through the pipeline, for example:

spider1.csv, spider2.csv, spider3.csv and so on (the spiders are not limited; there may be more). I want to create CSV files according to the number of spiders and their names.

Can we create more than one pipeline class in pipelines.py? Also, how do I create the CSV file with the spider name dynamically when more than one spider exists?

I have 3 spiders and I want to run all 3 at once (using scrapyd). When I run all 3 spiders, 3 CSV files with their spider names should be created. I also want to schedule these spiders to run every 6 hours. If something is wrong in my explanation, please correct me and let me know how to achieve this.
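For the scheduling part, scrapyd exposes an HTTP API, and its schedule.json endpoint queues a spider run. One way to run every spider every 6 hours is a crontab entry per spider; this is only a sketch, assuming scrapyd is running on localhost:6800 and the project was deployed under the name myproject:

```shell
# crontab fragment: at minute 0 of every 6th hour, schedule each spider
# via scrapyd's schedule.json endpoint
0 */6 * * * curl -s http://localhost:6800/schedule.json -d project=myproject -d spider=spider1
0 */6 * * * curl -s http://localhost:6800/schedule.json -d project=myproject -d spider=spider2
0 */6 * * * curl -s http://localhost:6800/schedule.json -d project=myproject -d spider=spider3
```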

Thanks in advance

Edit: for example, I am pasting my code for spider1.py only.

Code in spider1.py:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from myproject.items import Spider1Item

class firstspider(BaseSpider):
    name = "spider1"
    domain_name = "www.example.com"
    start_urls = [
        "http://www.example.com/headers/page-value"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        ........
        .......
        item = Spider1Item()
        item['field1'] = some_result
        item['field2'] = some_result
        .....
        .....
        return item

pipelines.py code:

import csv

class firstspider_pipeline(object):

    def open_spider(self, spider):
        # the spider object is available here, unlike in __init__,
        # so the filename can be taken from spider.name
        self.file = open('../%s.csv' % spider.name, 'wb')
        self.brandCategoryCsv = csv.writer(self.file,
                                           delimiter=',',
                                           quoting=csv.QUOTE_MINIMAL)
        self.brandCategoryCsv.writerow(['field1', 'field2', 'field3', 'field4'])

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.brandCategoryCsv.writerow([item['field1'],
                                        item['field2'],
                                        item['field3'],
                                        item['field4']])
        return item

As I stated before, when I run the above spider, a CSV file with the spider name will be created dynamically. But when I run the remaining spiders like spider2 and spider3, CSV files with their corresponding spider names should be generated as well.
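Since Scrapy passes the spider object to a pipeline's open_spider, process_item and close_spider methods, a single pipeline class can serve every spider by keying the filename on spider.name. Below is a minimal self-contained sketch of that idea (plain Python 3, with a made-up DummySpider standing in for a real Scrapy spider so the logic can be exercised without Scrapy; field names are illustrative):

```python
import csv
import os

class SpiderNameCsvPipeline(object):
    """Writes each spider's items to a file named <spider.name>.csv."""

    def open_spider(self, spider):
        # called once per spider run; the spider object supplies the filename
        self.file = open('%s.csv' % spider.name, 'w', newline='')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['field1', 'field2'])

    def process_item(self, item, spider):
        self.writer.writerow([item['field1'], item['field2']])
        return item

    def close_spider(self, spider):
        self.file.close()

# exercising the pipeline without Scrapy, using a stand-in spider object
class DummySpider(object):
    def __init__(self, name):
        self.name = name

pipeline = SpiderNameCsvPipeline()
for name in ('spider1', 'spider2'):
    spider = DummySpider(name)
    pipeline.open_spider(spider)
    pipeline.process_item({'field1': 'a', 'field2': 'b'}, spider)
    pipeline.close_spider(spider)
```

After this runs, spider1.csv and spider2.csv each contain a header row plus one data row, with no per-spider pipeline classes needed.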

  1. Is the above code enough for this functionality?

  2. Do we need to create another pipeline class to create another CSV file? (Is it possible to create more than one pipeline class in a single pipelines.py file?)

  3. If we create multiple pipeline classes in a single pipelines.py file, how do we match a particular spider to its related pipeline class?
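On point 3: pipelines.py can hold any number of pipeline classes (all of them are enabled via the ITEM_PIPELINES setting, and each receives every item). Because process_item receives the spider, a common pattern is for each pipeline class to simply pass through items from spiders it does not handle, by checking spider.name. A sketch with made-up class names:

```python
class Spider1Pipeline(object):
    def process_item(self, item, spider):
        if spider.name != 'spider1':
            return item          # not ours: pass the item through untouched
        item['handled_by'] = 'Spider1Pipeline'
        return item

class Spider2Pipeline(object):
    def process_item(self, item, spider):
        if spider.name != 'spider2':
            return item
        item['handled_by'] = 'Spider2Pipeline'
        return item

# stand-in spider object to exercise the dispatch without Scrapy
class DummySpider(object):
    def __init__(self, name):
        self.name = name

item = {}
for pipeline in (Spider1Pipeline(), Spider2Pipeline()):
    item = pipeline.process_item(item, DummySpider('spider2'))
# only Spider2Pipeline modified the item
```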

I want to achieve the same functionality when saving to a database: when I run spider1, all of spider1's data should be saved to a database table named after the spider. Each spider has different SQL queries (so I may need to write different pipeline classes).

  1. The intention is that when we run multiple spiders all at once (using scrapyd), multiple CSV files should be generated with their spider names, and multiple tables should be created with the spider names (when saving to a database).
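The same spider.name technique covers the database case: create (or select) a table named after the spider in open_spider. The sketch below uses sqlite3 as a stand-in for the real database and a fixed two-column schema as a placeholder for the per-spider SQL; spider.name comes from your own code, which is why interpolating it into the statement is tolerable here:

```python
import sqlite3

class SpiderTablePipeline(object):
    """Inserts each spider's items into a table named after the spider."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect('scraped.db')
        # table name taken from the (trusted) spider; schema is a placeholder
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS %s (field1 TEXT, field2 TEXT)' % spider.name)

    def process_item(self, item, spider):
        self.conn.execute('INSERT INTO %s VALUES (?, ?)' % spider.name,
                          (item['field1'], item['field2']))
        return item

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

# stand-in spider object to exercise the pipeline without Scrapy
class DummySpider(object):
    def __init__(self, name):
        self.name = name

pipeline = SpiderTablePipeline()
spider = DummySpider('spider1')
pipeline.open_spider(spider)
pipeline.process_item({'field1': 'a', 'field2': 'b'}, spider)
pipeline.close_spider(spider)

conn = sqlite3.connect('scraped.db')
rows = conn.execute('SELECT * FROM spider1').fetchall()
```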

Sorry if I am wrong anywhere. I hope it is well explained; if not, please let me know.

Lain
Shiva Krishna Bavandla

1 Answer

You are generally on the right track.

But there are some points I can immediately point out:

  1. You probably don't need (= shouldn't use) a class! Python is not Java. If your class consists of only 2 methods and the first one is the __init__ method, you almost certainly don't need a class; a function would do just fine. Less clutter = better code!

  2. SO isn't the right place for a general code review. Try Code Review instead. SO users are a (mostly) friendly and helpful bunch, but they don't like to write your code. They like to explain, advise and correct. So try to implement your application, and if you get into trouble you can't solve yourself, come back and ask for advice. As said above, you are conceptually on the right track; just try to implement it.

  3. You seem to have a misunderstanding of the class concept, at least as far as Python classes are concerned:

    1. You don't need a BaseSpider class as far as I can see. What would be the difference between the base class and the subclasses? Deriving classes doesn't make your program OO, or better, or whatever. Search for Liskov's substitution principle to get a general understanding of when a subclass may be appropriate in Python. (It's somewhat reverse logic, but it's one of the fastest ways to see whether you should subclass or change your approach.)

    2. There is a distinct difference between Python class variables, which are declared immediately after the class declaration, and instance variables, which are initialized in the __init__ method. Class variables are SHARED between all instances of a class, whereas instance variables are private to the individual instances. You almost never want to use class variables; they amount to a singleton pattern, something you want to avoid in most cases because it causes headaches and grief in debugging.
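The shared-state pitfall described in point 2 can be demonstrated in a few lines (all names here are illustrative):

```python
class Spider(object):
    trap = []                 # class variable: ONE list shared by all instances

    def __init__(self, name):
        self.name = name      # instance variable: private to each instance
        self.urls = []        # a fresh list per instance

a = Spider('spider1')
b = Spider('spider2')
a.trap.append('x')            # mutates the shared class-level list
a.urls.append('http://example.com')

print(b.trap)                 # ['x']  -- leaked into the other instance
print(b.urls)                 # []    -- the instance variable stayed private
```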

Therefore I would modify your Spider class like this:

class Spider(object):
    def __init__(self, name, url=None):
        self.name = name
        self.domain_name = url
        self.start_urls = [url]
        ...

crawlers = [Spider('spider %s' % i) for i in xrange(4)]  # creates a list of 4 spiders

But maybe you are using a declarative metaclass approach; I can't tell that from your posted code.

If you want to run your crawlers in parallel, you could consider the threading module. It's suited to concurrent I/O operations, as opposed to the multiprocessing module, which is meant for parallel computing.
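A minimal sketch of running several blocking I/O jobs concurrently with threading; the crawl function is a stand-in for real spider work (note that Scrapy itself already runs requests concurrently on top of Twisted, and scrapyd can run multiple spiders for you, so hand-rolled threads are rarely needed there):

```python
import threading

results = {}
lock = threading.Lock()

def crawl(name):
    # stand-in for blocking I/O work (a real crawler would fetch pages here)
    with lock:
        results[name] = 'done'

threads = [threading.Thread(target=crawl, args=('spider%d' % i,))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()           # wait for all crawls to finish

print(sorted(results))  # ['spider0', 'spider1', 'spider2']
```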

You are conceptually on the right track. Break your project into small pieces and come back every time you run into an error you can't solve.

Just don't expect a complete answer to a question like: "I want to recreate Google; how can I do it in the best way and in the shortest time?" ;-)

Don Question