I'm building a scraper with the Scrapy framework in order to scrape a webshop. This webshop has several categories and subcategories.
I have already finished the spider and it works like a charm. Right now I run it by setting the start_urls = [] attribute of the spider (a BaseSpider) to the subcategory I want to scrape.
Now I'm at the step of deploying the spiders so they run on a regular basis. I know about scrapyd and have already installed it.
But I don't know what the right way is to architect the spiders for deployment. I see two options:
1. Make one spider for each subcategory of the webshop, which means each spider gets its own start_urls/rules.
2. Use a single spider with many start URLs, one per subcategory, which I would get from a file, a database, whatever (see the sketch right after this list).
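For option 2, what I have in mind is roughly this: one generic spider that receives its subcategory through spider arguments, which both scrapy crawl -a and scrapyd's schedule.json accept. The spider name and argument names below are just placeholders:

from scrapy.spider import BaseSpider


class WebshopSpider(BaseSpider):
    """One generic spider; the subcategory is passed in at schedule time."""
    name = "webshop"

    def __init__(self, subcat_url=None, categorie=None, *args, **kwargs):
        super(WebshopSpider, self).__init__(*args, **kwargs)
        # build start_urls at runtime instead of hard-coding one list per spider
        self.start_urls = [subcat_url] if subcat_url else []
        self.categorie = categorie

    def parse(self, response):
        # ... the parsing logic I already have ...
        pass

A run for one subcategory would then be scheduled with something like curl http://localhost:6800/schedule.json -d project=myproject -d spider=webshop -d subcat_url=http://... -d categorie=tv, or locally with scrapy crawl webshop -a subcat_url=http://... -a categorie=tv.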
Another problem: I currently use a single pipeline that stores the items in a MySQL database. Since I would need one table per subcategory, it seems I would also need one pipeline per spider, because the SQL queries are written inside the pipeline.
And if I decide to make one spider per subcategory, that means writing a lot of pipeline classes, and I'm not sure coupling spiders and pipelines that tightly is a good idea (a sketch of the alternative I'm considering follows the pipeline code below). Here is my pipeline class:
from datetime import datetime
from hashlib import md5

from scrapy import log
from twisted.enterprise import adbapi


class MySQLStorePipeline(object):
    """A pipeline to store the item in a MySQL database.

    This implementation uses Twisted's asynchronous database API.
    """

    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8',
            use_unicode=True,
            port=3306,
            init_command='SET NAMES UTF8',
        )
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)
        return cls(dbpool)

    def process_item(self, item, spider):
        # run the db query in the thread pool
        d = self.dbpool.runInteraction(self._do_upsert, item, spider)
        d.addErrback(self._handle_error, item, spider)
        # at the end return the item in case of success or failure
        d.addBoth(lambda _: item)
        # return the deferred instead of the item. This makes the engine
        # process the next item (according to the CONCURRENT_ITEMS setting)
        # after this operation (deferred) has finished.
        return d

    def _do_upsert(self, conn, item, spider):
        """Perform an insert or update."""
        guid = self._get_guid(item)
        now = datetime.utcnow().replace(microsecond=0).isoformat(' ')
        conn.execute("""SELECT EXISTS(
            SELECT 1 FROM vtech WHERE guid = %s
        )""", (guid, ))
        ret = conn.fetchone()[0]
        if ret:
            conn.execute("""
                UPDATE vtech
                SET prix=%s, stock=%s, updated=%s
                WHERE guid=%s
            """, (item['prix'], item['stock'], now, guid))
            print '------------------------'
            print 'Data updated in Database'
            print '------------------------'
        else:
            conn.execute("""
                INSERT INTO vtech (guid, nom, url, prix, stock, revendeur, livraison, img, detail, bullet, categorie, updated, created)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            """, (guid, item['nom'], item['url'], item['prix'], item['stock'],
                  item['revendeur'], item['livraison'], item['img'], item['detail'],
                  item['bullet'], item['categorie'], now, 0))
            print '------------------------'
            print 'Data Stored in Database'
            print '------------------------'

    def _handle_error(self, failure, item, spider):
        """Handle an error that occurred during the db interaction."""
        # do nothing, just log
        log.err(failure)

    def _get_guid(self, item):
        """Generate a unique identifier for a given item."""
        # hash based solely on the url field
        return md5(item['url']).hexdigest()
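The alternative I'm considering, to avoid one pipeline class per spider, is to keep this single pipeline and pick the table per item instead. A rough sketch, where the categorie values and table names are just examples:

# hypothetical mapping from the item's "categorie" field to a table name
TABLE_BY_CATEGORIE = {
    'tv': 'vtech_tv',
    'photo': 'vtech_photo',
}
DEFAULT_TABLE = 'vtech'


def table_for_item(item):
    """Return the MySQL table this item should be written to."""
    return TABLE_BY_CATEGORIE.get(item.get('categorie'), DEFAULT_TABLE)

_do_upsert would then interpolate the whitelisted table name into the query string (e.g. "UPDATE %s SET prix=%%s ..." % table_for_item(item)) while the values stay as real query parameters, since MySQLdb cannot parameterize identifiers.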
I had a look at ConfigParser for making a config.ini to manage the start URLs and the SQL queries, but it feels like an ugly hack to code.
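If I go the config-file route anyway, the least ugly version I came up with is a small JSON file read in start_requests; the filename and structure below are just an example:

import json

from scrapy.http import Request
from scrapy.spider import BaseSpider


class WebshopSpider(BaseSpider):
    name = "webshop"

    def start_requests(self):
        # e.g. {"tv": {"url": "http://...", "table": "vtech_tv"}, ...}
        with open('subcats.json') as f:
            subcats = json.load(f)
        for categorie, conf in subcats.items():
            # default callback is self.parse; meta carries the subcategory
            # down to the callback via response.meta
            yield Request(conf['url'],
                          meta={'categorie': categorie, 'table': conf['table']})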
I have already read a lot of SO posts, but I never found a question about the design/structure of a scraper, and it is still unclear to me which path to take for production Scrapy spiders in the long term.
Thanks for your answers.
Here are some posts I read on the topic: