So I'm working on a web-scraping project that pulls a bunch of product information (like price, location, name, etc.) from a list of 20+ websites. So far I have created a generic MasterSpider (similar to what is discussed here: Creating a generic scrapy spider), from which I inherit and override depending on each site's specific architecture.
However, after repeating a lot of code and wanting to make this project scalable, I have started working towards generalizing my MasterSpider so that it can be extended to other websites and, ideally, instantiated with minimal arguments, like just the start_url. In other words, instead of locating elements by XPaths, which are not consistent across domains, I am now looking for HTML tag attribute values/text values.
This works fine for generic/consistent targets, like identifying the category links from the start page (which typically contain the category name in the link), but for things like finding the product name, price, etc. it falls short. Having to build out a list of XPath conditions (like @class = a or b or c, or contains(.,'a') or contains(.,'b')) kind of defeats the purpose.
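Just to illustrate what I mean, the category-link part of the approach would look roughly like this (category_keywords and the default keyword list are placeholder names for illustration, not anything site-specific):

```python
import scrapy


class MasterSpider(scrapy.Spider):
    """Generic spider sketch: category links are found by matching keywords
    in the link text or href instead of site-specific XPaths."""

    name = "master"

    def __init__(self, start_url=None, category_keywords=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url] if start_url else []
        # Placeholder keywords that category links are assumed to contain.
        self.category_keywords = category_keywords or ["category", "products", "shop"]

    def parse(self, response):
        # Check every anchor on the start page; follow the ones whose text
        # or href mentions one of the expected category keywords.
        for link in response.xpath("//a[@href]"):
            href = link.xpath("@href").get("")
            text = " ".join(link.xpath(".//text()").getall()).lower()
            if any(kw in href.lower() or kw in text for kw in self.category_keywords):
                yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Site-specific subclasses override this to walk listings/products.
        pass
```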
I realize I could also pass a few XPath conditions when instantiating the spider, which I may just have to do, but I would prefer to make this as easy to use and extensible as possible.
My idea is, before parsing the individual product pages, to issue a dummy request that looks for the information I would like and works backward to identify the XPath of that information, which is then used in the subsequent requests.
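For the "work backward to the XPath" part, something like this is what I have in mind, using lxml (discover_xpath and the sample value are just names I'm making up for illustration):

```python
from lxml import html


def discover_xpath(page_html, target_text):
    """Return an absolute XPath for the first element whose text contains
    target_text (e.g. a product name or price known from a sample page)."""
    tree = html.fromstring(page_html)
    # Match any element that has a text node containing the target string.
    matches = tree.xpath("//*[text()[contains(normalize-space(.), $target)]]",
                         target=target_text)
    if not matches:
        return None
    # getpath() gives a positional XPath like /html/body/div[2]/span;
    # it is only as stable as the page layout, so it is a starting point,
    # not a guaranteed selector.
    return tree.getroottree().getpath(matches[0])


# Hypothetical usage with one value known from a sample product page:
# xpath = discover_xpath(response.text, "Acme Widget 3000")
```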
So I was wondering if anyone had any good ideas on how to extract the XPath of an element given, let's say, a list of tag values it could contain, or matching text within it. I realize a series of try/except blocks could work, but again that would be more of a band-aid than a solution, and not very scalable. If I have to use something like Selenium or a separate parser to do this, that is also an option.
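Plugged into a spider, the two-phase flow I'm imagining would be roughly the following (CalibratingSpider, sample_product_url, and known_values are placeholder names; discover_xpath is the helper sketched above):

```python
import scrapy


class CalibratingSpider(scrapy.Spider):
    """Two-phase sketch: a first 'calibration' request works out the XPaths
    for each field from known sample values, then later product requests
    simply reuse those XPaths."""

    name = "calibrating"

    def __init__(self, start_url=None, sample_product_url=None, known_values=None, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [start_url] if start_url else []
        self.sample_product_url = sample_product_url
        # Known values from one sample page, e.g. {"name": "Acme Widget", "price": "19.99"}.
        self.known_values = known_values or {}
        self.field_xpaths = {}

    def start_requests(self):
        # Phase 1: fetch one sample product page whose field values are known.
        yield scrapy.Request(self.sample_product_url, callback=self.calibrate)

    def calibrate(self, response):
        # Work backward from each known value to an XPath that locates it
        # (discover_xpath is the helper sketched above).
        for field, value in self.known_values.items():
            self.field_xpaths[field] = discover_xpath(response.text, value)
        # Phase 2: crawl the real pages, reusing the discovered XPaths.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_product)

    def parse_product(self, response):
        yield {field: response.xpath(xp).get()
               for field, xp in self.field_xpaths.items() if xp}
```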
Really open to any ideas or fresh perspectives.
Thanks!