
So I'm working on a web-scraping project that pulls product information (price, location, name, etc.) from a list of 20+ websites. So far I have created a generic MasterSpider (similar to what is discussed here: Creating a generic scrapy spider), from which I inherit and override depending on the site's specific architecture.

However, after repeating a lot of code and wanting to make this project scalable, I have started working towards generalizing my MasterSpider so that it can be extended to other websites and, ideally, instantiated with minimal arguments, like just the start_url. In other words, instead of locating elements by XPath, which is not consistent across domains, I am now looking for HTML tag attribute values/text values.

This works fine for generic/consistent targets, like identifying the category links from the start page (which typically contain the category in the link), but for things like finding the product name, price, etc. it falls short. Having to build out a list of XPath conditions (like @class = a or b or c / contains(.,'a') or contains(.,'b')) kind of defeats the purpose.
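To illustrate the kind of attribute/text-based matching I mean, something like the simplified sketch below (CATEGORY_HINTS and the function name are made up for the example):

    # Simplified sketch of attribute/text-based matching for category links.
    # CATEGORY_HINTS is an illustrative example list, not part of the real project.
    CATEGORY_HINTS = ["category", "dept", "shop"]

    def extract_category_links(response):
        return [
            href
            for href in response.xpath("//a/@href").getall()
            if any(hint in href.lower() for hint in CATEGORY_HINTS)
        ]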

I realize I could also pass a few XPath conditions when instantiating the spider, which I may just have to do, but I would prefer to make this as easy to use and extensible as possible...

My idea is, before parsing the individual product pages, to issue a dummy request that looks for the information I want and works backward to identify the XPath of that information, which is then used in the subsequent requests.
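Something along these lines is what I have in mind (a rough sketch using lxml, which Scrapy already depends on; getpath() returns a positional, absolute XPath for a matched element):

    from lxml import html

    def discover_xpaths(page_source, known_text):
        """Given a sample value known to appear on the page (e.g. a product
        name), return candidate XPaths for the elements containing it."""
        tree = html.fromstring(page_source)
        # $val is an XPath variable, bound via the keyword argument below
        matches = tree.xpath("//*[contains(text(), $val)]", val=known_text)
        root = tree.getroottree()
        return [root.getpath(el) for el in matches]

The returned paths are positional, so they would still need sanity-checking across a few pages before being trusted.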

So I was wondering if anyone had good ideas on how to extract the XPath of an element given, let's say, a list of tag values it could contain, or matching text within... I realize a series of try-catches could work, but again that would be more of a band-aid than a solution, and not very scalable. If I have to use something like Selenium or a parser to do this, that is also an option...

Really open to any ideas or fresh perspectives.

Thanks!

1 Answer


At work, I have to scrape thousands of news websites, and as you might expect, there is no one-size-fits-all solution. So our strategy was to have a "generic" method that, through heuristics, would try to extract the needed information, and for troublesome websites we would keep a list of site-specific xpaths.

So our general structure is something like this:

from urllib.parse import urlparse

parsers = {
    "domain1": {
        "item1": "//div...",
        "item2": "//div...",
    },
    "domain2": {
        "item1": "//div...",
        "item2": "//div...",
    },
}

def parse(self, response):
    # netloc gives us the domain of the page being parsed
    domain = urlparse(response.url).netloc
    try:
        # self.parsers is the dict above, attached to the spider
        parser = self.parsers[domain]
        return self.parse_with_parser(response, parser)
    except KeyError:
        # no site-specific xpaths known: fall back to the heuristics
        return self.parse_generic(response)
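For illustration, a minimal parse_generic might look something like the sketch below; the rules shown (og:title meta first, then a couple of common fallbacks) are just examples of the kind of heuristics involved:

    def parse_generic(self, response):
        # Heuristic fallback: try a few common locations for a title,
        # in decreasing order of reliability (illustrative rules only)
        title = (
            response.xpath("//meta[@property='og:title']/@content").get()
            or response.xpath("//h1/text()").get()
            or response.xpath("//title/text()").get()
        )
        yield {"url": response.url, "title": title.strip() if title else None}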

I actually keep the parsers dict in a separate file. You could also keep it in a database or a file and load it when the spider starts, so you do not have to edit the crawler every time you need to change something.
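For example, one way to load it from a JSON file when the spider starts (the filename and layout here are just an assumption):

    import json
    from pathlib import Path

    import scrapy

    class MasterSpider(scrapy.Spider):
        name = "master"

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Hypothetical parsers.json next to the spider module, holding
            # the same domain -> {item: xpath} mapping shown above
            parsers_file = Path(__file__).parent / "parsers.json"
            self.parsers = json.loads(parsers_file.read_text())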

Edit:

Answering the second part of your question: depending on what you need done, you can write XPath expressions that take several conditions into account. For instance:

"//a[contains(@class, 'foo') or contains(@class, 'bar')]"

Maybe even

"//a[contains(@class, 'foo') or contains(@class, 'bar')] | //div[@class='something'] | //td/span"

The union operator "|" lets you chain different expressions that might contain what you want extracted; it effectively acts as an OR across the expressions, returning the nodes matched by any of them.
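A quick illustration with parsel (the selector library Scrapy uses under the hood; the sample HTML is made up):

    from parsel import Selector

    sel = Selector(text='<a class="foo bar">link</a><div class="something">9.99</div>')
    # The union returns nodes matched by any alternative, in document order
    sel.xpath("//a[contains(@class, 'foo')] | //div[@class='something']").getall()
    # ['<a class="foo bar">link</a>', '<div class="something">9.99</div>']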

Henrique Coura
  • so when you say "through heuristics would try to extract the needed information", can you be more specific? Like in your parse_generic method: how are you extracting info without the XPath? Are you just looking for HTML tags that contain a certain attribute value/text, or is it more complicated than that? –  Jul 13 '17 at 14:37
  • These are very domain-specific and come from a lot of testing/trial and error. For instance, for extracting the article title, I will look in several places (title tag, og:title meta, try some xpaths, some tags) and have a set of rules for when I should believe I actually have the correct title – Henrique Coura Jul 13 '17 at 14:47
  • I have also edited my answer with something that might suit your needs, and is less complicated/hard to achieve than good heuristics – Henrique Coura Jul 13 '17 at 14:53
  • Much appreciated, this is along the lines of what I was thinking. I wanted to know how hard the heuristic approach would be, and it seems I should put that on the back-burner. –  Jul 13 '17 at 14:58