
I'm new to scraping. I've written a scraper that scrapes the Maplin store, using Python as the language and BeautifulSoup for the parsing.

If I need to scrape some other eCommerce store (say Amazon or Flipkart), do I have to customize my code, since they have a different HTML schema (the id and class names are different, among other things)? As it stands, the scraper I wrote will not work for another eCommerce store.

I want to know how price-comparison sites scrape data from all the online stores. Do they have different code for each online store, or is there a generic one? Do they study the HTML schema of every online store?

– Praful Bagai

2 Answers


do I need to customize my code

Yes, sure. It is not only because websites have different HTML schemas. It is also about the mechanisms involved in loading and rendering the page: some sites use AJAX to load partial content, others let JavaScript fill out placeholders on the page, which makes them harder to scrape; there can be lots and lots of differences. Others use anti-web-scraping techniques: checking your headers and your behavior, banning you after you hit the site too often, and so on.
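
To make the headers-and-behavior point concrete, here is a minimal sketch (placeholder URLs; illustration only) of a fetch loop that sends browser-like headers and paces its requests:

import time

import requests

# Browser-like headers: some sites inspect the User-Agent and other headers
# and serve an error page (or a captcha) to obvious robots.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}

for url in ["https://example.com/product/1", "https://example.com/product/2"]:
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    # ... hand response.text to BeautifulSoup here ...
    time.sleep(2)  # pace the requests so the site doesn't ban you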

I've also seen cases where prices were kept as images, or obfuscated with "noise": different tags nested inside one another and hidden using various techniques, such as CSS rules, classes, JS code, "display: none", etc. For an end-user in a browser the data looked normal, but for a web-scraping "robot" it was a mess.
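
Here is a toy example of that kind of markup (invented for illustration) and the extra filtering a scraper needs to read the visible price:

from bs4 import BeautifulSoup

# Decoy digits are interleaved with the real ones and hidden via CSS;
# a naive get_text() call returns garbage.
html = """
<div class="price">
  <span>1</span>
  <span style="display: none">7</span>
  <span>9</span>
  <span style="display: none">3</span>
  <span>.99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
price = soup.find("div", class_="price")

print(price.get_text(strip=True))  # "1793.99" - not the real price

# Drop the hidden spans, keep what an end-user actually sees: "19.99"
visible = [
    span.get_text()
    for span in price.find_all("span")
    if "display: none" not in span.get("style", "")
]
print("".join(visible))

In real cases the hiding is rarely this obvious (it is often done through CSS classes or JS), so this kind of filtering logic gets site-specific very quickly.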

I want to know how price-comparison sites scrape data from all the online stores

Usually, they use APIs whenever possible. But if there is no API, web scraping and HTML parsing are always an option.
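
For illustration, a hypothetical API call (the endpoint, fields, and auth scheme are all invented; real store APIs have their own schemas and authentication):

import requests

# Reading structured data from an API is far more robust than parsing HTML.
response = requests.get(
    "https://api.example-store.com/v1/products/12345",
    headers={"Authorization": "Bearer <token>"},
    timeout=10,
)
response.raise_for_status()
product = response.json()
print(product["name"], product["price"])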


The general high-level idea is to split the scraping code into two main parts. The static part is a generic web-scraping spider (the logic) that reads the parameters or configuration passed in. The dynamic part, an annotator or site-specific configuration, is usually a set of field-specific XPath expressions or CSS selectors.

See, as an example, the Autoscraping tool provided by Scrapinghub:

Autoscraping is a tool to scrape web sites without any programming knowledge. You just annotate web pages visually (with a point and click tool) to indicate where each field is on the page and Autoscraping will scrape any similar page from the site.

And, FYI, study what Scrapinghub offers and documents: there is a lot of useful information there and a set of unique web-scraping tools.


I've personally been involved in a project where we were building a generic Scrapy spider. As far as I remember, we had a "target" database table where records were inserted by a browser extension (the annotator); the field annotations were kept as JSON:

{
    "price": "//div[@class='price']/text()",  
    "description": "//div[@class='title']/span[2]/text()"
}

The generic spider received a target id as a parameter, read the configuration, and crawled the website.
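
A rough sketch of that setup in Scrapy terms (the config file layout here, with a start_urls list next to the field XPaths, is an assumption made for the example; in the actual project the annotations lived in a database table):

import json

import scrapy


class GenericSpider(scrapy.Spider):
    # The spider logic is static; everything site-specific comes from
    # the JSON configuration produced by the annotator.
    name = "generic"

    def __init__(self, config_path, *args, **kwargs):
        super().__init__(*args, **kwargs)
        with open(config_path) as f:
            self.config = json.load(f)
        self.start_urls = self.config["start_urls"]

    def parse(self, response):
        # Apply each annotated XPath expression to the page.
        yield {
            field: response.xpath(xpath).get()
            for field, xpath in self.config["fields"].items()
        }

You would run it with something like `scrapy runspider generic_spider.py -a config_path=target.json`, pointing config_path at a different file for each site.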

We had a lot of problems staying on the generic side. As soon as a website involved JavaScript and AJAX, we started writing site-specific logic to get to the desired data.


– alecxe
  • I found another SO link somewhat related to this question, hence thought of sharing. Thanks anyway :) – Praful Bagai Dec 27 '14 at 23:38
  • Just a bit more knowledge, please. While scraping an online store, does the scraper scrape each and every product? I mean, there are 10+ million products listed on Amazon, 1+ million on Flipkart, and so on. I've heard that price-comparison sites crawl/scrape the stores on a daily basis. Do they scrape each and every product? Wouldn't it take ages to scrape every product? Your views please – Praful Bagai Dec 28 '14 at 22:07
  • @user1162512 It really depends on the web-scraping store/service. It is not that big of a deal if you think about parallelizing the web scraping using multiple `scrapyd` instances on multiple servers, storing the data in sharded mongodb servers, etc. It is basically a scaling problem that, in the end, usually depends on how much money you have :). Unfortunately, I can't give you more information, since I don't have enough experience in massive daily web scraping. – alecxe Dec 29 '14 at 20:37

Many price-comparison scrapers run the product search on the vendor's site when a user indicates they wish to track the price of something. Once the user selects what they are interested in, it is added to a global cache of products that can then be scraped periodically, rather than the scraper having to trawl the whole site on a frequent basis.
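
A minimal sketch of that cache-then-poll idea (all names invented for illustration):

# Products enter a shared cache when a user starts tracking them; only
# the cache is re-scraped on a schedule, not the whole store.
tracked = {}  # product URL -> last known price


def track(url):
    # A user asked to watch this product: add it to the global cache.
    tracked.setdefault(url, None)


def refresh(scrape_price):
    # Run periodically (e.g. daily): re-scrape only the tracked products.
    for url in tracked:
        tracked[url] = scrape_price(url)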

– Mark Ruse