Do I need to customize my code?
Yes, you do. It is not only because the web-sites have different HTML schemas. It is also about the mechanisms involved in loading and rendering the page: some sites use AJAX to load partial page content, others let JavaScript fill in placeholders on the page, which makes them harder to scrape - there can be lots and lots of differences. On top of that, some sites use anti-web-scraping techniques: they check your headers and behavior, ban you after hitting the site too often, etc.
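As a rough illustration of dealing with that last point, here is a minimal sketch of Scrapy settings that make a spider identify itself and slow down; the actual values and the bot name are made-up examples, not a recommendation:

# settings.py - illustrative "polite scraping" options (example values only)
USER_AGENT = "my-price-bot/1.0 (+http://example.com/bot)"  # hypothetical bot identity
DOWNLOAD_DELAY = 1.5             # pause between requests to the same site
AUTOTHROTTLE_ENABLED = True      # let Scrapy adapt the delay to server response times
DEFAULT_REQUEST_HEADERS = {
    "Accept-Language": "en",
}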
I've also seen cases when prices were kept as images, or obfuscated with "noise" - different tags nested inside one another and hidden using various techniques: CSS rules, classes, JS code, "display: none", etc. For an end-user in a browser the data looked normal, but for a web-scraping "robot" it was a mess.
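To make the "noise" idea concrete, here is a small sketch with made-up markup (the digits, class names and XPath expressions are purely illustrative), using lxml to show how a naive extraction picks up the hidden junk, while skipping the hidden spans recovers the real price:

from lxml import html

# Hypothetical markup showing the "noise" trick: hidden spans mixed into a real price of 19.99
snippet = """
<div class="price">
  <span>1</span>
  <span style="display: none">7</span>
  <span>9</span>
  <span class="hidden">4</span>
  <span>.99</span>
</div>
"""

tree = html.fromstring(snippet)

# A naive scrape grabs every text node, hidden "noise" digits included
naive = "".join("".join(tree.xpath('//div[@class="price"]//text()')).split())
print(naive)  # 1794.99 - not the real price

# Skipping spans hidden via inline CSS or the (made-up) "hidden" class recovers the real value
visible = tree.xpath(
    '//div[@class="price"]/span[not(contains(@style, "display: none"))'
    ' and not(contains(@class, "hidden"))]/text()'
)
print("".join(visible))  # 19.99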
Want to know how price-comparison sites scrape data from all the online stores?
They use APIs whenever possible. When there is no API, web-scraping and HTML parsing are always an option.
The general high-level idea is to split the scraping code into two main parts. The static part is a generic web-scraping spider (the logic) that reads parameters or configuration passed to it. The dynamic part is an annotator - a web-site-specific configuration, usually a set of field-specific XPath expressions or CSS selectors.
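A minimal sketch of that split, assuming the per-site configuration is a dict of XPath expressions and using parsel (the selector library Scrapy is built on); the site names, field names and XPaths are invented for illustration:

from parsel import Selector

# Dynamic part: per-site "annotations" - field name -> XPath (hypothetical examples)
SITE_CONFIGS = {
    "shop-a.example.com": {
        "price": "//span[@class='price']/text()",
        "title": "//h1[@itemprop='name']/text()",
    },
    "shop-b.example.com": {
        "price": "//div[@id='product-price']/text()",
        "title": "//div[@class='product']/h2/text()",
    },
}

# Static part: generic extraction logic that only knows how to apply a config
def extract_fields(html_text, config):
    selector = Selector(text=html_text)
    return {field: selector.xpath(xpath).get() for field, xpath in config.items()}

The generic function never changes; adding a new store is just adding a new entry to the configuration.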
See, as an example, the Autoscraping tool provided by Scrapinghub:
Autoscraping is a tool to scrape web sites without any programming
knowledge. You just annotate web pages visually (with a point and
click tool) to indicate where each field is on the page and
Autoscraping will scrape any similar page from the site.
And, FYI, study what Scrapinghub offers and documents - there is a lot of useful information and a set of unique web-scraping tools.
I've personally been involved in a project where we were building a generic Scrapy spider. As far as I remember, we had a "target" database table where records were inserted via a browser extension (the annotator); the field annotations were kept in JSON:
{
"price": "//div[@class='price']/text()",
"description": "//div[@class='title']/span[2]/text()"
}
The generic spider received a target id as a parameter, read the configuration, and crawled the web-site.
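In that spirit, here is a minimal sketch of such a generic spider. The database access is replaced by a hard-coded dict, and names like GenericSpider and the URLs are made up for illustration:

import scrapy

# Stand-in for the "target" table: target id -> start URL and field annotations (hypothetical data)
TARGETS = {
    "1": {
        "start_url": "http://shop-a.example.com/products",
        "fields": {
            "price": "//div[@class='price']/text()",
            "description": "//div[@class='title']/span[2]/text()",
        },
    },
}

class GenericSpider(scrapy.Spider):
    name = "generic"

    def __init__(self, target_id=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.target = TARGETS[target_id]  # in the real project this came from the database

    def start_requests(self):
        yield scrapy.Request(self.target["start_url"], callback=self.parse)

    def parse(self, response):
        # Apply the per-site XPath annotations with the same generic code for every site
        yield {field: response.xpath(xpath).get()
               for field, xpath in self.target["fields"].items()}

It would be started with something like: scrapy crawl generic -a target_id=1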
We had a lot of problems staying on the generic side. Once a web-site involved JavaScript and AJAX, we had to start writing site-specific logic to get to the desired data.
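For example (a purely hypothetical site and endpoint), one common kind of site-specific logic is to skip the JavaScript-rendered page entirely and request the underlying AJAX/JSON endpoint directly:

import json
import scrapy

class ShopCSpider(scrapy.Spider):
    # Site-specific spider: the product list is filled in by JavaScript,
    # but the data comes from a JSON endpoint we can call directly (made-up URL and fields)
    name = "shop-c"

    def start_requests(self):
        yield scrapy.Request(
            "http://shop-c.example.com/api/products?page=1",
            callback=self.parse_api,
        )

    def parse_api(self, response):
        data = json.loads(response.text)
        for product in data.get("products", []):
            yield {
                "price": product.get("price"),
                "description": product.get("name"),
            }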