
I am trying to scrape content from shopping sites and then save it in my database, in a table called Product. Scraping such content requires knowing the DOM structure of each site — not only the DOM structure, but also the hierarchy of categories in the menu.

There are many solutions that achieve this by setting up a configuration for each site and then looking for the specific HTML elements that contain the product properties (e.g. name, price, model, ...) using regular expressions, XPath, or CSS selectors.
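For instance, the per-site configuration approach can be sketched roughly like this (using Python's standard-library XML parser and its limited XPath subset for brevity; a real scraper would need an HTML-tolerant parser, and all site names, class names, and selectors here are invented):

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site configuration: each site maps product fields
# to the selector (here, ElementTree's limited XPath subset) that locates them.
SITE_CONFIGS = {
    "example-shop": {
        "name":  ".//h1[@class='product-name']",
        "price": ".//span[@class='price']",
        "model": ".//td[@class='model']",
    },
}

def scrape_product(site, html):
    """Extract product fields from a (well-formed) page using the site's config."""
    root = ET.fromstring(html)
    product = {}
    for field, xpath in SITE_CONFIGS[site].items():
        node = root.find(xpath)
        product[field] = node.text.strip() if node is not None else None
    return product

page = """
<html><body>
  <h1 class="product-name">Acme Widget</h1>
  <span class="price">19.99</span>
  <table><tr><td class="model">AW-100</td></tr></table>
</body></html>
"""

print(scrape_product("example-shop", page))
```

The point is that only the `SITE_CONFIGS` entry changes per site; the extraction code stays the same.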

Is there any solution that avoids setting up a configuration for each site and scrapes the product properties automatically?

There are similar solutions that deal with news content, like Readability, which looks for sequences of `<p>` tags and images. News is an easier case because news sites resemble one another and have a simpler structure.
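The Readability-style heuristic mentioned above can be sketched in a few lines (this is only an illustration of the idea, not Readability's actual algorithm; the standard-library parser again stands in for a lenient HTML parser):

```python
import xml.etree.ElementTree as ET

def main_content(html):
    """Heuristic sketch: pick the element with the most direct <p> children,
    on the assumption that article text clusters in one container."""
    root = ET.fromstring(html)
    best, best_score = None, 0
    for elem in root.iter():
        score = sum(1 for child in elem if child.tag == "p")
        if score > best_score:
            best, best_score = elem, score
    return best

page = """<html><body>
  <div class="sidebar"><p>ad</p></div>
  <div class="article"><p>First paragraph.</p><p>Second.</p><p>Third.</p></div>
</body></html>"""

print(main_content(page).get("class"))
```

Product pages lack this kind of uniform signal, which is why the same trick does not transfer directly to shops.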

user968159
  • You could automate the process: given a text value, find the text on the page and then [generate a CSS selector for the containing element](http://stackoverflow.com/a/4588211/405017). However, there's no guarantee that a generated selector will be stable. You could spend some days on a script that gathered multiple pages and used heuristics to attempt to find a common pattern…or you could just use your brain to generate a good selector based on obvious (to a human) patterns. – Phrogz Sep 01 '13 at 03:39
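The approach Phrogz describes — find a known text value on the page, then generate a selector for its containing element — can be sketched as follows (standard-library XML parsing for brevity; a real implementation would parse HTML leniently and check that the generated selector is stable across pages):

```python
import xml.etree.ElementTree as ET

def selector_for_text(html, text):
    """Return a rough CSS-like selector for the element containing `text`."""
    root = ET.fromstring(html)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for elem in root.iter():
        if elem.text and elem.text.strip() == text:
            # Walk back up to the root, recording tag (and class) at each step.
            parts, node = [], elem
            while node is not None:
                cls = node.get("class")
                parts.append(node.tag + ("." + cls if cls else ""))
                node = parents.get(node)
            return " > ".join(reversed(parts))
    return None

page = "<html><body><div class='item'><span class='price'>19.99</span></div></body></html>"
print(selector_for_text(page, "19.99"))
```

As the comment warns, a selector generated this way may break on the next page layout; the heuristics for choosing a *stable* selector are the hard part.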

2 Answers


If the websites you want to scrape share no general pattern in their HTML structure, you must configure your script for every website.

Only if you are lucky will you not have to reconfigure your script.

PS: in general, web scrapers build their code from scratch.

– Drk_alien

There is no magic bullet; however, what you could do is use XSLT as the main "binding" between your site and your scraping program. XSLT support is built into the Html Agility Pack.

At least it will minimize the amount of work required when a site evolves or changes its structure, compared to relying only on pure procedural code. Changing the XSLT text (once you're used to it) does not require recompilation and is closer to "configuring" the system. But still, you'll have to define at least one XSLT file per target website (unless those websites are built on the same software, of course).
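As a rough illustration, such a per-site stylesheet might look like this (the element names and classes are invented; each target site would get its own variant of this file):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical per-site stylesheet: maps one shop's markup
     to a normalized <product> record. Selectors are illustrative only. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <product>
      <name><xsl:value-of select="//h1[@class='product-name']"/></name>
      <price><xsl:value-of select="//span[@class='price']"/></price>
      <model><xsl:value-of select="//td[@class='model']"/></model>
    </product>
  </xsl:template>
</xsl:stylesheet>
```

When the site changes, only this file needs editing, not the compiled scraper.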

You may check this link for an XSLT example: Use HtmlAgilityPack to divy up a document

– Simon Mourier