Smart data extraction algorithm from websites

Question

I'm building a deal aggregator, so I need a crawler that extracts data from several sites: price, discount, image, coordinates and, of course, the name of the deal.

Do you know of any tutorials, ebooks or other resources that would help me? For the image, coordinates and discount I already have a solution and pattern:

image: the biggest image is always the main image of the deal
discount: always a number between 50 and 99, and always followed by a "%" symbol
coordinates: always decimal numbers, so I get them with a regex
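
For illustration, the discount and coordinate patterns above could be sketched roughly like this in Python. The function names are made up for the example, and the coordinate regex assumes a "lat, lon" pair where both values have a fractional part, which is an assumption and not something every site will follow:

```python
import re

# Discount: a two-digit number from 50 to 99 followed by "%".
# The lookbehind rejects longer numbers such as "150%".
DISCOUNT_RE = re.compile(r"(?<!\d)([5-9]\d)\s*%")

# Coordinates: two signed decimal numbers separated by a comma.
# Assumption: latitude comes first and both values contain a decimal point.
COORD_RE = re.compile(r"(?<![\d.])(-?\d{1,2}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")


def find_discount(text):
    """Return the first discount between 50 and 99 as an int, or None."""
    match = DISCOUNT_RE.search(text)
    return int(match.group(1)) if match else None


def find_coordinates(text):
    """Return the first (lat, lon) pair that lies in a valid range, or None."""
    for match in COORD_RE.finditer(text):
        lat, lon = float(match.group(1)), float(match.group(2))
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            return lat, lon
    return None
```

Note that a pair of prices such as "12.50, 3.20" would also pass the coordinate check, so in practice the match probably needs extra context (for example, a nearby map link).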

How do I get the following items?

Name of deal?
Price?

Do you know of any data extraction algorithms that can be helpful?

score 1 · Accepted Answer · edited May 23 '17 at 11:55


I'd suggest using an XPath-based scraper, for example Web-Harvest.
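
As a rough sketch of the XPath idea (this is not Web-Harvest itself, just Python's standard library), one XPath expression per field can be applied to the page markup. The sample markup, the class names and the `extract_fields` helper are all hypothetical. `xml.etree.ElementTree` only handles well-formed markup and a subset of XPath, so real-world HTML would likely need an HTML-tolerant parser instead:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed deal page fragment; real markup differs per site.
PAGE = """
<html><body>
  <div class="deal">
    <h1 class="deal-title">Spa weekend for two</h1>
    <span class="deal-price">49.90</span>
  </div>
</body></html>
"""

# One XPath per field; these expressions are assumptions tied to the sample markup.
FIELD_XPATHS = {
    "name": ".//h1[@class='deal-title']",
    "price": ".//span[@class='deal-price']",
}


def extract_fields(markup, xpaths):
    """Return {field: text} for every XPath that matches an element."""
    root = ET.fromstring(markup.strip())
    result = {}
    for field, path in xpaths.items():
        node = root.find(path)
        if node is not None and node.text:
            result[field] = node.text.strip()
    return result
```

With this layout, supporting another site means adding another dictionary of expressions; the extraction code itself stays the same.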

Or, if you want to analyze raw text, I'd suggest a state-machine parser for recognizing templated parts of the text.
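
A minimal sketch of what such a state machine might look like, assuming the template to recognize is "currency marker followed by a number". The token rules, the currency set and the `find_prices` name are illustrative assumptions, and a comma is treated as a decimal separator, so thousands separators are not handled:

```python
import re

# Tokens: numbers (with optional decimal part), currency symbols, words, other characters.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)?|[$€£]|\w+|\S")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")
CURRENCY = {"$", "€", "£", "USD", "EUR", "GBP"}  # assumed set of markers


def find_prices(text):
    """Scan tokens with two states: START, then SAW_CURRENCY, then emit a price."""
    prices = []
    state = "START"
    currency = None
    for token in TOKEN_RE.findall(text):
        if state == "START":
            if token in CURRENCY:
                state, currency = "SAW_CURRENCY", token
        elif state == "SAW_CURRENCY":
            if NUMBER_RE.fullmatch(token):
                # Template complete: currency marker followed by a number.
                prices.append((currency, float(token.replace(",", "."))))
                state = "START"
            elif token in CURRENCY:
                currency = token  # a newer marker replaces the old one
            else:
                state = "START"  # template broken, start over
    return prices
```

More states can be added for other templates (for example, a number followed by a currency code), which is where this approach tends to scale better than a single regex.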

Look at this topic: Are there APIs for text analysis/mining in Java?

answered Jun 14 '12 at 08:44

stemm


If you have access to the HTML source of the target sites, you can construct XPath expressions for the scraper. This works because the position of the title, price and other text elements is usually tied to specific HTML tags – stemm Jun 14 '12 at 08:50
Yes, I have about 10,000 sites (group buying), but I don't want to create 10,000 scrapers, one per site... so I need a single solution for all those sites – Michael Froter Jun 14 '12 at 10:21
As I said, I have a solution for the image and discount, but I don't have a good solution for the other elements – Michael Froter Jun 14 '12 at 10:21
Look at a state-machine parser for recognizing templated parts of raw text: http://stackoverflow.com/questions/6800509/are-there-apis-for-text-analysis-mining-in-java/6800681#6800681 – stemm Jun 14 '12 at 11:57
