How to filter the crawler data using Java?

Question

We have already fetched the URLs using the jsoup library and stored them in the database. Now we want to extract the data and store it in the database, but we only need specific fields rather than the whole page. For example, when we fetch http://www.flipkart.com/shoes/ we need fields such as brands, prices, and reviews. How can we do this in Java? Please help!

score -2 · Answer 1 · edited Aug 11 '16 at 04:49


There are two ways to extract only the needed fields from the whole page content:

Apply a regex to the response content and extract the needed fields.
Use XPath to extract the needed fields (the preferred and recommended way of parsing).

Ex: 1 - Regex

Write a regex pattern that matches the fields you need on your selected page.
Get the response as a String, apply the pattern to it, and retrieve the data from the matches.
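As a rough sketch of these two steps, the snippet below applies a pattern to a response held in a String and collects the captured group from each match. The markup (a span with class "price" and an "Rs." prefix), the class name RegexFieldExtractor, and the method extractPrices are all hypothetical; the real page will almost certainly use different markup, and a pattern like this breaks as soon as the markup changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFieldExtractor {

    // Assumed markup (hypothetical): <span class="price">Rs. 1,299</span>
    // Group 1 captures the digits and thousands separators of the price.
    private static final Pattern PRICE =
            Pattern.compile("<span class=\"price\">\\s*Rs\\.\\s*([\\d,]+)\\s*</span>");

    // Returns the captured price text of every match, in document order.
    static List<String> extractPrices(String html) {
        List<String> prices = new ArrayList<>();
        Matcher matcher = PRICE.matcher(html);
        while (matcher.find()) {
            prices.add(matcher.group(1));
        }
        return prices;
    }

    public static void main(String[] args) {
        // Stand-in for the response body fetched by the crawler.
        String html = "<div class=\"product\"><span class=\"brand\">Acme</span>"
                + "<span class=\"price\">Rs. 1,299</span></div>"
                + "<div class=\"product\"><span class=\"brand\">Zeta</span>"
                + "<span class=\"price\">Rs. 899</span></div>";
        System.out.println(extractPrices(html));
    }
}
```

Each field (brand, price, reviews) would need its own pattern written against the actual page source.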

Ex: 2 - XPath

Identify an XPath expression that locates each HTML element you need, either uniquely or as a list.
Get the response in HTML/XML form, apply the XPath expressions to the retrieved content, and read the data from the matching nodes.
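A minimal sketch of the XPath approach, using only the JDK's javax.xml.xpath API. It assumes the content is well-formed XML/XHTML; real-world HTML often is not, so it may first need converting to a W3C DOM (for example with jsoup's W3CDom helper). The markup, the class name XPathFieldExtractor, and the XPath expressions are hypothetical and would have to be adapted to the actual page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathFieldExtractor {

    // Parses well-formed XHTML and returns the trimmed text of every node
    // selected by the given XPath expression, in document order.
    static List<String> extract(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                    (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            // Parsing or XPath evaluation failed, e.g. the markup is not well-formed.
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the page content; the class names are assumptions.
        String xhtml = "<html><body>"
                + "<div class=\"product\"><span class=\"brand\">Acme</span>"
                + "<span class=\"price\">Rs. 1,299</span></div>"
                + "<div class=\"product\"><span class=\"brand\">Zeta</span>"
                + "<span class=\"price\">Rs. 899</span></div>"
                + "</body></html>";
        System.out.println(extract(xhtml, "//div[@class='product']/span[@class='brand']"));
        System.out.println(extract(xhtml, "//div[@class='product']/span[@class='price']"));
    }
}
```

One expression per field keeps the extraction readable, and the resulting lists can be paired up by index before being written to the database.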

edited Aug 11 '16 at 04:49

Vikrant Kashyap


answered Aug 02 '16 at 05:56

Hakuna Matata



Regex should not be used to parse html. http://stackoverflow.com/a/6751339/1176178 – Zack Aug 02 '16 at 13:08
