0

I am currently experimenting with jsoup, and my goal is to extract data from this retail website in the following form:

 Title: blabl
 Link: foba
 Grösse: 9999
 KP: FALSE
 Miete: TRUE
 Preis: 1923,23

So far I have written this test program:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class jsoup_test {
    public static void main(String[] args) throws IOException {
        String url = "http://derstandard.at/anzeiger/immoweb/Suchergebnis.aspx?Regionen=9&Bezirke=&Arten=&AngebotTyp=&timestamp=1363305908912";
        print("Fetching %s...", url);

        // Download and parse the search result page.
        Document doc = Jsoup.connect(url).get();
        // Select the result rows by their CSS class.
        Elements price = doc.select("tr.topangebot");
        Elements price1 = doc.select("tr.white");

        // Printing an Elements object dumps the outer HTML of every match.
        System.out.println("--------------------------------");
        System.out.println(price);
        System.out.println("--------------------------------");
        System.out.println(price1);
    }

    // Small printf-style helper.
    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }
}

However, this program prints the data as raw HTML, like this:

<tr id="ctl00_Body_mc_cErgebnisListe1_ctl02_InseratInfoTR" class="topangebot"> 
 <td class="BildTD" rowspan="2"> <a href="/anzeiger/immoweb/Detail.aspx?InseratID=6847212&amp;FromTopAngebot=true"><img border="0" src="http://images.derstandard.at/t/22/upload/imagesanzeiger/immoupload/2013/02/27/277515f7-f935-4a13-83fb-dbe3af930e28.jpg" alt="" /></a> </td> 
 <td class="TitleTD" rowspan="2"> <span class="neu">TOP!</span> <strong><a href="/anzeiger/immoweb/Detail.aspx?InseratID=6847212&amp;FromTopAngebot=true">Gehobene Qualit&auml;t, Design und exquisite Ausf&uuml;hrung: Dachausbau mit Weitblick und 100 m&sup2; Terrasse</a></strong><br /><a href="/anzeiger/immoweb/Detail.aspx?InseratID=6847212&amp;FromTopAngebot=true">Wien 16.,Ottakring, Dachgeschoss</a><br /><span style="color: gray">Erstbezug, K&uuml;che, Parkettboden, Hauptmiete, Terrasse, Lift, Keller, Altbau, Kabel/Sat-TV, Barrierefrei</span> </td> 
 <td class="GroessenTD" rowspan="2"> <span class="strong">125 m&sup2;</span><br /><span class="strong">4&nbsp;</span>Zimmer </td> 
 <td class="PreisTD" style="border:none;"> <span class="light">Miete</span>&nbsp;2.190&nbsp;<br /> </td> 
</tr>
<tr id="ctl00_Body_mc_cErgebnisListe1_ctl02_MerklisteTR" class="topangebot"> 
 <td class="merkliste"> </td> 
</tr>
<tr id="ctl00_Body_mc_cErgebnisListe1_ctl03_InseratInfoTR" class="topangebot"> 
 <td class="BildTD" rowspan="2"> <a href="/anzeiger/immoweb/Detail.aspx?InseratID=6871213&amp;FromTopAngebot=true"><img border="0" src="http://images.derstandard.at/t/22/upload/imagesanzeiger/immoimporte/justimmo2/files.justimmo.at/public/pic/big/AEs_YegpKC.JPG" alt="" /></a> </td> 
 <td class="TitleTD" rowspan="2"> <span class="neu">TOP!</span> <strong><a href="/anzeiger/immoweb/Detail.aspx?InseratID=6871213&amp;FromTopAngebot=true">HS-IMMO: 14. PREISSENSATION Eckzinshaus 1414m&sup2; Leerstand - Gesamtnutzfl&auml;che 1670m&sup2; + Rohdachboden ca. 700m&sup2; erzielbar ( Baubescheid ) € 1555.-/m&sup2; NFL</a></strong><br /><a href="/anzeiger/immoweb/Detail.aspx?InseratID=6871213&amp;FromTopAngebot=true">Wien 14.,Penzing, Zinshaus</a><br /><span style="color: gray">Parkettboden, Altbau, Kabel/Sat-TV</span> </td> 
 <td class="GroessenTD" rowspan="2"> <span class="strong">1.670 m&sup2;</span><br /> </td> 
 <td class="PreisTD" style="border:none;"> <span class="light">KP</span>&nbsp;2.590.000&nbsp;<br /> </td> 
</tr>...

This is not in a human-readable format. So my question is: how can I get jsoup to extract the data directly in the format I want?

Thanks for your replies!

maximus

3 Answers

1

For example, to select the title you can do something like this:

String title = doc.select("tr.topangebot > td.TitleTD").first().text();
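
Extending this to every field in the question might look like the sketch below. It parses a trimmed copy of one result row from the question's output instead of fetching the live page, so it runs offline. The class name `ListingExtractor`, the `extract` helper, the base URI `http://derstandard.at/`, and the choice to keep only digits for the size and only digits and separators for the price are assumptions made for illustration; they are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ListingExtractor {

    // Trimmed copy of one listing from the question's output (title shortened).
    public static final String SAMPLE_HTML =
          "<table>"
        + "<tr class=\"topangebot\">"
        + "<td class=\"TitleTD\"><span class=\"neu\">TOP!</span> <strong>"
        + "<a href=\"/anzeiger/immoweb/Detail.aspx?InseratID=6847212\">Dachausbau mit Weitblick</a>"
        + "</strong></td>"
        + "<td class=\"GroessenTD\"><span class=\"strong\">125 m&sup2;</span><br />"
        + "<span class=\"strong\">4&nbsp;</span>Zimmer</td>"
        + "<td class=\"PreisTD\"><span class=\"light\">Miete</span>&nbsp;2.190&nbsp;<br /></td>"
        + "</tr>"
        + "<tr class=\"topangebot\"><td class=\"merkliste\"> </td></tr>"
        + "</table>";

    /** Returns one formatted text block per listing row found in the HTML. */
    public static List<String> extract(String html, String baseUri) {
        // The base URI lets absUrl("href") resolve the relative links.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> listings = new ArrayList<String>();

        for (Element row : doc.select("tr.topangebot, tr.white")) {
            Element titleLink = row.select("td.TitleTD strong a").first();
            Element sizeSpan = row.select("td.GroessenTD span.strong").first();
            Element priceCell = row.select("td.PreisTD").first();
            if (titleLink == null || priceCell == null) {
                continue; // e.g. the "MerklisteTR" rows carry no listing data
            }

            // Keep only the digits of the first size value; the unit is dropped.
            String size = sizeSpan == null ? "" : sizeSpan.text().replaceAll("[^0-9]", "");
            // The label inside span.light is "Miete" (rent) or "KP" (purchase price).
            String kind = priceCell.select("span.light").text();
            // The amount is the cell's own text; strip spaces and non-breaking spaces.
            String price = priceCell.ownText().replaceAll("[^0-9.,]", "");

            listings.add(
                  "Title: " + titleLink.text() + "\n"
                + "Link: " + titleLink.absUrl("href") + "\n"
                + "Gr\u00f6sse: " + size + "\n"   // \u00f6 is the o-umlaut
                + "KP: " + String.valueOf("KP".equals(kind)).toUpperCase() + "\n"
                + "Miete: " + String.valueOf("Miete".equals(kind)).toUpperCase() + "\n"
                + "Preis: " + price);
        }
        return listings;
    }

    public static void main(String[] args) {
        for (String listing : extract(SAMPLE_HTML, "http://derstandard.at/")) {
            System.out.println(listing);
            System.out.println();
        }
    }
}
```

Against the live page, `Jsoup.connect(url).get()` would take the place of `Jsoup.parse(html, baseUri)`; the document's base URI then comes from the fetched URL. The selectors match the markup shown in the question and may need adjusting if the site's layout differs.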
MariuszS
0

You can navigate the page using the DOM if you know the page structure:

http://jsoup.org/cookbook/extracting-data/dom-navigation

This question lists a number of good web scrapers:

Web scraping with Java
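
As a rough illustration of the DOM-navigation style described in the cookbook link, the sketch below looks up one result row by its `id` and then walks down to the title link with `getElementsByClass` and `getElementsByTag`. The class name `DomNavigationSketch`, the `titleAndHref` helper, and the inline sample HTML (a shortened copy of a row from the question) are assumptions for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomNavigationSketch {

    // Row id taken from the output shown in the question.
    public static final String ROW_ID = "ctl00_Body_mc_cErgebnisListe1_ctl02_InseratInfoTR";

    // Shortened copy of that row (title trimmed, some cells omitted).
    public static final String SAMPLE_HTML =
          "<table><tr id=\"" + ROW_ID + "\" class=\"topangebot\">"
        + "<td class=\"TitleTD\"><strong>"
        + "<a href=\"/anzeiger/immoweb/Detail.aspx?InseratID=6847212\">Dachausbau mit Weitblick</a>"
        + "</strong></td>"
        + "<td class=\"PreisTD\"><span class=\"light\">Miete</span>&nbsp;2.190&nbsp;</td>"
        + "</tr></table>";

    /** Returns { link text, href attribute } of the title link in the given row. */
    public static String[] titleAndHref(String html, String rowId) {
        Document doc = Jsoup.parse(html);

        // Step 1: find the row by id, like document.getElementById in a browser.
        Element row = doc.getElementById(rowId);
        // Step 2: narrow down to the title cell by class name.
        Element titleCell = row.getElementsByClass("TitleTD").first();
        // Step 3: take the first anchor inside that cell.
        Element link = titleCell.getElementsByTag("a").first();

        return new String[] { link.text(), link.attr("href") };
    }

    public static void main(String[] args) {
        String[] result = titleAndHref(SAMPLE_HTML, ROW_ID);
        System.out.println("Title: " + result[0]);
        System.out.println("Link: " + result[1]);
    }
}
```

Looking rows up by `id` only works while the generated ids stay stable, so selecting by class is probably the more robust choice when iterating over all results.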

Will
0

I like to use Jsoup because its methods were built for DOM traversal. If you are comfortable with HTML, CSS, and jQuery, this library was built for you. Yes, the Jsoup approach may be too fast. Yes, it may not suit your needs. But when it comes to gathering information from most kinds of websites, Jsoup is flexible enough to do the job.