5

I want to automatically collect data about real estate from this site:

LINK

However, they do not have an API. How would you generally do that? I am thankful for every response!

Cœur
user2051347
  • 5
    The search term you'll want to use is "web scraping". – T.J. Crowder Mar 14 '13 at 07:27
  • 2
    Take a look at this http://stackoverflow.com/questions/2861/options-for-html-scraping – Angelo.Hannes Mar 14 '13 at 07:27
  • 1
    Could there be a restriction by the server if I use such a package? – user2051347 Mar 14 '13 at 07:28
  • Both scraping and data-collection therein may be against the terms of use (however legally enforceable or not). –  Mar 14 '13 at 07:31
  • You can get the HTML content from a specified URL (http://stackoverflow.com/questions/1414302/how-can-i-get-html-content-from-a-specific-url-on-server-side-by-using-java), and then read the HTML file as a DOM (http://stackoverflow.com/questions/457684/reading-html-file-to-dom-tree-using-java). – heobo Mar 14 '13 at 07:52

2 Answers

2

You will have to download the page yourself and parse the information out of the HTML yourself.

You will probably want to look into the Pattern class and regular expressions; the URL and String classes will also be very useful.

You could also use an HTML parsing library to make it easier, for example http://htmlparser.sourceforge.net/.

This is a very general question, so I can't provide code specific to your site, but the technique is known as web scraping.
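
As a rough illustration of the classes mentioned above, here is a minimal sketch that uses only the JDK: java.net.URL to download a page into a String, and java.util.regex.Pattern to pull values out of it. The class name LinkScraper, its method names, and the sample HTML are hypothetical and not taken from the site in the question. The regex assumes links are written as double-quoted href attributes and will miss anything else, which is one reason an HTML parser library tends to be more robust.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class LinkScraper {
    // Assumption: links appear as href="..." with double quotes.
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Downloads the page at the given address and returns its body as text. */
    static String download(String address) throws IOException {
        URL url = URI.create(address).toURL();
        StringBuilder page = new StringBuilder();
        // Assumption: the page is UTF-8 encoded.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    /** Returns the value of every double-quoted href attribute, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        // With an argument: scrape that URL. Without: use a made-up sample page.
        String html = args.length > 0
                ? download(args[0])
                : "<a href=\"/flat/1\">Flat 1</a> <a href=\"/flat/2\">Flat 2</a>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Keeping the download and the extraction in separate methods means you can try the regex on a saved copy of the page without hitting the server on every run.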

Austin
  • Do I have to download it, or is there a way to just send HTTP requests? – user2051347 Mar 14 '13 at 07:30
  • @user2051347 You can request any info you want, but it's not just going to magically appear in your data. I'm not sure what you're asking. – Austin Mar 14 '13 at 07:31
  • 1
    I mean that I just send an HTTP request, get the HTML page back, and search the code for a keyword, without really downloading the page. – user2051347 Mar 14 '13 at 07:33
  • @user2051347 What are you going to search if you don't download the page to some type of data in your program first? You can't find a keyword of something that isn't available to your program in the first place. – Austin Mar 14 '13 at 07:35
  • 2
    @user2051347 You really need to educate yourself on how HTTP, HTML and WWW work. – Jakub Zaverka Mar 14 '13 at 07:35
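
To make the point of these comments concrete: the body of an HTTP response is the page, so reading the response is the download. What you can avoid is saving it to a file or holding all of it in memory, by scanning the stream line by line and stopping at the first hit. Below is a minimal sketch under that assumption. The class KeywordScanner and the method containsKeyword are hypothetical names, and the sample text stands in for a real response body.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper class, for illustration only.
class KeywordScanner {
    /**
     * Reads the source line by line and returns true as soon as one line
     * contains the keyword. Only the current line is held in memory.
     * Limitation: a keyword split across a line break is not found.
     */
    static boolean containsKeyword(Reader source, String keyword) throws IOException {
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains(keyword)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a response body. Real code would wrap the connection's
        // input stream in an InputStreamReader and pass that instead.
        Reader body = new StringReader("<html>\n<p>Wohnung in Wien</p>\n</html>");
        System.out.println(containsKeyword(body, "Wien"));
    }
}
```

Because the method accepts any Reader, the same code works on a live response, a saved file, or a test string.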
0

This is how you get all the content from the page. You can then parse the page data as you want:

package farzi;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class GetXMLTask
{
    public static void main(String[] args)
    {
        // Note: DefaultHttpClient is deprecated in HttpClient 4.3 and later.
        HttpClient httpClient = new DefaultHttpClient();
        // Fetching a page is a GET request, not a POST.
        HttpGet httpGet = new HttpGet("http://derstandard.at/anzeiger/immoweb/Suchergebnis.aspx?Regionen=9&Bezirke=&Arten=&AngebotTyp=&timestamp=1363245585829");
        try
        {
            HttpResponse response = httpClient.execute(httpGet);
            // Print a summary of the response.
            System.out.println(response.toString());
            // Read the response body into a string, assuming it is UTF-8 encoded.
            BufferedReader in = new BufferedReader(new InputStreamReader(response.getEntity().getContent(), "UTF-8"));
            StringBuilder builder = new StringBuilder();
            char[] buf = new char[1000];
            int l;
            while ((l = in.read(buf)) >= 0)
            {
                builder.append(buf, 0, l);
            }
            in.close();
            System.out.println(builder.toString());
        }
        catch (IOException e) {
            // execute() and read() only throw IOException (ClientProtocolException
            // is a subclass); catching checked exceptions that are never thrown
            // would not compile.
            System.out.println("IOException :" + e);
            e.printStackTrace();
        }
        finally {
            // Release the connection resources.
            httpClient.getConnectionManager().shutdown();
        }
    }
}