0

I want to get all the content of this website: http://globoesporte.globo.com/temporeal/futebol/20-10-2013/botafogo-vasco/

I am especially interested in the elements at the bottom right of the screen, called 'estatisticas'.

I've tried inspecting the page with Firebug and fetching the HTML with jsoup, but it didn't work: jsoup couldn't find the content I wanted. I don't know which techniques or APIs I should use to get all the data from the website, and I would appreciate any help.

Thanks in advance.

lucasdc
  • You may try using Apache HttpClient to send a GET request to the site, read the whole response into a `String`, and then extract the data from that `String` manually. – Luiggi Mendoza Oct 22 '13 at 05:35
  • See this: http://stackoverflow.com/questions/3202305/web-scraping-with-java/6775957#6775957 – Nishan Oct 22 '13 at 05:36

3 Answers

2

The 'estatisticas' are loaded by an AJAX call after the page itself has loaded, so they are not in the HTML that jsoup downloads. You can't scrape them from the page because they're not there.

You can, however, get them in JSON format at this address: http://globoesporte.globo.com/temporeal/futebol/20-10-2013/botafogo-vasco/estatisticas.json
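As a rough sketch of how that JSON could be downloaded with only the Java standard library: the class name `StatsFetcher` and the helper `statsUrl` are made up for illustration, the response is assumed to be UTF-8, and the address may no longer be live. The structure of the JSON is not shown here, so the example only prints the raw text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class StatsFetcher {

    // Hypothetical helper: appends "estatisticas.json" to a match page URL.
    static String statsUrl(String matchUrl) {
        String base = matchUrl.endsWith("/") ? matchUrl : matchUrl + "/";
        return base + "estatisticas.json";
    }

    // Downloads the body of the given URL as one string (UTF-8 is assumed).
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String match = "http://globoesporte.globo.com/temporeal/futebol/20-10-2013/botafogo-vasco/";
        // Prints the raw JSON; parsing it would need a JSON library of your choice.
        System.out.println(fetch(statsUrl(match)));
    }
}
```

The same approach should work for other matches if their pages expose an `estatisticas.json` file at the same relative path, though that is a guess based on this one URL.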

0

For that you need an HTML parser such as jsoup or HTML Parser. If you just want the raw page source, including the HTML tags, you can also try this code:

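import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

URL url = new URL("http://www.example.com");
// try-with-resources closes the stream even if reading fails (UTF-8 is assumed)
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
    String line;
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
}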
Simmant
0

If you intend to crawl a website, you can use HttpClient, which supports almost all HTTP operations. Here's a code snippet that may suit what you want:

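import java.io.InputStream;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

HttpClient httpclient = new DefaultHttpClient();
HttpGet httpget = new HttpGet("http://globoesporte.globo.com/temporeal/futebol/20-10-2013/botafogo-vasco/");
HttpResponse response = httpclient.execute(httpget);
HttpEntity entity = response.getEntity();
if (entity != null) {
    InputStream instream = entity.getContent();
    try {
        // read the response body from instream here
    } finally {
        instream.close();
    }
}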

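P.S. The Maven dependency for HttpClient. Note that `DefaultHttpClient` and `HttpGet` belong to Apache HttpComponents 4.x, not to the older commons-httpclient 3.1 artifact:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.13</version>
</dependency>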

Hope it helps :)

Judking