I want to crawl the data behind the following link with a Java program. The first page is no problem, but when I request the following pages, I get the same source code as for page one. The information I need is in an array inside a JSON document that is returned in the response to a GET request. You can see the request settings and the response in this picture.

This is the link for the page with the JSON.

I found posts like this one: Get a JSON object from a HTTP response, but if I use the getContent() method I only get the content of the page, not the full HTTP body.
I also tried the getEntity() method and many other things, but none of them worked.
Most other posts read JSON from pages that include the JSON in the source code, like here.
Any ideas how I could get the full JSON or, better, just the array?

Appreciate your help, kind regards.

nerano
  • Finally found an answer for a Java application, see [here](http://stackoverflow.com/questions/36753737/read-full-content-of-a-web-page-in-java). – nerano Apr 24 '16 at 00:05

1 Answer

I'm not sure what you are trying to do, but I'll try to figure it out. You want to grab the contents of this page with all the results of the search "247 Mitfahrgelegenheiten von Frankfurt nach Muenchen", right?

If so, you cannot just do a simple HTTP GET of this page, since the website has active content that needs to be interpreted and executed by an HTML/CSS/JavaScript rendering engine, that is, a browser or a browser engine such as WebKit. Luckily, there are several tools that help you do this, in several languages. The simplest is PhantomJS, which is scripted in JavaScript.

Getting that page is as simple as putting this in a JavaScript source file:

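console.log('Loading a web page');
var page = require('webpage').create();
var url = 'http://phantomjs.org/';
page.open(url, function (status) {
  // The callback runs once the page has finished loading.
  if (status === 'success') {
    // page.content is the page's current HTML, including changes made by scripts so far.
    console.log(page.content);
  } else {
    console.log('Failed to load ' + url);
  }
  phantom.exit();
});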

Of course, there is a little more work to do, but PhantomJS comes with many examples that show how to wait for page content to load and how to execute JavaScript within the page, so you can get the whole page content as you see it in a real browser.
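
Since the question asks for Java: if the JSON really is delivered by a separate GET request, as the question describes, it may also be possible to skip rendering and request that URL directly. The following is only a minimal sketch under several assumptions: it needs Java 11 or later for java.net.http, the class and method names (JsonFetchSketch, fetchBody, extractArray) are made up for illustration, the X-Requested-With header is a guess at what the site might expect, and the key "results" is hypothetical. The array extraction is a naive bracket matcher that takes the first occurrence of the key; a JSON library such as Jackson or Gson would be the more robust choice.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class JsonFetchSketch {

    // Sends a GET request and returns the complete response body as a string.
    static String fetchBody(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")
                // Assumption: the site may only answer with JSON when the request looks like an XHR.
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    // Returns the raw text of the first array that follows "key", or null if none is found.
    static String extractArray(String json, String key) {
        int keyPos = json.indexOf("\"" + key + "\"");
        if (keyPos < 0) {
            return null;
        }
        int start = json.indexOf('[', keyPos);
        if (start < 0) {
            return null;
        }
        int depth = 0;
        boolean inString = false;
        for (int i = start; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '[') {
                depth++;
            } else if (c == ']' && --depth == 0) {
                return json.substring(start, i + 1);
            }
        }
        return null; // brackets never balanced
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java JsonFetchSketch <json-url> <array-key>");
            return;
        }
        String body = fetchBody(args[0]);
        System.out.println(extractArray(body, args[1]));
    }
}
```

The URL to pass in would be the one the browser's network tab shows for the JSON request, not the address of the HTML page. If the site builds that request from cookies or tokens set by its scripts, this direct approach will not work and a rendering tool such as PhantomJS is still needed.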

loretoparisi
  • Thank you for your answer, loretoparisi. You are right, I am trying to read the full content of the website and use the data. I will look into PhantomJS and try to use it for my purposes. – nerano Apr 14 '16 at 09:00