I am trying to get a list of the content on a website (this one, if anyone is interested). The layout changed recently, and the content is no longer delivered with the initial HTML; it is loaded afterwards by some magic (probably JavaScript). I'm currently using Jsoup to parse the HTML, but I'm open to suggestions.

This is what I am getting:

<div class="row" data-v-6e4dbe9e>
 <div class="col-17 podcasts-group" data-v-6e4dbe9e>
  <div class="loading-spinner" data-v-6e4dbe9e>      //the devil himself
   <div class="spinner" data-v-ac3cb376 data-v-6e4dbe9e>
    <div class="rect1" data-v-ac3cb376></div>
    <div class="rect2" data-v-ac3cb376></div>
    <div class="rect3" data-v-ac3cb376></div>
    <div class="rect4" data-v-ac3cb376></div>
    <div class="rect5" data-v-ac3cb376></div>
   </div>
  </div>
  <div mode="in-out" class="transition-group row" data-v-6e4dbe9e>
   //Here should be stuff!
  </div>
 </div>
</div>

The code that produces this:

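import org.jsoup.Jsoup;
import org.jsoup.select.Elements;
String selector = "div.podcasts-items";
// Jsoup fetches and parses the static HTML only; no JavaScript is executed
Elements elem = Jsoup.connect(link).get().select(selector);
System.out.println("html: " + elem.html());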

This is what I would like to see (copied from Inspect Element after the page has loaded all the content):

<div class="row" data-v-6e4dbe9e>
 <div class="col-17 podcasts-group" data-v-6e4dbe9e>
  <!---->  //begone evil!
  <div mode="in-out" class="transition-group row" data-v-6e4dbe9e>
   <div class="col-17 col-md-8 center-margin" data-v-6e4dbe9e="">...</div>
   <div class="col-17 col-md-8 center-margin" data-v-6e4dbe9e="">...</div>
   <div class="col-17 col-md-8 center-margin" data-v-6e4dbe9e="">...</div>
   <div class="col-17 col-md-8 center-margin" data-v-6e4dbe9e="">...</div>
  </div>
 </div>
</div>

Google doesn't help much, because everything I find about spinners and the like is about JavaScript.

Solution:

Because Jsoup only downloads the HTML and does not execute any JavaScript, the page never had a chance to load the content. You would have to use an actual browser engine or a WebDriver tool like Selenium to get the data to load.

For this specific problem, I was able to get the content directly by loading the JSON data from the website's API.
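A minimal sketch of the API approach, assuming Java 11 or later and using only the JDK's built-in HTTP client. The endpoint and the `station` and `limit` parameters are taken from the URL quoted in the comments below and may have changed since 2018; the class and method names (`PodcastApiSketch`, `buildApiUri`, `fetchBody`) are hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

// Hypothetical helper class; not part of Jsoup or of the site's API.
final class PodcastApiSketch {

    // Endpoint as quoted in the comments (2018); it may have changed since.
    static final String ENDPOINT = "https://www.br.de/mediathek/podcast/api/podcasts";

    // Builds the request URI, percent-encoding both query values.
    static URI buildApiUri(String station, String limit) {
        return URI.create(ENDPOINT + "?station=" + encode(station) + "&limit=" + encode(limit));
    }

    // URLEncoder targets form encoding and writes a space as "+"; swap it for "%20".
    // A literal "+" in the input is already encoded as "%2B", so the replace is safe.
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // Downloads the raw response body; throws if the server does not answer 200.
    static String fetchBody(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Accept", "application/json")
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }
}
```

Calling `PodcastApiSketch.fetchBody(PodcastApiSketch.buildApiUri("Bayern 1", "nolimit"))` should return the response as a string, which can then be passed to whichever JSON parser is already on the classpath. The structure of the response is not assumed here.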

  • Looking for this? https://www.br.de/mediathek/podcast/api/podcasts?station=Bayern%201&limit=nolimit (found via browser's Network tab) –  Feb 14 '18 at 13:08
  • If you're getting only the HTML markup and the site is built with JS, you won't see anything. You need a more robust solution, for instance a headless browser like PhantomJS, that can load the full page including the scripts, interpret it all, then output the whole result. – Jeremy Thille Feb 14 '18 at 13:09
  • @Chris G wow, thanks. I was updating an old program and didn't even know they have an API. I'll check it out. –  Feb 14 '18 at 13:20
  • Just try simple JavaScript or jQuery Ajax requests to extract the HTML. That lets the website execute its scripts and populate the divs; then use Jsoup to parse the HTML. – Prince Arora Feb 14 '18 at 16:27

1 Answer

If I understood your question correctly, your best bet is to use Selenium WebDriver. Link to similar question

SergejV
  • Thanks for the link. Now I know why it didn't work. I think I'll use their API for this specific problem though. –  Feb 14 '18 at 15:27