1

I'm trying to scrape some information from a webpage. My problem is that the response I get doesn't contain what I'm looking for.

If I view the page source, I find an empty section:

<section id="player-controller">
</section>

But if I inspect the elements I want data from, they appear inside that section.

Since the content is generated dynamically, I tried using HtmlUnit, but I still can't get it. Maybe I'm looking at this the wrong way.

Is there any way I can get the generated markup with HtmlUnit, or should I use a different tool?

Solved

By using HtmlUnit and pausing for a short time before printing the page, I got it to print the missing content:

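import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

WebClient webclient = new WebClient();
HtmlPage currentPage = webclient.getPage("https://www.dubtrack.fm/join/chilloutroom");
// Give the page's JavaScript time to fill the section before serializing the DOM
Thread.sleep(2000);
System.out.println(currentPage.asXml());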
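A fixed two-second sleep can be too short on a slow connection and wastefully long on a fast one. As a rough sketch only, the wait can be turned into a polling loop that checks repeatedly until the content shows up or a timeout expires. The class `PollUntil`, the method `waitFor`, and the `fakePage` supplier are hypothetical names for illustration; in the real program the supplier would be something like `() -> currentPage.asXml()`.

```java
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class PollUntil {
    /**
     * Calls source repeatedly until ready accepts the value or the timeout expires.
     * Returns the accepted value, or an empty Optional on timeout or interruption.
     * The source is assumed to return non-null values.
     */
    public static <T> Optional<T> waitFor(Supplier<T> source, Predicate<T> ready,
                                          long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            T value = source.get();
            if (ready.test(value)) {
                return Optional.of(value);
            }
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty();
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Hypothetical stand-in for currentPage.asXml(): the section is empty
        // for the first 300 ms and filled in afterwards.
        Supplier<String> fakePage = () -> System.currentTimeMillis() - start < 300
                ? "<section id=\"player-controller\"></section>"
                : "<section id=\"player-controller\"><p>inner text</p></section>";

        Optional<String> xml = waitFor(fakePage, s -> s.contains("<p>"), 2000, 50);
        System.out.println(xml.orElse("timed out"));
    }
}
```

The loop returns as soon as the predicate is satisfied, so the 2000 ms value becomes an upper bound instead of a fixed delay.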
ipop

2 Answers

0

If you examine the text of the page as it is first loaded, the dynamic content will not be there yet. The JavaScript in callScraper.html opens another page and then waits two seconds before reading the contents of the HTML element. Timing can be tricky here. I hope the following code helps.

callScraper.html

<!DOCTYPE html>
<html>
<head>
<title>Call test for scraping</title>
<meta charset="UTF-8" />
<script>
var newWindow;
var contents;
function timed() {
contents.value = contents.value + "\r\n" +"function timed started" + "\r\n";
contents.value = contents.value + "\r\n" + newWindow.document.getElementById("player-controller").innerHTML;
}
function starter() {
// alert("Running starter");
contents = document.getElementById("contents");
newWindow = window.open("scraper.html");
contents.value = contents.value + "\r\nTimer started\r\n";
setTimeout(timed, 2000);
}
window.onload=starter;
</script>
</head>
<body>
<p>This will open another page and then display an element from that page.</p>
<form name="reveal">
<textarea id="contents" cols="50" rows="50"></textarea>
</form>
</body>
</html>

scraper.html

<!DOCTYPE html>
<html>
<head>
<title>Test for scraping</title>
<meta charset="UTF-8" />
<script>
var section;
function starter() {
section = document.getElementById("player-controller");
// alert(":"+section.innerHTML+";");
section.innerHTML = "<p>inner text</p>";
// alert(":" +section.innerHTML + ":");
}
window.onload = starter;
</script>
</head>
<body>
<p>See http://stackoverflow.com/questions/37513393/scrapping-data-from-webpage-java-htmlunit</p>
<section id="player-controller">

</section>
</body>
</html>
Bradley Ross
  • Your idea worked. I implemented it in Java, calling the page and waiting a few seconds before printing the code. – ipop May 30 '16 at 09:39
-1

You can try jsoup for the case you describe:

"inspect the elements I want data from, they appear inside that section generated dynamically"

The API allows extracting and manipulating data using the best of DOM, CSS, and jQuery-like methods. You may need to perform some actions before the data is loaded via AJAX.

ekostadinov
  • I've tried jsoup too. From what I've understood, it doesn't support JavaScript/AJAX, which I'm guessing is what the page uses to fill in the blanks. I'm trying to send GET requests for the data, and it seems to be working at first; I still need to test it a little more. – ipop May 29 '16 at 18:52
  • Looks like a combination with [headless browser](http://stackoverflow.com/questions/16852660/how-to-scrape-ajax-loaded-content-with-jsoup) might do the trick. – ekostadinov May 29 '16 at 19:37
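The comment above mentions requesting the data directly with GET instead of rendering the page. A minimal sketch of that idea, using the `java.net.http` client that ships with Java 11 and later, might look like the following. The class `DataEndpointFetch` and its methods are hypothetical, and the real endpoint URL is not given in this thread; it would have to be found in the browser's network tab and passed in as an argument.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class DataEndpointFetch {
    /** Builds a GET request that asks for JSON; the URL is supplied by the caller. */
    public static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    /** Sends the request and returns the response body, failing on a non-200 status. */
    public static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: java DataEndpointFetch <endpoint-url>");
            return;
        }
        // The endpoint is an assumption: pass the URL the page itself requests.
        System.out.println(fetch(args[0]));
    }
}
```

If the endpoint returns JSON, this avoids running the page's JavaScript at all, at the cost of depending on a request format the site may change without notice.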