I am trying to scrape a website in Java to extract some percentages from this table.

These percentages are not in the initial HTML source; they are filled in afterwards by JavaScript, which makes scraping harder.

So this is the difference between the element BEFORE being rendered:

<div class="user_forecasts" id="57464" />

and AFTER being rendered:

<div class="user_forecasts" id="57464"> <b>1</b>
  <div class="percents">61% | 25% | 14%</div>
</div>

I want to get the "61% | 25% | 14%" string, along with the rest of the percentages in the table.

I confirmed that the content is rendered by JavaScript: I found the .js file, and this is the relevant part:

// ajax user_forecast load - one call
if ($('div.user_forecasts').length > 0) {
  $.ajax({
    url: '/vote/percentage',
    global: false,
    type: 'GET',
    data: {
      a: $('#jornadaq').val()
    },
    success: function(percentages) {
      perc_obj = eval(percentages);
      $('div.user_forecasts').each(function(ind, val) {
        if (ind == 14) {
          $(this).html("<b>" + perc_obj[ind].value + "</b><div class='percents'>" + perc_obj[ind].porcent + "%" + "</div>");
        } else {
          $(this).html("<b>" + perc_obj[ind].forecast + "</b><div class='percents'>" + perc_obj[ind].local + "% | " + perc_obj[ind].tie + "% | " + perc_obj[ind].visitor + "%" + "</div>");
        }
      });
    }
  });
}

As you can see, it's an AJAX call. I checked whether I could get the percentages by pasting this code into the Chrome DevTools console, and I did: I got the group of elements containing the data I need for my program.

Please see this screenshot (Chrome DevTools console).

The thing is, I don't know how to make this XMLHttpRequest from Java and then read the data. Which libraries do you recommend for this, and how would I use them specifically for this case?

Javier
  • Java and Javascript are two completely separate languages - there's no Java in your code, so I assume you mean javascript every time you write Java? – Jaromanda X Feb 08 '18 at 21:57
  • Check out headless browsers (for Java) - for example [this question](https://stackoverflow.com/questions/11634747/headless-browser-with-full-javascript-support-for-java) – James Feb 08 '18 at 21:57
  • @JaromandaX I know haha, I'm trying to look at this Javascript Ajax Call to get an idea to do it in Java (via Eclipse). The thing is to send this request from Java (as it's done with JavaScript on the website) – Javier Feb 08 '18 at 22:02
  • Oh, right - carry on :p – Jaromanda X Feb 08 '18 at 22:05

1 Answer

From Java, you would request the GET URL "/vote/percentage" just like any other HTML page and parse the JSON result that comes back. There are many ways of doing this, and it looks like you are already doing it (fetching an HTML page from a URL for scraping), so you can use the same method to fetch this URL.

The only difference between calling this URL and calling a URL for an HTML page is the format of the data that comes back: the former returns JSON, the latter HTML.
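As a rough illustration of the above, here is a minimal sketch that uses only the Java standard library (Java 11 or later). Several things in it are assumptions, not facts from the site: the base URL `https://example.com`, the class and method names, the exact headers (the `X-Requested-With` value is what jQuery normally sends with `$.ajax`; `Referer` and `Accept` are guesses to be checked against the request shown in the browser's network tab), and the shape of the JSON. The field names `local`, `tie`, `visitor` and `porcent` are taken from the site's script. The regex-based parsing is only a stand-in for a real JSON library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PercentageFetcher {

    // Matches one flat JSON object such as {"local":61,"tie":25,"visitor":14}.
    private static final Pattern OBJECT = Pattern.compile("\\{[^{}]*\\}");

    // Prepares (but does not yet send) the GET request made by the site's $.ajax call.
    // baseUrl and jornada are supplied by the caller; the real values are not in the question.
    static HttpURLConnection openRequest(String baseUrl, String jornada) throws IOException {
        String query = "a=" + URLEncoder.encode(jornada, StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection)
                URI.create(baseUrl + "/vote/percentage?" + query).toURL().openConnection();
        conn.setRequestMethod("GET");
        // Do not follow redirects, so a bounce to the main page shows up as a 3xx status.
        conn.setInstanceFollowRedirects(false);
        // Assumed headers: copy the real ones from the browser's network tab.
        conn.setRequestProperty("X-Requested-With", "XMLHttpRequest");
        conn.setRequestProperty("Referer", baseUrl + "/");
        conn.setRequestProperty("Accept", "application/json, text/javascript, */*");
        return conn;
    }

    // Sends the request and returns the response body as text.
    static String readBody(HttpURLConnection conn) throws IOException {
        int status = conn.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            throw new IOException("Expected 200 but got " + status
                    + "; compare the headers with the browser's request");
        }
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Returns the value of a field inside one flat JSON object, or null if absent.
    static String field(String object, String name) {
        Matcher m = Pattern.compile("\"" + Pattern.quote(name) + "\"\\s*:\\s*\"?([^\",}]*)")
                .matcher(object);
        return m.find() ? m.group(1).trim() : null;
    }

    // Builds the same strings the page's script writes into div.percents.
    static List<String> parsePercentages(String json) {
        List<String> rows = new ArrayList<>();
        Matcher m = OBJECT.matcher(json);
        while (m.find()) {
            String obj = m.group();
            String local = field(obj, "local");
            if (local != null) {
                rows.add(local + "% | " + field(obj, "tie") + "% | " + field(obj, "visitor") + "%");
            } else if (field(obj, "porcent") != null) {
                rows.add(field(obj, "porcent") + "%");
            }
        }
        return rows;
    }
}
```

With a real base URL and the value of the page's `#jornadaq` field, `parsePercentages(readBody(openRequest(baseUrl, jornada)))` would return one string per table row. If the status is a redirect rather than 200, the request is probably missing a header or cookie that the browser sends.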

Ari Singh
  • You'll have to parse lines like "$.ajax({ . url: '/vote/percentage' ..." that are there in the html or js file and extract the ajax url e.g. /vote/percentage in this case. Just like you would parse the link in html. – Ari Singh Feb 15 '18 at 00:57
  • Yep, the thing is that I tried using the URL "vote/percentage" and what I got is a redirection to the main page, so I cannot parse any JSON :( – Javier Feb 15 '18 at 19:58
  • If the browser can do it, you can do it. That URL might be checking something like the Referer or something else in the headers. The browser's developer tools should help you figure out what the URL needs. – Ari Singh Feb 15 '18 at 20:31
  • Yes, that was the problem: I hadn't set the HTTP headers. I run into these kinds of problems because I'm new to network programming. Thanks! – Javier Feb 17 '18 at 19:20
  • Solved. It was because of the headers. I declared the headers and it worked. – Javier May 06 '18 at 12:45