I want to access this webpage: https://www.google.com/trends/explore#q=ice%20cream and extract the data shown in the center line graph. The HTML is (I have pasted only the part I use):

  <div class="center-col">
       <div class="comparison-summary-title-line">...</div>
       ...
       <div id="reportContent" class="report-content">
            <!-- This tag handles the report titles component -->
       ...
       <div id="report">
         <div id="reportMain">
           <div class="timeSection">
              <div class = "primaryBand timeBand">...</div>
                  ...
                 <div aria-lable = "one-chart" style = "position: absolute; ...">
                 <svg ....>
                 ...
                 <script type="text/javascript">
                 var chartData = {...}

The data I need is stored in the script element (the last line). My idea is to get the element with class "report-content" first and then select the script inside it. My code is as follows:

  String html = "https://www.google.com/trends/explore#q=ice%20cream";
  Document doc = Jsoup.connect(html).get();

  Elements center = doc.getElementsByClass("center-col");
  Elements report = doc.getElementsByClass("report-content");

  System.out.println(center);
  System.out.println(report);

When I print the "center-col" elements, I get the content of all the child elements except "report-content", and when I print "report-content", the result is only:

      <div id="reportContent" Class="report-content"></div>

I also tried this:

  Element report = doc.select("div.report-content").first();

but it still does not work. How can I get the data in the script? I appreciate your help!

dimo414
beepretty
  • See [Fetch contents(loaded through AJAX call) of a web page](http://stackoverflow.com/questions/20633294/fetch-contentsloaded-through-ajax-call-of-a-web-page). –  Apr 19 '16 at 04:16

2 Answers

1

Try this URL instead:

https://www.google.com/trends/trendsReport?hl=en&q=${keywords}&tz=${timezone}&content=1

where

  • ${keywords} is a URL-encoded, space-separated list of keywords
  • ${timezone} is a URL-encoded timezone in the Etc/GMT* form

SAMPLE CODE

// Needs: java.net.URLEncoder, org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element
String myKeywords = "ice cream";
String myTimezone = "Etc/GMT+2";

String url = "https://www.google.com/trends/trendsReport?hl=en&q=" + URLEncoder.encode(myKeywords, "UTF-8") + "&tz=" + URLEncoder.encode(myTimezone, "UTF-8") + "&content=1";

Document doc = Jsoup.connect(url).timeout(10000).get();
Element scriptElement = doc.select("div#TIMESERIES_GRAPH_0-time-chart + script").first();

if (scriptElement==null) {
   throw new RuntimeException("Unable to locate trends data.");
}

String jsCode = scriptElement.html();
// parse jsCode to extract chartData...
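
The last step above (parsing jsCode) is left open. One possible approach, as a rough sketch only: locate the `var chartData = {` assignment and scan forward to the matching closing brace, skipping braces that occur inside string literals. The class name `ChartDataExtractor` and the method `extractObjectLiteral` are hypothetical names made up for this illustration, and the sample input is invented rather than real Trends output. The scan assumes the script contains no comments or regex literals with stray braces, and the extracted text may still not be strict JSON.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class ChartDataExtractor {

    /**
     * Returns the text of the object literal assigned to "var varName = {...}",
     * or null if the assignment is missing or its braces are unbalanced.
     * Assumption: the script has no comments or regex literals containing braces.
     */
    static String extractObjectLiteral(String js, String varName) {
        Matcher m = Pattern
                .compile("\\bvar\\s+" + Pattern.quote(varName) + "\\s*=\\s*\\{")
                .matcher(js);
        if (!m.find()) {
            return null;
        }
        int start = m.end() - 1; // index of the opening brace
        int depth = 0;
        char quote = 0;          // 0 while outside a string literal
        for (int i = start; i < js.length(); i++) {
            char c = js.charAt(i);
            if (quote != 0) {
                if (c == '\\') {
                    i++;         // skip the escaped character
                } else if (c == quote) {
                    quote = 0;   // end of the string literal
                }
            } else if (c == '"' || c == '\'') {
                quote = c;       // start of a string literal
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    return js.substring(start, i + 1);
                }
            }
        }
        return null; // ran out of input before the braces balanced
    }

    public static void main(String[] args) {
        // Invented sample input; the real script content will differ.
        String jsCode = "var chartData = {\"rows\": [{\"label\": \"a}b\", \"value\": 42}]}; draw(chartData);";
        System.out.println(extractObjectLiteral(jsCode, "chartData"));
    }
}
```

With the jsCode from the sample code above, the call would be `extractObjectLiteral(jsCode, "chartData")`. If the result turns out to be valid JSON it can be handed to a JSON parser; otherwise it needs further cleanup first.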

Stephan
0

Try getting the element by its id instead; you should get the complete tag.

KP.
  • Thank you! When I use "doc.select(div.reportContent)", which is the id, the result is null. When I use "doc.select(div.report-content)", which is the class, the result is without the content. And then I cannot get the script within this class either. – beepretty Apr 19 '16 at 04:06
  • did you try getElementById("reportMain")? – KP. Apr 19 '16 at 07:54