
I am using the Jsoup library to extract data from this site: Tom's Hardware benchmarks

I use this code to connect to the site and extract the data:

  protected Void doInBackground(Object[] params) {
      try {
          // get() throws IOException, so the connection must be made inside the try block
          doc = Jsoup.connect(url)
                  .maxBodySize(Integer.MAX_VALUE)
                  .header("Accept-Encoding", "gzip")
                  .userAgent("Dalvik")
                  .method(Connection.Method.GET)
                  .timeout(Integer.MAX_VALUE)
                  .get();

          if (doc != null) {
              // Benchmark labels (left column)
              css_text = doc.select("div[class=clLeft] label[for]");
              for (int i = 0; i < css_text.size(); i++)
                  elem1[i] = css_text.eq(i).text();

              // CPU names and scores (right column)
              css_text = doc.select("ul[style=margin-left:0px;] span");
              css_score = doc.select("div[class=clRight clearfix]");

              for (int j = 0; j < css_text.size(); j++) {
                  elem2[j] = css_text.eq(j).text();
                  score[j] = css_score.eq(j).text();

                  processori_score_arraylist.add(elem1[j] + "\n" + elem2[j] + "   " + score[j]);
              }
          }
      } catch (IOException e) {
          e.printStackTrace();
      }

      return null;
  }

  @Override
  protected void onPostExecute(Void aVoid) {
      super.onPostExecute(aVoid);

      processori_score_listview.setAdapter(adapter);
  }
      }

  }

I read that Jsoup has a default body-size limit of 1MB, which can truncate the downloaded page. This page does not appear to be larger than 1MB, but even after setting a larger value myself it does not always work. For a reason I don't understand, when I inspect the `doc` Document variable in debug mode, sometimes the whole page has been downloaded and sometimes it has not. I then tried changing the `maxBodySize` value to 0 and to `Integer.MAX_VALUE`, and also the `timeout` value, after reading other posts and searching the Internet, but that did not solve the problem. Can anybody suggest the cause of the problem or a solution? I hope the problem is clear; if not, I am happy to clarify.

Other posts about this problem that I found:

jsoup don't get full data

JSOUP not downloading complete html if the webpage is big in size. Any alternatives to this or any workarounds?

Here is the HTML page as it arrives, truncated:

 <!doctype html>
    <html>
     <head> 
      <meta name="ROBOTS" content="NOINDEX, NOFOLLOW"> 
      <meta http-equiv="cache-control" content="max-age=0"> 
      <meta http-equiv="cache-control" content="no-cache"> 
      <meta http-equiv="expires" content="0"> 
      <meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT"> 
      <meta http-equiv="pragma" content="no-cache"> 
      <meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?Ref=/charts/cpu-charts-2015/-01-CinebenchR15,Marque_fbrandx14,3693.html&amp;distil_RID=1CB642F0-76B5-11E5-9B22-93799C16BE3F&amp;distil_TID=20151019225954"> 
      <script type="text/javascript">
        (function(window){
            try {
                if (typeof sessionStorage !== 'undefined'){
                    sessionStorage.setItem('distil_referrer', document.referrer);
                }
            } catch (e){}
        })(window);
    </script> 
      <script type="text/javascript" src="/destilar-fbxcdbtcwcebrsxtw.js" defer></script>
      <style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#ssxfwzexyqctzdfy{display:none!important}</style>
     </head> 
     <body> 
      <div id="distil_ident_block">
       &nbsp;
      </div>   
     </body>
    </html>
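Incidentally, the markup above appears to be an interstitial challenge page from an anti-bot service rather than a truncated copy of the chart page (note the meta refresh pointing at `/distil_r_captcha.html`). One way to tell the two cases apart is a simple check on the downloaded HTML; this is only a sketch, and the helper name `looksLikeDistilChallenge` is hypothetical:

```java
// Heuristic check: does the downloaded HTML look like the anti-bot
// challenge page instead of the real benchmark page?
public class DistilCheck {

    static boolean looksLikeDistilChallenge(String html) {
        if (html == null) return false;
        String lower = html.toLowerCase();
        // The challenge page redirects to /distil_r_captcha.html via a meta
        // refresh and contains a div with id "distil_ident_block"
        return lower.contains("distil_r_captcha")
                || lower.contains("distil_ident_block");
    }

    public static void main(String[] args) {
        String challenge =
            "<meta http-equiv=\"refresh\" content=\"10; url=/distil_r_captcha.html\">";
        String normal =
            "<html><body><div class=\"clLeft\">Cinebench R15</div></body></html>";

        System.out.println(looksLikeDistilChallenge(challenge)); // true
        System.out.println(looksLikeDistilChallenge(normal));    // false
    }
}
```

If the check fires, retrying immediately will usually just return the same challenge page, so it is worth treating it as a distinct case from a network error.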
Paul91

1 Answer


Some reasons I see:

1) Code readiness

Don't forget to clean up your code. It looks quite messy in your question, and the strange behavior may be hiding there.

2) Random downtime

The code may be hitting some random downtime on the server side. In your case, I would strengthen the error handling.

Document doc = null;

try {
    doc = Jsoup.connect(url) //
           .timeout(0) // Relax the server by giving it unlimited time...
           .maxBodySize(0) // We don't know the size of the server response...
           .header("Accept-Encoding", "gzip") //
           .userAgent("Dalvik") //
           .get();

    // * Extract data from doc
    // If something is missing, raise an exception
    // or write code that can accommodate the missing data

} catch (Throwable t) {
   // Using Throwable may seem extreme here, however you'll quickly see what's going on

   // Carefully log what happened
   log.error("Something BAD happened...", t);

   // Ultimately, if something is present in the document, dump it for later investigation
   if (doc != null) {
      dump(doc.outerHtml());
   }
}

3) Website protection

Some websites have clever anti-web-scraping protection. So when you fetch URLs, do it slowly: make the code pause for a random interval between 3000 and 5000 milliseconds before each fetch, so the traffic looks more human. You can also use a proxy to change your IP address.
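A minimal sketch of such a randomized pause, using only `java.util.Random` and `Thread.sleep` (the 3000-5000 ms range is the one suggested above; the class and method names are just for illustration):

```java
import java.util.Random;

// Sleep for a random interval before each fetch so that request
// timing looks less mechanical to anti-scraping systems.
public class PoliteFetcher {

    private static final Random RANDOM = new Random();

    // Returns a random pause in [minMs, maxMs)
    static long randomPauseMs(long minMs, long maxMs) {
        return minMs + (long) (RANDOM.nextDouble() * (maxMs - minMs));
    }

    // Call this before each Jsoup.connect(...).get()
    static void politePause() throws InterruptedException {
        Thread.sleep(randomPauseMs(3000, 5000));
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            long p = randomPauseMs(3000, 5000);
            // Every generated pause stays inside the requested range
            System.out.println(p >= 3000 && p < 5000); // true
        }
    }
}
```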

Stephan