I've been trying to get a String from a website by using Java. Here is my code for it:

protected String doInBackground(String... urls) {
    try {
        // Fetch the schedule page and read the text of the first "productionsDate" element.
        gotten_next_date = Jsoup.connect("https://www.vividseats.com/nba-basketball/toronto-raptors-schedule.html")
                .get().getElementsByClass("productionsDate").first().text();
        full_next = gotten_next_date;

        return full_next;
    } catch (IOException e) {
        return "Unable to retrieve data. URL may be invalid.";
    }
}

I wrote this yesterday and it worked perfectly, but when I ran it again today it threw this error:

java.lang.NullPointerException: Attempt to invoke virtual method 'java.lang.String org.jsoup.nodes.Element.text()' on a null object reference

I don't understand why that is happening. Can somebody help?

EDIT: I believe the error is caused not by how the variable is created, but by the Element not being received from the website. I think this question is wrongly marked as a duplicate.

pantank14

  • Possible duplicate of [What is a NullPointerException, and how do I fix it?](https://stackoverflow.com/questions/218384/what-is-a-nullpointerexception-and-how-do-i-fix-it) – Geno Chen Feb 16 '19 at 18:38
  • @GenoChen My problem is not the same as the possible duplicate, because it's not about creating a variable, but about not receiving a certain variable from a website. – pantank14 Feb 16 '19 at 18:46

1 Answer

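What you did should work fine. I ran it once and it worked, but then it stopped working.

The problem is that the website has an anti-scraping mechanism that blocks you if you make too many requests. Once you are blocked, the page you get back contains no element with the class productionsDate, so first() returns null and calling text() on it throws the NullPointerException.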

What I would recommend is:

  1. Add userAgent() to the Jsoup connection so that you identify yourself as a bot/scraper.
  2. Read their Terms of Service to check whether you are allowed to scrape their site.
  3. Send them an email explaining your intentions and asking whether they are okay with you scraping parts of their site.
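
To make the first recommendation concrete, here is a minimal sketch of what identifying yourself looks like at the HTTP level. It deliberately uses the JDK's HttpURLConnection rather than Jsoup so that it runs without extra dependencies; Jsoup's userAgent() sets the same User-Agent request header. The class name, the bot name and the contact address are placeholders I made up, not anything the site requires.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class IdentifiedRequest {
    // Hypothetical identifier: replace with your own bot name and contact address.
    static final String USER_AGENT = "raptors-schedule-bot/1.0 (+mailto:you@example.com)";

    /** Prepares a GET request that carries the given User-Agent header, without sending it. */
    static HttpURLConnection prepare(String url, String userAgent) {
        try {
            // openConnection() only creates the connection object; no network traffic happens yet.
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = prepare(
                "https://www.vividseats.com/nba-basketball/toronto-raptors-schedule.html",
                USER_AGENT);
        // Show the header that would be sent; the request itself is not executed here.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

An honest User-Agent does not by itself lift a block; it only makes it possible for the site owner to recognise and contact you, which is why points 2 and 3 still matter.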

By the way, if you want to debug what is happening, what I did was simply change the Jsoup call to:

String gotten_next_date =
        Jsoup.connect("https://www.vividseats.com/nba-basketball/toronto-raptors-schedule.html").get().html();

This returns the HTML of the requested page which, if you look at it, contains nothing of interest:

<!doctype html>
<html>
 <head> 
  <meta NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> 
  <meta http-equiv="cache-control" content="max-age=0"> 
  <meta http-equiv="cache-control" content="no-cache"> 
  <meta http-equiv="expires" content="0"> 
  <meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT"> 
  <meta http-equiv="pragma" content="no-cache"> 
  <meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?requestId=291c6193-eb12-4e96-b1cd-23ba9a75e659&amp;httpReferrer=%2Fnba-basketball%2Ftoronto-raptors-schedule.html"> 
  <script type="text/javascript">
    (function(window){
        try {
            if (typeof sessionStorage !== 'undefined'){
                sessionStorage.setItem('distil_referrer', document.referrer);
            }
        } catch (e){}
    })(window);
</script> 
  <script type="text/javascript" src="/vvdstsdstl.js" defer></script>
  <style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#twsyxyabbqdwrxzyzxesxywvwuzbszeeacwd{display:none!important}</style> 
  <script>var w=window;if(w.performance||w.mozPerformance||w.msPerformance||w.webkitPerformance){var d=document;AKSB=w.AKSB||{},AKSB.q=AKSB.q||[],AKSB.mark=AKSB.mark||function(e,_){AKSB.q.push(["mark",e,_||(new Date).getTime()])},AKSB.measure=AKSB.measure||function(e,_,t){AKSB.q.push(["measure",e,_,t||(new Date).getTime()])},AKSB.done=AKSB.done||function(e){AKSB.q.push(["done",e])},AKSB.mark("firstbyte",(new Date).getTime()),AKSB.prof={custid:"632139",ustr:"",originlat:"0",clientrtt:"124",ghostip:"72.247.179.76",ipv6:false,pct:"10",clientip:"79.119.120.57",requestid:"418cf776",region:"26128",protocol:"",blver:14,akM:"b",akN:"ae",akTT:"O",akTX:"1",akTI:"418cf776",ai:"275708",ra:"false",pmgn:"",pmgi:"",pmp:"",qc:""},function(e){var _=d.createElement("script");_.async="async",_.src=e;var t=d.getElementsByTagName("script"),t=t[t.length-1];t.parentNode.insertBefore(_,t)}(("https:"===d.location.protocol?"https:":"http:")+"//ds-aksb-a.akamaihd.net/aksb.min.js")}</script> 
 </head> 
 <body> 
  <div id="distilIdentificationBlock">
   &nbsp;
  </div>   
 </body>
</html>

Update: (from zack6849) If you look closely inside the head tag, the last meta tag hints that you are being redirected to a captcha page:

<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?requestId=291c6193-eb12-4e96-b1cd-23ba9a75e659&amp;httpReferrer=%2Fnba-basketball%2Ftoronto-raptors-schedule.html"> 

If you also search for distilIdentificationBlock, the id of the only div in the returned HTML, you will see that it is associated with scrapers being blocked.
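
If you want your code to notice this situation instead of crashing, one option is to test the returned HTML for the two markers described above before looking for the date. The following is only a sketch under assumptions: the class and method names are hypothetical, and it uses plain string and regex matching on the HTML text so that it runs without Jsoup. With a Jsoup Document, a check such as doc.getElementById("distilIdentificationBlock") != null would probably be the more natural form.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlockPageDetector {
    // Matches <meta http-equiv="refresh" content="N; url=TARGET"> and captures TARGET.
    private static final Pattern META_REFRESH = Pattern.compile(
            "<meta\\s+http-equiv=\"refresh\"\\s+content=\"\\d+;\\s*url=([^\"]+)\"",
            Pattern.CASE_INSENSITIVE);

    /** Returns the meta-refresh target if the page has one, with &amp; decoded to &. */
    static Optional<String> refreshTarget(String html) {
        Matcher m = META_REFRESH.matcher(html);
        return m.find() ? Optional.of(m.group(1).replace("&amp;", "&")) : Optional.empty();
    }

    /** True when the HTML looks like the block page rather than the real schedule page. */
    static boolean looksBlocked(String html) {
        return html.contains("distilIdentificationBlock")
                || refreshTarget(html).map(t -> t.contains("captcha")).orElse(false);
    }

    public static void main(String[] args) {
        // Shortened stand-in for the block page; the requestId value is made up.
        String html = "<html><head><meta http-equiv=\"refresh\" "
                + "content=\"10; url=/distil_r_captcha.html?requestId=abc&amp;httpReferrer=%2Fx\">"
                + "</head><body><div id=\"distilIdentificationBlock\"></div></body></html>";
        System.out.println(looksBlocked(html));                   // prints: true
        System.out.println(refreshTarget(html).orElse("(none)")); // prints: /distil_r_captcha.html?requestId=abc&httpReferrer=%2Fx
    }
}
```

When looksBlocked returns true, doInBackground could return a message such as "Blocked by the site" instead of calling first().text() on a page that has no productionsDate element.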

Hope it helps you get a better understanding of what is happening.

Andrei Sfat