
I'm creating an application that fetches values from a specific website and prints them to the console. The value is in a <span> element, and I'm using Jsoup.

My challenge has to do with this error:

Error fetching URL

Here is my Java code:

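import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TestSl {
    public static void main(String[] args) throws IOException {
        // Download and parse the page (this is the call that fails with the error below).
        Document doc = Jsoup.connect("https://stackoverflow.com/questions/11970938/java-html-parser-to-extract-specific-data").get();
        // Select every <span> whose class attribute is exactly "hidden-text".
        Elements spans = doc.select("span[class=hidden-text]");
        for (Element span : spans) {
            System.out.println(span.text());
        }
    }
}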

And here is the error on Console:

Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=Java Html parser to extract specific data?
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:590)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:540)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:227)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:216)
    at TestSl.main(TestSl.java:19)

What am I doing wrong and how can I resolve it?

nyedidikeke
PICKAB00
  • The 403 Forbidden error is an HTTP status code meaning that the server understood the request but refuses to grant access to the page or resource you were trying to reach. – ryekayo Apr 21 '16 at 20:43
  • So basically, there is no way I could fetch that data? Maybe using some alternative? Or is it that the server/website does not allow any HTML parsers to fetch the data? – PICKAB00 Apr 21 '16 at 20:46
  • Not sure if the website allows you to use HTML parsers. But the HTML parser most likely works over port 443 or 80, so I don't think that would be the case. It might be the way you are implementing the code. – ryekayo Apr 21 '16 at 20:51
  • Thank you. I have one more issue. I tried with Google (again, a span and its class name). I do not get the error, but there is no result on my console. I have re-read my code many times but could not figure out where I went wrong. Any suggestions for that? – PICKAB00 Apr 21 '16 at 20:55

1 Answer


Set the user-agent header:

.userAgent("Mozilla")

Example:

Document document = Jsoup.connect("https://stackoverflow.com/questions/11970938/java-html-parser-to-extract-specific-data").userAgent("Mozilla").get();
Elements elements = document.select("span.hidden-text");
for (Element element : elements) {
  System.out.println(element.text());
}

Stack Exchange

Inbox

Reputation and Badges

source: https://stackoverflow.com/a/7523425/1048340


Perhaps this is related: https://meta.stackexchange.com/questions/277369/a-terms-of-service-update-restricting-companies-that-scrape-your-profile-informa
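
Why the header matters can be sketched without touching any real site. The following is a minimal, dependency-free illustration that assumes a server rejecting requests whose User-Agent does not look like a browser; whether Stack Overflow filters this way is not confirmed here. It uses the JDK's built-in `HttpServer` and `HttpClient` (Java 11+) rather than Jsoup, and the class name `UserAgentSniffingDemo`, the `startsWith("Mozilla")` rule, and the response bodies are all hypothetical choices made for the demo.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class UserAgentSniffingDemo {

    // Starts a throwaway local server that answers 403 unless the User-Agent
    // starts with "Mozilla" (an assumed, deliberately simplistic filter), then
    // returns the status codes seen without and with that header.
    static int[] runDemo() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String ua = exchange.getRequestHeaders().getFirst("User-Agent");
            boolean looksLikeBrowser = ua != null && ua.startsWith("Mozilla");
            byte[] body = (looksLikeBrowser ? "ok" : "forbidden").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(looksLikeBrowser ? 200 : 403, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        try {
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .build();

            // No explicit User-Agent: the client sends its own default value.
            HttpRequest plain = HttpRequest.newBuilder(uri).GET().build();
            // Same request, but announcing itself as "Mozilla".
            HttpRequest browserLike = HttpRequest.newBuilder(uri)
                    .header("User-Agent", "Mozilla")
                    .GET()
                    .build();

            int withoutHeader = client.send(plain, HttpResponse.BodyHandlers.ofString()).statusCode();
            int withHeader = client.send(browserLike, HttpResponse.BodyHandlers.ofString()).statusCode();
            return new int[] {withoutHeader, withHeader};
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        int[] codes = runDemo();
        System.out.println("default User-Agent -> " + codes[0]);
        System.out.println("User-Agent Mozilla -> " + codes[1]);
    }
}
```

Jsoup's `.userAgent("Mozilla")` plays the same role as the `.header("User-Agent", "Mozilla")` call in this sketch: it only changes how the client identifies itself, so it helps only when the server's refusal depends on that header.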

Community
Jared Rummler
  • Thanks, it finally worked. Could you elaborate, please? Where did I go wrong? – PICKAB00 Apr 21 '16 at 21:02
  • Well, I am having one more issue. The Stack Overflow example works great, but I have another website from which I am not getting any results. I do not get the error anymore, but no values are printed to the console. https://www.binary.com/trading?l=EN On that page there is a span that holds numeric values, right next to the small graph. Its class changes as the value goes up and down, and it has an ID called "spot". I used both the class name and the ID in my code, but I get no results on my console. Could you suggest a reason why? – PICKAB00 Apr 21 '16 at 21:08
  • Perhaps StackOverflow is sniffing the user-agent. I know they are actively trying to prevent web scraping abuse at the moment. Here is some good advice: https://learn.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/ – Jared Rummler Apr 21 '16 at 21:09
  • If you have another question/problem I would suggest providing more details and creating a new post. :) – Jared Rummler Apr 21 '16 at 21:13
  • Yeah, I get that. I mean, web scraping could lead to a huge misunderstanding. But could you suggest why I get data from Stack Overflow and not from the Binary website? Is there a legitimate reason for that, or is their server denying access? That is the best I can do to explain it: I get the values in the Stack Overflow example, but I do not get any values from the Binary website, whether I use the class name or the ID. – PICKAB00 Apr 21 '16 at 21:15
  • I briefly looked at the site. The div is empty when you download the page. – Jared Rummler Apr 21 '16 at 21:17
  • I believe the div is empty because it keeps refreshing every second. Please tell me I am right about this. And if so, is there any way to fetch the value? – PICKAB00 Apr 21 '16 at 21:18
  • This should send you in the right direction: http://stackoverflow.com/a/7489380/1048340 Using Jsoup to parse a page that is updated by JavaScript won't work. – Jared Rummler Apr 21 '16 at 21:24
  • Could you please state whether the span on the Binary website is using JavaScript or not? Is that the reason? If so, then I will have to look for another option besides Jsoup. Thank you. – PICKAB00 Apr 21 '16 at 21:43
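
The point made in the comments above (the element is empty when the page is downloaded) can be shown with a small sketch. The markup below is hypothetical and not copied from the Binary website; it only assumes a span with the id `spot` that a script fills in after the page loads. A plain regex stands in for Jsoup to keep the example dependency-free, and the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StaticHtmlDemo {

    // Hypothetical markup: the span is empty in the HTML the server sends and
    // is only filled in later by a script running in a browser.
    static final String SERVED_HTML =
            "<html><body>"
            + "<span id=\"spot\"></span>"
            + "<script>document.getElementById('spot').textContent = '1234.56';</script>"
            + "</body></html>";

    // Returns the text between <span id="spot"> and </span> as it appears in
    // the raw HTML, or null if there is no such span. A regex is enough for
    // this toy input; it is not a substitute for a real HTML parser.
    static String spotTextIn(String html) {
        Matcher m = Pattern.compile("<span id=\"spot\">(.*?)</span>").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Nothing executes the <script>, so the span's text is still empty.
        System.out.println("spot text in served HTML: '" + spotTextIn(SERVED_HTML) + "'");
    }
}
```

A tool that only downloads and parses the HTML is in the same position as this sketch: the script is just text to it, so the value never appears. Getting the value would require something that actually runs the page's JavaScript, or finding the data source the script itself reads from.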