
I want to access Instagram pages without using the API. I need the number of followers, so simply downloading the page source is not enough, because the page is built dynamically.

I found HtmlUnit, a library that simulates a browser, so that the JS gets executed and I get back the content I want.

HtmlPage myPage = (HtmlPage) webClient.getPage("http://www.instagram.com/instagram");

This call, however, results in the following exception:

Exception in thread "main" com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException: 403 Forbidden for http://d36xtkk24g8jdx.cloudfront.net/bluebar/3a30db9/scripts/webfont.js

So it can't access that script, but if I'm interpreting this correctly, the script is only for font loading, which I don't need. I searched for how to make HtmlUnit ignore parts of the page, and found this thread.

webClient.setWebConnection(new WebConnectionWrapper(webClient) {
    @Override
    public WebResponse getResponse(final WebRequest request) throws IOException {
        if (request.getUrl().toString().contains("webfont")) {
            System.out.println(request.getUrl().toString());
            return super.getResponse(request);
        } else {
            System.out.println("returning response...");
            return new StringWebResponse("", request.getUrl());
        }
    }
});

With that code, the exception goes away, but the source (or page title, or anything else I've tried) seems to be empty. "returning response..." is printed once.
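One possible reading of this symptom is that the two branches in the wrapper above are swapped: URLs containing "webfont" are passed through to the network, while everything else, including the initial page request, gets the empty StringWebResponse. That would match "returning response..." being printed exactly once and the page coming back empty. The sketch below isolates just the URL decision in plain Java so the two variants can be compared without HtmlUnit; the class WebfontFilter and its method names are hypothetical, not part of HtmlUnit.

```java
// Hypothetical helper, not part of HtmlUnit: it only decides which
// request URLs should be answered with an empty stub response.
class WebfontFilter {

    // Intended behaviour: stub out only the font script, let the rest through.
    static boolean shouldStub(String url) {
        return url.contains("webfont");
    }

    // Behaviour of the condition as written in the question: the branches
    // are swapped, so every URL that does NOT contain "webfont" is stubbed.
    static boolean shouldStubAsInQuestion(String url) {
        return !url.contains("webfont");
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.instagram.com/instagram",
            "http://d36xtkk24g8jdx.cloudfront.net/bluebar/3a30db9/scripts/webfont.js"
        };
        for (String url : urls) {
            System.out.println(url);
            System.out.println("  intended:       stub = " + shouldStub(url));
            System.out.println("  as in question: stub = " + shouldStubAsInQuestion(url));
        }
    }
}
```

Under that assumption, getResponse would return new StringWebResponse("", request.getUrl()) when shouldStub is true and super.getResponse(request) otherwise. HtmlUnit also offers webClient.getOptions().setThrowExceptionOnFailingStatusCode(false), which may be enough on its own if the 403 is the only obstacle; whether the page then renders correctly is not something this sketch can confirm.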

I'm open to different approaches as well. Ultimately, the entire page source in a single string would be good enough for me, but I need the JS to execute.

Community
Innkeeper
  • Did you connect to Instagram programmatically? How did you do that? I already tried HtmlUnit, HttpClient, and URLConnect, but with no result. – Progs Sep 25 '15 at 21:59

1 Answer


HtmlUnit with JS is not a good solution here, because its JavaScript engine, Mozilla Rhino, fails on many JS-heavy pages and has a lot of problems.

You can use PhantomJS as a WebDriver instead:

PhantomJS

DevOps85