
I am trying to scrape data using Java from a site that uses Cloudflare Enterprise Package protection. I haven't been able to find much information about this DDoS protection system on the web, but here's what I believe is happening (from inspecting the HTTP responses and JavaScript):

  1. Client sends GET request to the server.
  2. Server determines that a specific cookie is absent from the GET request and returns an HTTP 503 response along with some HTML.
  3. The client's browser automatically runs the JavaScript in that response, solves a math problem, and sends a new GET request with the solution to that problem appended as a query string.
  4. Server responds with an HTTP 302 redirect response and the necessary cookie.
  5. Browser sends GET request with the correct cookie and the server gives an HTTP 200 response and all is well.

My question is about reading the initial response stream in Java. I create the connection, add the user agent, and try to open the stream. As expected, I receive the 503 response. However, Java treats this as an exception and will not let me access the HTML that I believe should be included in the body of this response. Does anyone know how to get the HTML? Or is it impossible to include HTML in a 503 response, and I have misunderstood what is going on?

Thanks!

G-Cam
  • You need help to crack a robot protection? Perhaps you could instead ask the site owners for permission to access their data? They might have an API. – mplungjan Nov 30 '15 at 06:47

1 Answer


If the response status is anything other than OK, you need to use .getErrorStream() to read the response body.

You can do it like this:

HttpURLConnection c = ....;
InputStream is;
if (c.getResponseCode() / 100 == 2) {
    is = c.getInputStream();  // 2xx: the body is on the normal input stream
} else {
    is = c.getErrorStream();  // otherwise use the error stream (may be null if there is no body)
}

// read your HTML from is
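
For completeness, here is a minimal sketch of reading the whole body into a String. The class name ErrorBodyReader, the helper readBody, the User-Agent value, and the UTF-8 charset are my own assumptions for illustration, not anything taken from the site in the question. It switches to the error stream for any status of 400 or above, since that is where getInputStream() starts throwing.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here only for illustration.
class ErrorBodyReader {

    // Returns the response body as text for both success and error statuses.
    static String readBody(HttpURLConnection c) throws IOException {
        int status = c.getResponseCode();
        // getInputStream() throws an IOException for 4xx/5xx responses,
        // so the body of a 503 has to come from getErrorStream().
        InputStream is = (status >= 400) ? c.getErrorStream() : c.getInputStream();
        if (is == null) {
            return ""; // getErrorStream() returns null when there is no body
        }
        try (InputStream in = is) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            // Assumption: the page is UTF-8; check the Content-Type header if unsure.
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java ErrorBodyReader <url>");
            return;
        }
        HttpURLConnection c = (HttpURLConnection) new URL(args[0]).openConnection();
        c.setRequestProperty("User-Agent", "Mozilla/5.0"); // placeholder value
        System.out.println("Status: " + c.getResponseCode());
        System.out.println(readBody(c));
    }
}
```

So a 503 can carry HTML like any other response; it is only HttpURLConnection that exposes it through a different stream.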
Krzysztof Cichocki