
NOTE: This is not a duplicate question. I'm aware of problems that looked identical to mine and were solved by adding some data to the request headers. That is not the case here: I've tried all of those solutions and none of them works. Tried: this question and this one; nothing seems to work.

I'm trying to read the contents of a website using Java. The URL is URL to fetch. There is no authentication involved, and no forms are filled in beforehand. I can open the URL in a cookie-free session and it still works in the browser. I can even fetch it with Selenium, but HttpClient keeps refusing to load it.
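For reference, the failing call is essentially the following. This is a minimal sketch, assuming Apache HttpClient 4.x (the client crawler4j builds on); the real target path is omitted, and www.hipercor.es is the host from the header dump below.

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class FetchExample {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet get = new HttpGet("https://www.hipercor.es/");
            // execute() is where the "Connection reset" / "Read timed out"
            // exceptions shown further down are thrown.
            try (CloseableHttpResponse response = client.execute(get)) {
                System.out.println(EntityUtils.toString(response.getEntity()));
            }
        }
    }
}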

The problem has nothing to do with certificates: right now I'm using a working "allow-all" trust manager, so that's not the issue.
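This is roughly that trust setup, so certificate validation can be ruled out (a sketch, again assuming Apache HttpClient 4.x; obviously not something to ship in production):

import javax.net.ssl.SSLContext;
import org.apache.http.conn.ssl.NoopHostnameVerifier;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.ssl.SSLContexts;

public class TrustAllClient {
    static CloseableHttpClient buildClient() throws Exception {
        // Trust every certificate chain and skip hostname verification,
        // so a bad certificate cannot be the cause of the failures.
        SSLContext sslContext = SSLContexts.custom()
                .loadTrustMaterial(null, (chain, authType) -> true)
                .build();
        return HttpClients.custom()
                .setSSLContext(sslContext)
                .setSSLHostnameVerifier(NoopHostnameVerifier.INSTANCE)
                .build();
    }
}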

I've inspected the headers my browser sends; nothing special:

Host: www.hipercor.es
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate, br
DNT: 1
Connection: keep-alive
Upgrade-Insecure-Requests: 1

As I said, I've already tried configuring the user agent to "fake" being Firefox.
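Concretely, the request mirrors the browser headers above (a sketch: Host is added automatically by HttpClient so it isn't set here, and I drop "br" from Accept-Encoding because HttpClient 4.x only decodes gzip/deflate out of the box):

import org.apache.http.client.methods.HttpGet;

class BrowserLikeRequest {
    static HttpGet build() {
        HttpGet get = new HttpGet("https://www.hipercor.es/");
        get.setHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0");
        get.setHeader("Accept",
                "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
        get.setHeader("Accept-Language", "es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3");
        // "br" dropped: HttpClient 4.x has no built-in Brotli decoding.
        get.setHeader("Accept-Encoding", "gzip, deflate");
        get.setHeader("DNT", "1");
        get.setHeader("Connection", "keep-alive");
        get.setHeader("Upgrade-Insecure-Requests", "1");
        return get;
    }
}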

Just to give some background: I'm building an enhanced version of crawler4j. My idea is to build a web scraper, and I found this issue while testing random shops that I know are currently being crawled by other scrapers in business environments.

Note that HEAD requests also fail.
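For completeness, this is the HEAD variant, which dies the same way (a sketch; in HttpClient 4.x a HEAD request is HttpHead):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpHead;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class HeadExample {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpHead head = new HttpHead("https://www.hipercor.es/");
            // Fails with the same two exceptions as the GET above.
            try (CloseableHttpResponse response = client.execute(head)) {
                System.out.println(response.getStatusLine());
            }
        }
    }
}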

The errors are either

java.net.SocketException: Connection reset

or

java.net.SocketTimeoutException: Read timed out

Please note that the browser loads the page perfectly, and so does Selenium (although it is slow as hell in that case).
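The Selenium route that does work looks roughly like this (a sketch, assuming the Java bindings with Firefox and geckodriver on the PATH):

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class SeleniumFetch {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver(); // assumes geckodriver on the PATH
        try {
            driver.get("https://www.hipercor.es/");
            // Loads fine, but far slower than a raw HTTP fetch would be.
            System.out.println(driver.getPageSource().length());
        } finally {
            driver.quit();
        }
    }
}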
