
This previous question dealt with the handling of self-signed certificates in Java:

Accept server's self-signed ssl certificate in Java client

The accepted answer offers two options: (1) manually load the relevant certificate into the local keystore, or (2) circumvent UrlConnection's security with a bespoke TrustManager.

In the context of a web crawler whose function is solely to extract content from remote https-secured sites, what specific risks arise from option 2?

And, assuming those risks are deemed unacceptable, what alternative exists, since it is not viable to manually extract the certificates and load them into the local keystore?

joedev
  • There is no alternative: if you want to read random untrusted content then you will need a truststore that doesn't check chains. The risks are less than for a normal user, as you say. But it does mean that your crawler can be trivially "tricked": anyone can pretend to be `google.com`, say, and provide you with arbitrary content. There is also always the risk of 0-day exploits in your crawler triggered by malicious responses. – Boris the Spider May 05 '21 at 20:11
  • There is an alternative if the remote sites have known certificates - create your own TrustManager instance and initialise it with your own keystore of trusted certificates. – Simon G. May 05 '21 at 21:55
  • @SimonG I'm still left with the problem of collecting and loading those certificates into the keystore. – joedev May 05 '21 at 21:59
  • @BoristheSpider If arbitrary content is the extent of the risk then it's probably acceptable. We are also crawling non secured end points so I guess we already face the risks you mention? – joedev May 05 '21 at 22:05
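The alternative Simon G. describes in the comments can be sketched as follows. The class name, keystore path, and password here are illustrative, and the sketch assumes the sites' self-signed certificates have already been imported into that keystore (e.g. with `keytool -importcert`):

```java
import java.io.FileInputStream;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

public class PinnedTrustStore {

    /** Build an SSLContext that trusts only the certificates in the given keystore. */
    public static SSLContext contextFor(String keystorePath, char[] password) throws Exception {
        KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
        try (FileInputStream in = new FileInputStream(keystorePath)) {
            ks.load(in, password);
        }
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(ks); // trust exactly the certificates in this keystore, nothing else
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, tmf.getTrustManagers(), null);
        return ctx;
    }
}
```

Each connection then opts in explicitly, e.g. `((HttpsURLConnection) url.openConnection()).setSSLSocketFactory(ctx.getSocketFactory())`, which leaves the JVM-wide defaults untouched.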

1 Answer


The risk arises not only from option 2 but also from option 1: there is no guarantee that the server your web crawler is crawling is actually the one you think it is. There are other risks, but they are not associated with the tasks a web crawler performs.

For your second question: you need to identify why option 2 is unacceptable, because it is very easy in Java to write code that simply accepts the self-signed certificate. What specifically prevents you from writing code that accepts the certificate? You could also use a proxy server that automatically accepts all certificates, but that is a separate topic and would be better raised as a new question.
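For reference, the "very easy to code" approach the answer alludes to looks roughly like this (class and method names are mine, not from the thread). It installs a TrustManager that accepts any chain, which is precisely what makes the impersonation risk from the comments possible:

```java
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class TrustAllCrawler {

    /**
     * WARNING: disables certificate validation entirely for HttpsURLConnection.
     * Acceptable only when every response is treated as untrusted content anyway.
     */
    public static void disableCertificateChecks() throws Exception {
        TrustManager[] trustAll = new TrustManager[] {
            new X509TrustManager() {
                public void checkClientTrusted(X509Certificate[] chain, String authType) { }
                public void checkServerTrusted(X509Certificate[] chain, String authType) { }
                public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
            }
        };
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, trustAll, new SecureRandom());
        HttpsURLConnection.setDefaultSSLSocketFactory(ctx.getSocketFactory());
        // Self-signed hosts frequently also fail hostname verification,
        // so that check usually has to be relaxed as well.
        HttpsURLConnection.setDefaultHostnameVerifier((hostname, session) -> true);
    }
}
```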

  • "You need to identify why it is unaceptable". I don't necessarily. – joedev May 05 '21 at 22:07
  • Sorry, timed out ... "You need to identify why it is unacceptable". That's my point, really. In the previous question there were a lot of security concerns raised over Option 2. I didn't think they had serious relevance to a crawler, except of course that there is a risk the content source is not as labelled. The coding (as you say) is straightforward. – joedev May 05 '21 at 22:19
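The "collecting and loading those certificates" problem raised in the comments can be partly automated with a trust-on-first-use scheme: fetch each site's chain once over a deliberately unvalidated connection, then pin exactly those certificates for real crawling. A sketch, assuming helper names of my own choosing:

```java
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

public class CertCollector {

    /** Fetch the certificate chain a host presents, without validating it. */
    public static Certificate[] fetchChain(String host, int port) throws Exception {
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, new TrustManager[] { new X509TrustManager() {
            public void checkClientTrusted(X509Certificate[] c, String a) { }
            public void checkServerTrusted(X509Certificate[] c, String a) { }
            public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
        } }, null);
        try (SSLSocket socket = (SSLSocket) ctx.getSocketFactory().createSocket(host, port)) {
            socket.startHandshake();
            return socket.getSession().getPeerCertificates();
        }
    }

    /** Build an SSLContext that trusts exactly the given certificates. */
    public static SSLContext contextTrusting(Certificate[] chain) throws Exception {
        KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
        ks.load(null, null); // fresh, in-memory keystore
        for (int i = 0; i < chain.length; i++) {
            ks.setCertificateEntry("crawl-" + i, chain[i]);
        }
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(ks);
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, tmf.getTrustManagers(), null);
        return ctx;
    }
}
```

Note that the first fetch is exactly as spoofable as option 2; the scheme only guarantees that later crawls talk to the same server that was seen initially, which matches the risk profile discussed above.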