TL/DR: Why shouldn't we prefer https: IRIs when defining new vocabularies for the semantic web?
The semantic web is built around the use of IRIs to identify various components, be they resources like a webpage or abstract concepts like ownership. Every source I've consulted recommends the use of http: IRIs specifically, for example:
- Linked Data book (2011),
- UK government Open Data initiative (2010),
- W3C note on Cool URIs (2008), and
- W3C note on best practices for RDF vocabularies (2008).
This surprises me slightly. The world seems to be moving away from HTTP to HTTPS, yet I know of no vocabulary that uses https: IRIs, and none of the documents quoted above even discuss the question. I can find discussion on why ftp: or urn: are less good choices, but nothing about https:.
Even though IRIs on the semantic web are primarily identifiers, not locators, there's a convention that the IRI is a good place to look for more information about the entity, and various authorities recommend 303 redirects to RDF or OWL schemas or other descriptive documents with further information. If the IRI is an http: one, at least the initial request and redirect may be made over HTTP. Even if the schema content is in no sense confidential, a plain-HTTP request still has the following problems:
- It is susceptible to a man-in-the-middle attack. A malicious party could inject deliberately inconsistent schema information that may affect processing decisions made by applications, potentially causing a DoS or otherwise disrupting the user experience.
- ISPs may do a MITM themselves to inject adverts into content. Really they oughtn't to do this for non-HTML content (well, they shouldn't do it at all, but that's another matter), but that relies on the ISP caring enough to get this right. This can still happen over HTTPS, as Superfish demonstrated, but it's much harder.
- The request might be tracked by ISPs. The fact that a user is using an application that consults a particular schema is itself valuable information about the customer that can be sold to advertisers, a practice the US Senate recently voted to make legal. People are increasingly privacy-conscious and want to minimise this. Of course the ISP still knows which domain you've visited, as the SNI field is not encrypted, but we can still seek to minimise the leak of data.
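To make the exposure concrete, here is a minimal Python sketch of my own (not taken from any of the sources above). Given the chain of URLs a client visits when dereferencing a term, it reports which hops travel in cleartext. The example.org IRIs and the function name `cleartext_hops` are hypothetical, and no network access is involved.

```python
from urllib.parse import urlsplit


def cleartext_hops(chain):
    """Return the URLs in a dereferencing chain that are fetched over plain HTTP."""
    return [url for url in chain if urlsplit(url).scheme == "http"]


# Hypothetical chains: the term IRI, then the target of its 303 redirect.
http_vocab = [
    "http://example.org/vocab/owns",
    "https://example.org/vocab/owns.rdf",
]
https_vocab = [
    "https://example.org/vocab/owns",
    "https://example.org/vocab/owns.rdf",
]

print(cleartext_hops(http_vocab))   # the initial request is exposed
print(cleartext_hops(https_vocab))  # nothing is exposed
```

Even when the server redirects straight to HTTPS, the first hop of an http: vocabulary is the one an attacker or ISP gets to see and tamper with.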
If the client supports it, HSTS can be used to ensure subsequent accesses go directly over HTTPS, but this does nothing about the initial request that is still made over HTTP. Attempts at putting similar functionality in DNS have so far come to nothing, I suspect partly due to the slow adoption of DNSSEC. I'm not aware of any other technical measures that might alleviate the problems discussed above.
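The first-request gap can be sketched with a toy model of a client's HSTS store. This is a simplification under stated assumptions: the store starts empty, there is no preload list, and `max-age` expiry and `includeSubDomains` are ignored. The class and method names are my own, not any real library's API.

```python
from urllib.parse import urlsplit, urlunsplit


class HstsStore:
    """Toy HSTS cache: no expiry, no includeSubDomains, no preload list."""

    def __init__(self):
        self.known_hosts = set()

    def effective_url(self, url):
        """Upgrade http: to https: only if the host already has a stored policy."""
        parts = urlsplit(url)
        if parts.scheme == "http" and parts.hostname in self.known_hosts:
            return urlunsplit(("https",) + tuple(parts[1:]))
        return url

    def note_response(self, url, headers):
        """Record a policy; the header is only honoured on HTTPS responses."""
        parts = urlsplit(url)
        if parts.scheme == "https" and "Strict-Transport-Security" in headers:
            self.known_hosts.add(parts.hostname)


store = HstsStore()
iri = "http://example.org/vocab/owns"  # hypothetical vocabulary term

first = store.effective_url(iri)  # nothing known yet, so this stays http:
store.note_response(
    "https://example.org/vocab/owns",
    {"Strict-Transport-Security": "max-age=31536000"},
)
second = store.effective_url(iri)  # now upgraded before leaving the client

print(first)
print(second)
```

The first dereference still goes out over HTTP. Only later ones are upgraded, and only in clients that keep such a store at all, which I would not assume of a typical RDF library.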
These considerations all suggest to me that https: is a better choice than http: when defining a new vocabulary. Obviously the situation is different if you have an existing vocabulary that already uses http:, but that's not the case I'm interested in here.
However, I'm sure I'm not the first person to think of this, so I can only assume everyone still uses and recommends http: for a reason. If so, what are the disadvantages of https:? And can anyone direct me to a good discussion of this? So far as I can see, the W3C has nothing on the subject, which surprises me.