
TL;DR: Why shouldn't we prefer https: IRIs when defining new vocabularies for the semantic web?

The semantic web is built around the use of IRIs to identify various components, be they resources like a webpage or abstract concepts like ownership. Every source I've consulted recommends the use of http: IRIs specifically, for example:

This surprises me slightly. The world seems to be moving away from HTTP to HTTPS, yet I know of no vocabulary that uses https: IRIs, and none of the documents quoted above even discuss the question. I can find discussion on why ftp: or urn: are less good choices, but nothing about https:.

Even though IRIs on the semantic web are primarily identifiers, not locators, there's a convention that the IRI is a good place to look for more information about the entity, and various authorities recommend 303 redirects to documents like RDF or OWL schemas or other descriptive documents with further information. If the IRI is an http: one, at least the initial request and redirect may be made over HTTP. Even if the schema content is in no sense confidential, fetching it over unencrypted HTTP still has the following problems:

  • It is susceptible to a man-in-the-middle attack. A malicious party could inject deliberately inconsistent schema information that may affect processing decisions made by applications, potentially causing a DoS or otherwise disrupting the user experience.

  • ISPs may do a MITM themselves to inject adverts into content. Really they oughtn't to do this for non-HTML content (well, they shouldn't do it at all, but that's another matter), but that relies on the ISP caring enough to get this right. This can still happen over HTTPS, as Superfish demonstrated, but it's much harder.

  • The request might be tracked by ISPs. The fact that a user is using an application that consults a particular schema is itself valuable information about the customer that can be sold to advertisers, a practice the US Senate recently voted to make legal. People are increasingly privacy-conscious and want to minimise this. Of course the ISP still knows which domain you've visited, as the SNI field is not encrypted, but we can still seek to minimise the leak of data.
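
To make the exposure concrete, here is a minimal Python sketch that takes the chain of URLs visited while dereferencing a term and lists the hops that travel in cleartext. The `example.org` IRIs and the function name `cleartext_hops` are hypothetical, and the chain is written out by hand rather than fetched, so nothing here touches the network.

```python
from urllib.parse import urlsplit


def cleartext_hops(chain):
    """Return the URLs in a redirect chain that are requested over plain HTTP."""
    return [url for url in chain if urlsplit(url).scheme.lower() == "http"]


# Hypothetical chains: the term IRI, then the document a 303 redirect points to.
http_chain = [
    "http://example.org/vocab/owner",
    "https://example.org/vocab/owner.ttl",
]
https_chain = [
    "https://example.org/vocab/owner",
    "https://example.org/vocab/owner.ttl",
]

print(cleartext_hops(http_chain))   # the term IRI itself is exposed
print(cleartext_hops(https_chain))  # no cleartext hop
```

With an http: term IRI the first hop is always visible and modifiable on the wire, even if the server redirects straight to an HTTPS document.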

If the client supports it, HSTS can be used to ensure subsequent accesses go directly over HTTPS, but this does nothing about the initial request, which is still made over HTTP. Attempts at putting similar functionality in DNS have so far come to nothing, I suspect partly due to the slow adoption of DNSSEC. I'm not aware of any other technical measures that might alleviate the problems discussed above.
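
The HSTS limitation can be illustrated with a toy client-side cache. This is only a sketch under simplifying assumptions: the class `HstsCache` and its methods are made up for illustration, it ignores `includeSubDomains`, preload lists and ports, and the `example.org` IRI is hypothetical. It does follow the rule that a policy is only honoured when received over HTTPS.

```python
import time
from urllib.parse import urlsplit


class HstsCache:
    """Toy HSTS store: remembers hosts that sent Strict-Transport-Security."""

    def __init__(self):
        self._expiry = {}  # host -> expiry time in seconds

    def note_response(self, url, headers, now=None):
        """Record a policy, but only if the response arrived over HTTPS."""
        now = time.time() if now is None else now
        parts = urlsplit(url)
        value = headers.get("Strict-Transport-Security")
        if parts.scheme != "https" or value is None:
            return
        for directive in value.split(";"):
            name, _, arg = directive.strip().partition("=")
            if name.lower() == "max-age":
                self._expiry[parts.hostname] = now + int(arg.strip('"'))

    def rewrite(self, url, now=None):
        """Upgrade http: to https: only for hosts with an unexpired policy."""
        now = time.time() if now is None else now
        parts = urlsplit(url)
        if parts.scheme == "http" and self._expiry.get(parts.hostname, 0) > now:
            return parts._replace(scheme="https").geturl()
        return url


cache = HstsCache()
iri = "http://example.org/vocab/owner"

# First access: nothing is cached yet, so the request goes out over HTTP.
first = cache.rewrite(iri, now=0)

# Once an HTTPS response carrying the header has been seen, later accesses are upgraded.
cache.note_response(
    "https://example.org/vocab/owner",
    {"Strict-Transport-Security": "max-age=31536000"},
    now=0,
)
later = cache.rewrite(iri, now=10)
print(first, later)
```

The first lookup of an http: IRI is therefore unprotected for every fresh client, which matters for vocabularies because many consumers are short-lived scripts with no persistent cache.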

These considerations all suggest to me that https: is a better choice than http: when defining a new vocabulary. Obviously the situation is different if you have an existing vocabulary that already uses http:, but that's not the case I'm interested in here.

However, I'm sure I'm not the first person to think of this, so I can only assume everyone still uses and recommends http: for a reason. If so, what are the disadvantages of https:? And can anyone direct me to a good discussion of this? So far as I can see, the W3C has nothing on the subject, which surprises me.

Richard Smith
  • 2
    My two cents: see [1](https://www.w3.org/blog/2016/05/https-and-the-semantic-weblinked-data/), [2](https://pdfs.semanticscholar.org/7ab8/12ba3d0e8f3ff7dc5170cee37d2ccb441e09.pdf), [3](http://www.unicse.org/publications/2010/november/Making%20secure%20Semantic%20Web.pdf) and also the Semantic Web Layer Cake image :). – Stanislav Kralin Jul 03 '17 at 11:21
  • 1
    Thanks for those links. They're all new to me, so I'll need to read them carefully. I'm familiar, of course, with the "layer cake" diagram which puts trust at the very top of the stack. The problem is that the top of the stack doesn't yet exist, so currently there is no standard trust layer. I'm sure the eventual solution to trust will involve digital signatures and will mitigate the potential for a DoS from a MITM, but won't address the privacy issue in my third bullet point. In any case, I don't see why having trust at the top of the stack should preclude encryption at the bottom. – Richard Smith Jul 03 '17 at 11:35
  • 2
    I've read the [first of those links](https://www.w3.org/blog/2016/05/https-and-the-semantic-weblinked-data/) and it seems to focus on the W3C's existing vocabularies. I understand that you cannot easily change existing IRIs to `https:` as most semantic web technologies do a bytewise comparison of IRIs and will not recognise the `https:` version. I'm interested in what to do when defining a new vocabulary (and have edited my question to make this slightly clearer). – Richard Smith Jul 03 '17 at 11:52
  • 4
    [Halpin's paper](https://pdfs.semanticscholar.org/7ab8/12ba3d0e8f3ff7dc5170cee37d2ccb441e09.pdf) (your second URL) concludes that we should use `https:` IRIs. In explaining why this isn't current practice he links to a [page by Berners-Lee](https://www.w3.org/DesignIssues/Security-NotTheS.html) arguing that `https:` IRIs (as opposed to the use of TLS in the transport layer) were a mistake. However, Berners-Lee's rationale is largely based on backwards compatibility, which isn't relevant to new IRIs. This makes me wonder whether there's another factor that Halpin and I are overlooking. – Richard Smith Jul 03 '17 at 13:08
  • 2
    [Is SO the right place for this question/discussion](https://stackoverflow.com/help/dont-ask)? – TallTed Jul 03 '17 at 16:25
  • 2
    Just a quick comment to say I don't think the [third](http://www.unicse.org/publications/2010/november/Making%20secure%20Semantic%20Web.pdf) of @StanislavKralin's links above adds anything further. – Richard Smith Jul 03 '17 at 18:24
  • 1
    I don’t think that the linked documents recommend HTTP over HTTPS. As far as I can see, none of the documents says anything against HTTPS. It’s not uncommon to only mention "HTTP", but to actually mean "HTTP or HTTPS". -- FWIW, the popular vocabulary [Schema.org can be used with HTTP and with HTTPS URIs](https://webmasters.stackexchange.com/a/68741/17633). – unor Jul 04 '17 at 14:49
  • Unfortunately [that answer](https://webmasters.stackexchange.com/a/68741/17633) is not wholly accurate, as I've just [commented there](https://webmasters.stackexchange.com/questions/68709/secure-and-non-secure-schema-org-markup/68741#comment137399_68741). Regardless of what they say or intend, schema.org doesn't provide any schema information on the `https:` versions; more seriously, they use the `http:` version to refer to the abstract term (e.g. a person) and the `https:` version to refer to the document defining its use. This causes serious problems using `https://schema.org/` terms. – Richard Smith Jul 04 '17 at 15:09
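
The point made in the comments about bytewise comparison can be shown with plain strings. This is a minimal sketch, not how any particular RDF library works internally: the triples are bare Python tuples, and the `example.org` IRIs and the helper `subjects_with` are hypothetical.

```python
HTTP_TERM = "http://example.org/vocab/owner"
HTTPS_TERM = "https://example.org/vocab/owner"

# A tiny "graph" whose only statement uses the http: form of the predicate.
triples = {
    ("http://example.org/doc/1", HTTP_TERM, "http://example.org/people/alice"),
}


def subjects_with(predicate, graph):
    """Find subjects by exact string match on the predicate IRI."""
    return sorted(s for s, p, o in graph if p == predicate)


print(subjects_with(HTTP_TERM, triples))   # matches the stored statement
print(subjects_with(HTTPS_TERM, triples))  # no match: a different identifier
```

Because the two spellings are distinct identifiers, a vocabulary has to pick one scheme up front, which is why the choice matters most when the vocabulary is first defined.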
