I really don't understand why the character < is disallowed within attribute values of XML tags. The value has to be surrounded by either double or single quotes anyway, so it should not be a problem to parse at all (not even performance-wise).

I am really interested in the motivation for constraining the language in such an annoying way, because I am tempted to write an XML preprocessor that replaces every angle bracket inside an attribute value with its escaped form before passing the file to the actual XML parser, just to keep the parser happy. But I am wondering whether there is anything I am missing.

Martin Ring
  • Code for escaping such characters might already be available for the framework you're using. – Codor Apr 03 '14 at 14:13
  • @Codor Yeah, you're right. But I want to understand the rationale behind this design decision of the XML standard before violating it. – Martin Ring Apr 03 '14 at 14:25

2 Answers

The short (and probably only) answer is that it's a design decision made when the XML spec was being written.

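XML was deliberately designed to have a clear set of rules that can be strictly enforced without any possible ambiguity. One of those rules is that < and & must always be escaped as entities in text and attribute values. (> is usually escaped as well, although the spec only requires that when it appears in the sequence ]]> in content.)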

Yes, they could have allowed circumstances where they didn't need to be escaped, but they weren't designing a language to make it easy for humans to write; they were designing it to make it easy for computers to generate it and to parse it. The strictness of the rules is a result of that. XML that has been generated properly will parse properly because there are no ambiguities.

In any case, it's a decision that has been made and is never going to be changed. That's the way XML is, so those are the rules you have to follow.

There are a surprising number of systems out there which generate "XML" that fails these rules. This is bizarre because pretty much every language out there has an API for generating properly formed XML. One can only assume therefore that any system that generates broken XML has been written to generate it "manually", i.e. without using the APIs provided by the language. This is an immediate red flag that the system has been written by a developer who really doesn't know what he's doing. The fact that so many of these systems exist is a scary indictment of the general quality of code out in the wide world.
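To illustrate what "using the API" looks like, here is a minimal Python sketch based on the standard library. The element name `tag`, the attribute name and the sample value are made up for the example; other languages have equivalent facilities, and this is only one way to do it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Build the element through the API instead of concatenating strings.
elem = ET.Element("tag")
elem.set("attribute", "a < b & c")  # raw value, no manual escaping

# The serializer escapes '<' and '&' in the attribute value for us.
serialized = ET.tostring(elem, encoding="unicode")
print(serialized)

# Parsing the output gives back the original, unescaped value.
round_tripped = ET.fromstring(serialized).get("attribute")
print(round_tripped)

# If markup really has to be assembled by hand, quoteattr() escapes
# and quotes a single attribute value.
manual = "<tag attribute=%s/>" % quoteattr("a < b & c")
print(manual)
```

Either way, the caller never writes `&lt;` by hand; the escaping happens at the serialization boundary.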

Spudley
  • Thanks very much for your answer. However, I still don't understand how disallowing unambiguous (in the context of attributes) characters increases machine readability. Doesn't that actually require more complex parsing rules in order to reject malformed attributes? – Martin Ring Apr 03 '14 at 14:52
  • If you want to know what happens when you allow ambiguities in the language, look at the history of HTML. The browser engines went to great lengths to handle invalid or ambiguous code, and the end result was years of browser quirks where sites looked different across different browsers. The XML spec writers were explicitly trying to avoid those sorts of issues in XML. XML is a data-oriented language; you can't afford to have implementation-specific issues like that in data. It's bad enough in a presentational context, but cross-system communication that XML is used for simply can't allow it. – Spudley Apr 03 '14 at 15:05
  • I am not arguing with the point that no machine language should have ambiguities. But how would allowing the `<` character inside attributes lead to ambiguity? Do you have an example? – Martin Ring Apr 03 '14 at 15:11
  • The classic case in HTML is where you forget to close your quotes, so the next tag flows into the attribute. HTML parsers try to deal with this kind of thing, with varying degrees of success -- we've all seen pages that look okay in one browser but utterly broken in another. XML is a data-oriented language: data being transferred from one system to another simply must be parsed correctly; you can't afford to have parsing quirks like that. It's bad enough in HTML where the viewer sees a broken page, but it's much worse if you're importing mission-critical data into your company's main database. – Spudley Apr 03 '14 at 15:17
  • But isn't that a problem of parsers trying to compensate for defective input rather than an ambiguity in the language? I believe unclosed quotes should lead to a parse error rather than something unforeseeable. – Martin Ring Apr 03 '14 at 15:19
  • And for unclosed quotes: `` for example is completely valid XML according to the standard even though it is most probably a mistake. What could allowing `<` possibly add to this confusion? – Martin Ring Apr 03 '14 at 15:23
  • Let me answer my own question: `` would be possible if `<` and `>` were allowed. So I guess it is more of a human readability issue after all, right? – Martin Ring Apr 03 '14 at 15:28

@Spudley pointed me in the right direction:

If < were allowed inside attribute values (> is already permitted there),

<tag attribute="value'/> <tag attribute='value"/>

would be valid XML, yet it most probably does not represent the author's intention. A machine would never make this kind of mistake, but a human could write it and then be confused when the parser does not interpret it as expected. So the reason why these characters are not allowed can only be readability for humans.
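A minimal Python sketch of the point above, assuming the standard library's `xml.etree.ElementTree` parser. `is_well_formed` is a hypothetical helper written for this example, not part of any library. It shows that the mismatched-quote input is rejected today, and what the same text would mean if the `<` were escaped.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# The mismatched-quote example: the first attribute value runs on until
# the next double quote, so it contains a raw '<' and is rejected.
mismatched = """<tag attribute="value'/> <tag attribute='value"/>"""
rejected = not is_well_formed(mismatched)
print(rejected)

# With the '<' escaped, the same text is one element whose attribute
# value swallows what looked like a second tag.
escaped = """<tag attribute="value'/> &lt;tag attribute='value"/>"""
value = ET.fromstring(escaped).get("attribute")
print(value)
```

So the ban on a raw `<` turns a silent misreading into an immediate parse error at the point where the quotes went wrong.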

Martin Ring