HTML's handling of white-space characters depends on context - but what are the rules?

Question

The Unicode catalogue includes a number of white-space characters, some of which don't appear to work in any context in HTML documents - but some of which, rather usefully, do.

Here is an example:

<h1 title="Hi! As a title attribute, &#013;I can contain &#009;&#009;horizontal tabs &#013;and carriage returns &#010;and line feeds.">HTML's handling of &amp;#009; | &amp;#010; | &amp;#013;</h1>

<p>Hello. As a paragraph element, I can't contain &#009;horizontal tabs &#013;or carriage returns &#010;or line feeds.</p>

<input type="submit" value="I am a value attribute and &#010;like title I can also handle line feeds" /><br />

<input type="submit" value="I am another value attribute. &#009;&#009;Like title I can handle horizontal tabs" /><br />

<input type="submit" value="I am a third value attribute. &#013;Unlike title I can't handle carriage returns" />

Is there any official spec or series of guidelines which detail which white-space characters can be deployed in HTML documents and where?

Sorry but those are not [control characters](https://en.wikipedia.org/wiki/ASCII#Control_characters) (more specifically, they are line feeds and tabs, i.e. printable characters). Are you really asking about stuff like **EOT End of Transmission** or **ACK Acknowledgement**? — Álvaro González, Nov 25 '16 at 12:16
I take your point. I am happy to edit my question above if my terminology is incorrect. Is there a better name for this type of ASCII character? — Rounin, Nov 25 '16 at 12:20
As per the answer you've finally accepted, the term is white space. All your references to ASCII and control characters were totally out of place (and "print" is even more misleading, so I'm editing it out myself if you don't mind). — Álvaro González, Nov 28 '16 at 08:36
Possible duplicate of [Browser white space rendering](http://stackoverflow.com/questions/24615355/browser-white-space-rendering) — Álvaro González, Nov 28 '16 at 08:40
Thanks @ÁlvaroGonzález - I upvoted your answer but I had to decide which answer to accept and in the end I went for Anne's since it was shorter and clearer. — Rounin, Nov 28 '16 at 08:53
Also, @ÁlvaroGonzález - as you correctly identified in your very first comment, the characters I was referring to were _not_ control characters. Thank you for that correction. — Rounin, Nov 28 '16 at 08:55

score 4 · Accepted Answer · answered Nov 25 '16 at 14:37

It's a little unclear what you mean by "work", but I'm going to assume you mean rendering, in which case what happens is really up to CSS.

https://www.w3.org/TR/CSS2/text.html#white-space-model defines how most whitespace characters are normalized away, unless you adjust the white-space property.
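As a rough illustration of that model, here is a minimal Python sketch of how the CSS 2.1 `white-space` values treat tabs, spaces and line feeds. The names `WHITE_SPACE_RULES` and `apply_white_space` are made up for this example, and the sketch ignores wrapping, tab stops and the trimming of spaces at line edges, so treat it as an approximation rather than what a browser actually runs.

```python
import re

# Hypothetical table: for each CSS 2.1 'white-space' value,
# (collapse line feeds?, collapse spaces and tabs?).
WHITE_SPACE_RULES = {
    "normal": (True, True),
    "nowrap": (True, True),
    "pre": (False, False),
    "pre-wrap": (False, False),
    "pre-line": (False, True),
}


def apply_white_space(text: str, value: str = "normal") -> str:
    """Approximate the text a reader would see for a given 'white-space' value."""
    collapse_newlines, collapse_spaces = WHITE_SPACE_RULES[value]
    if collapse_spaces:
        # Spaces and tabs around a line feed are dropped...
        text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)
        # ...and remaining runs of spaces and tabs become a single space.
        text = re.sub(r"[ \t]+", " ", text)
    if collapse_newlines:
        # Line feeds are treated like spaces, so a run of them becomes one space.
        text = re.sub(r"\n+", " ", text)
    return text


sample = "tabs \t\tand\nline feeds"
print(repr(apply_white_space(sample)))              # default: 'normal'
print(repr(apply_white_space(sample, "pre-line")))  # keeps the line feed
print(repr(apply_white_space(sample, "pre")))       # keeps everything
```

With the default value the tabs and the line feed disappear into single spaces, which matches what the paragraph in the question shows; switching the value to `pre` or `pre-wrap` is what keeps them.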

Note that the display of tooltips (such as from the title attribute) and form controls (such as from input elements) is not defined by any standard, leaving that effectively up to browsers.

answered Nov 25 '16 at 14:37

Anne

Thank you for such a clear and concise answer. I was intrigued that should render in the value attribute, but not in the title attribute, but if this is a browser level decision outside the W3 spec, then that would explain the (apparent) inconsistency. – Rounin Nov 25 '16 at 21:52

Álvaro González · Answer 2 · 2016-11-28T08:39:11.833

Disclaimer: this answer was composed for the question as originally written, making explicit references to ASCII control characters. It was apparently a red herring so the information here may look confusing now.

First of all, I don't think anybody uses ASCII any more. In 2016 the only sensible encoding is UTF-8. In any case, UTF-8 is a superset of ASCII (and you can still use ASCII) so the question is still valid.

Secondly, your example isn't correct. All the HTML entities you mention are printable characters:

&#009; is 'CHARACTER TABULATION' (U+0009) (i.e. a tab)
&#013; is 'CARRIAGE RETURN (CR)' (U+000D) (i.e. a legacy MacOS line feed)
&#010; is 'LINE FEED (LF)' (U+000A) (i.e. a Unix line feed)

(And please note that Windows line feeds are a combination of CR+LF.)
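A quick way to double-check which characters the three references from the question (`&#009;`, `&#010;`, `&#013;`) stand for is to decode them outside the browser. This is only a sketch using Python's standard-library `html.unescape` as a stand-in for a browser's character-reference decoding; the variable names are just for illustration.

```python
import html

# The numeric character references used in the question.
references = ["&#009;", "&#010;", "&#013;"]

# Decode each reference to the single character it represents.
decoded = {ref: html.unescape(ref) for ref in references}

for ref, char in decoded.items():
    # Show the reference, the decoded character and its code point.
    print(ref, "->", repr(char), f"U+{ord(char):04X}")
```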

If you're really talking about control characters:

EOT End of Transmission
ACK Acknowledgement
BEL Bell
...

... we first need to understand that HTML is meant to be plain text (as such, its MIME content type is text/html). The HTML5 Living Standard provides a definition of control character that's wider than the ASCII one, but in any case such characters don't seem to be allowed:

Any occurrences of any characters in the ranges U+0001 to U+0008, U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse errors. These are all control characters or permanently undefined Unicode characters (noncharacters).

Any character that is a not a Unicode character, i.e. any isolated surrogate, is a parse error. (These can only find their way into the input stream via script APIs such as document.write().)
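To see which side of that rule a given character falls on, the quoted ranges can be turned into a small check. This is a sketch only: `is_flagged_code_point` is a name invented here, and it covers just the two quoted paragraphs (it says nothing about U+0000 or anything else the spec handles separately).

```python
def is_flagged_code_point(cp: int) -> bool:
    """Return True if the quoted text calls this code point a parse error."""
    # Control-character ranges listed in the quote.
    if 0x0001 <= cp <= 0x0008 or 0x000E <= cp <= 0x001F:
        return True
    if 0x007F <= cp <= 0x009F:
        return True
    # U+000B (vertical tab) is listed on its own.
    if cp == 0x000B:
        return True
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    if cp <= 0x10FFFF and (cp & 0xFFFF) in (0xFFFE, 0xFFFF):
        return True
    # Isolated surrogates (second quoted paragraph).
    return 0xD800 <= cp <= 0xDFFF


for name, cp in [("EOT", 0x04), ("ACK", 0x06), ("BEL", 0x07),
                 ("TAB", 0x09), ("LF", 0x0A), ("CR", 0x0D)]:
    print(name, is_flagged_code_point(cp))
```

EOT, ACK and BEL are flagged, while tab, line feed and carriage return are not, which is why the characters in the question behave differently from "real" control characters.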

If you actually refer to the characters in your example, some of them are treated as exceptions in the parsing stage:

U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF) characters are treated specially. Any LF character that immediately follows a CR character must be ignored, and all CR characters must then be converted to LF characters. Thus, newlines in HTML DOMs are represented by LF characters, and there are never any CR characters in the input to the tokenization stage.
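The normalization described in that quote is easy to mimic. Below is a minimal sketch in Python; `normalize_newlines` is a hypothetical helper written for this answer, not something taken from the spec or from any browser.

```python
def normalize_newlines(text: str) -> str:
    """Mimic the quoted rule: ignore LF right after CR, then turn CR into LF."""
    # Dropping the LF of every CR+LF pair leaves a lone CR behind...
    without_crlf = text.replace("\r\n", "\r")
    # ...and every remaining CR is then converted to LF.
    return without_crlf.replace("\r", "\n")


# Windows (CR+LF), legacy Mac (CR) and Unix (LF) line endings in one string.
print(repr(normalize_newlines("one\r\ntwo\rthree\nfour")))
```

All three line-ending styles end up as a plain LF, so by the time the markup is tokenized there is nothing left to tell them apart.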

... but I suspect you are only interested in white-space collapsing:

In HTML, only the following characters are defined as white space characters:

ASCII space (&#x0020;)

ASCII tab (&#x0009;)

ASCII form feed (&#x000C;)

Zero-width space (&#x200B;)

[...]

In particular, user agents should collapse input white space sequences when producing output inter-word space.

[...]

The PRE element is used for preformatted text, where white space is significant.

In other words, consecutive white space characters become a single space (except inside a <pre> element). (I could only find a link for HTML 4, but this is something that hasn't changed significantly.)

Is there any official spec or series of guidelines? Sure there are: you have the official W3C recommendations and the WHATWG specs, but they're basically technical documentation mostly addressed at browser vendors: extensive, comprehensive and hard to decipher into plain English ;-)

HTML4 has a rather outdated definition of what constitutes whitespace, it seems, and really should not be referenced anymore as it's long obsolete. The way this is defined in the HTML Standard today is by simply deferring to CSS, which is the proper place to define these details. (Although the HTML standard will suggest default rendering, such as that `pre` should have `white-space:pre` and such.) — Anne, Nov 25 '16 at 14:40
@Anne You're right about that. My HTML 4 link should only be taken as a quick and dirty explanation. — Álvaro González, Nov 28 '16 at 08:33