56

Unicode defines several control characters from ASCII. http://www.unicode.org/charts/PDF/U0000.pdf

I see that many control characters are widely used, but I really don't see where the "information separators" (U+001C to U+001F) are used.

What are they? What's their history? What were they used for?

hippietrail
eonil
  • 2
    The field and record separators can be used to marshal table data as a string. It's a bit archaic, but it works. – The Nail Jan 01 '12 at 20:07
  • 6
    Thanks for asking this. I'm totally going to use unit separators instead of tab or comma-delimiting text now. – bugloaf Jun 18 '17 at 16:17
  • 2
    FYI, Unicode actually defines *all* 128 characters of US-ASCII, not just some of the control characters. Unicode is a superset of US-ASCII. – Basil Bourque Jul 10 '21 at 20:46

3 Answers

68

Lammert Bies explains both their usage and the history behind them.

28 – FS – File separator The file separator FS is an interesting control code, as it gives us insight in the way that computer technology was organized in the sixties. We are now used to random access media like RAM and magnetic disks, but when the ASCII standard was defined, most data was serial. I am not only talking about serial communications, but also about serial storage like punch cards, paper tape and magnetic tapes. In such a situation it is clearly efficient to have a single control code to signal the separation of two files. The FS was defined for this purpose.

29 – GS – Group separator Data storage was one of the main reasons for some control codes to get in the ASCII definition. Databases are most of the time setup with tables, containing records. All records in one table have the same type, but records of different tables can be different. The group separator GS is defined to separate tables in a serial data storage system. Note that the word table wasn't used at that moment and the ASCII people called it a group.

30 – RS – Record separator Within a group (or table) the records are separated with RS or record separator.

31 – US – Unit separator The smallest data items to be stored in a database are called units in the ASCII definition. We would call them field now. The unit separator separates these fields in a serial data storage environment. Most current database implementations require that fields of most types have a fixed length. Enough space in the record is allocated to store the largest possible member of each field, even if this is not necessary in most cases. This costs a large amount of space in many situations. The US control code allows all fields to have a variable length. If data storage space is limited—as in the sixties—this is a good way to preserve valuable space. On the other hand is serial storage far less efficient than the table driven RAM and disk implementations of modern times. I can't imagine a situation where modern SQL databases are run with the data stored on paper tape or magnetic reels...

A unit separator could serve essentially the same purpose as a comma in a CSV file or a tab in a tab-delimited file.
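
As a rough illustration of that idea, here is a minimal Python sketch that marshals a small table into a single string, using US (0x1F) between fields and RS (0x1E) between records. The names `dump_table` and `load_table` are hypothetical helpers made up for this example, not part of any standard library, and the sketch assumes the field values never contain the separator characters themselves.

```python
US = "\x1f"  # Unit Separator: placed between fields
RS = "\x1e"  # Record Separator: placed between records


def dump_table(rows):
    """Join rows of string fields into one US/RS-delimited string (hypothetical helper)."""
    for row in rows:
        for field in row:
            # Assumption: data never contains the separators, so no escaping is defined.
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(row) for row in rows)


def load_table(text):
    """Split a US/RS-delimited string back into rows of fields (hypothetical helper)."""
    if text == "":
        return []
    return [record.split(US) for record in text.split(RS)]


rows = [["name", "motto"], ["Ada", "Hello, world"], ["Bob", "tabs\tand, commas"]]
encoded = dump_table(rows)
assert load_table(encoded) == rows  # commas and tabs in the data need no quoting
```

Because commas, tabs and quotes are ordinary data here, none of the quoting rules that CSV needs come into play. The trade-off is that the separators are invisible in most editors.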

Community
Jonas Elfström
  • 27
    It's kind of depressing to learn that the CSV and TSV file formats are based on a lack of ASCII knowledge. – Dag Høidahl Jan 14 '20 at 13:40
  • 6
    CSV uses comma, but that's a printable character. All the control codes are considered non-printable, and are thus binary data - even if mostly human readable. So there is a small difference. – Chris Uzdavinis Aug 04 '20 at 13:11
  • 5
    Re: _"All the control codes are considered non-printable"_ - in their ASCII form, they are. But, as I discovered today, Unicode _also_ has printable characters representing these control codes: ␜ ␝ ␞ ␟ – Rounin Jan 03 '22 at 23:23
11

Did you mean that most of them are usually not used these days? The control characters mostly relate to device control functions, but some of them may have been used as separators in text files. For a quick reference, check my table of C0 Controls.

The information separators have been used to group data in a simple manner, but these days either binary formats or XML are used for data organization. There are still curiosities, like the internal use of U+001E and U+001F in Microsoft Word to implement the program’s own idea of “nonbreaking hyphen” and “optional hyphen” (as opposed to the Unicode characters for similar purposes). This mainly illustrates that programs can use control characters in weird ways. Problems arise, of course, if the characters are included in text transmitted to other programs.

Jukka K. Korpela
2

They're deliberately ambiguous in function. From the standard reference for character coding development (Mackenzie, Charles E. Coded-Character Sets: History and Development. Addison-Wesley Longman Publishing Co., Inc., 1980.), chapter 26 section 1, page 460:

Four additional general-purpose characters, the so-called information separators, were designed into the [ASCII] 7-Bit Code and into EBCDIC. File Separator, Group Separator, Record Separator, and Unit Separator were defined broadly to be used to separate blocks of information. But how they were to be used to separate blocks, what philosophy of file and record structuring was to be used, was intentionally not specified. Such detailed specification would be left to the particular data processing application in which the separators would be used. Initially, a hierarchial [sic] philosophy of structuring information blocks was defined. A “file” was larger than, and would enclose, “groups.” A “group” was larger than, and would enclose, “records.” And a “record” was larger than, and would enclose, “units.” Eventually, the standards committees made this hierarchial [sic] specification optional; that is, the separators need not be used hierarchially [sic], but if they were, then the hierarchy would be as described above. The standards committees realized that, as with the Device Controls, the unspecificity of the information separators could lead to difficulty of information interchange, but such difficulties could be worked out in the rare instances when they arose.

One example of a standard that uses this outline hierarchy is a now-superseded version of the ANSI/NIST-ITL Standard for interchange of forensic biometric images. The ITL “Traditional Encoding” used the ASCII separators as follows:

␜ File separator character – separates logical records.

␝ Group separator character – separates fields.

␞ Record separator character – separates repeated subfields.

␟ Unit separator character – separates information items.

This usage might appear to contradict the named purpose of the separators, but it follows their intended hierarchy: each separator in the list delimits a smaller block than the one before it, which makes the choices in the ITL standard easier to understand.

A current example of a data format that uses an ASCII separator control code is the JavaScript Object Notation (JSON) text sequence format (RFC 7464, media type application/json-seq), which places an ASCII Record Separator (0x1E) character before each record.
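
A minimal Python sketch of that framing follows, assuming the RFC 7464 layout in which each JSON text is preceded by RS and followed by a line feed. The function names `encode_json_seq` and `decode_json_seq` are made up for this example, and the decoder is deliberately naive: it raises on a truncated or malformed record instead of recovering from it.

```python
import json

RS = "\x1e"  # ASCII Record Separator, written before each JSON text
LF = "\n"    # line feed, written after each JSON text


def encode_json_seq(values):
    """Serialize an iterable of values as a JSON text sequence (hypothetical helper)."""
    # json.dumps escapes control characters inside strings, so a raw RS
    # can only ever appear as a record boundary.
    return "".join(RS + json.dumps(value) + LF for value in values)


def decode_json_seq(text):
    """Parse a JSON text sequence back into a list of values (hypothetical helper)."""
    values = []
    for chunk in text.split(RS):
        chunk = chunk.strip()
        if chunk:  # skip the empty piece before the first RS
            values.append(json.loads(chunk))
    return values


records = [{"id": 1}, {"id": 2, "tags": ["a", "b"]}]
assert decode_json_seq(encode_json_seq(records)) == records
```

The leading RS gives a reader a point at which to pick up again at the next record boundary, although this sketch does not attempt that kind of recovery.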

scruss