This Wikipedia article on word sizes provides a table of word sizes in different computer architectures. It has columns like 'integer size', 'floating point size', etc. I suppose integer size is the size of the ALU's operands, floating point size is the size of the FPU's operands, and unit of address resolution is the number of bits/trits/digits referred to by a single address. Word size is given as the natural size of data used by the processor (which is still somewhat confusing).

But I'm wondering what the 'char size' column in the table represents. Is it the smallest object size theoretically possible? Is it the smallest alignment possible? What are the common operations defined over data of char size? In the x86, x86-64 and ARM architectures, char size is 8 bits, which is the same as the smallest integer size. But on some other architectures, char size is 5/6/7 bits, which is very different from the integer size on that architecture.

Joachim Sauer
Sourav Kannantha B
  • When I try to tag my question with the `computer-architecture` tag, it gets converted to `cpu-architecture`, I don't know why!! – Sourav Kannantha B Feb 28 '22 at 16:17
  • That's because [tag:computer-architecture] is a synonym for [tag:cpu-architecture] here on SO: https://stackoverflow.com/tags/cpu-architecture/synonyms – Joachim Sauer Feb 28 '22 at 16:22
  • And regarding your actual question: "char size" is probably just the size of a single unit of text (i.e. a "character") as mentioned in that article. If that *doesn't* answer your question, then I don't know what other info you need, sorry. – Joachim Sauer Feb 28 '22 at 16:23
  • @JoachimSauer I'm not sure if you are right. But does computer architecture have anything to do with text? Does it do any text manipulation or something? Are there any operations defined over "char size"-ed data? Otherwise, isn't it unnecessary for an architecture to define "char size", since only I/O devices are required to know it? – Sourav Kannantha B Feb 28 '22 at 16:31
  • ... For example, my machine is x86-64, apparently having an 8-bit "char size", but it is still able to render UTF-16 text, which has 16-bit code units. Maybe the article is misleading, I'm not sure. But I was curious after seeing that. – Sourav Kannantha B Feb 28 '22 at 16:33
  • Probably just the typical size of chars used on that machine type, including for built-in low-level routines for I/O (firmware, BIOS, operating system, hardware). Also, some architectures have string operations, e.g. `rep` on x86, for comparing, searching and copying. – Sebastian Feb 28 '22 at 16:50
  • As the article states: it was relevant "in the past (pre-variable-sized character encoding)". You have to keep in mind that, compared to the average age of the computer architectures in that table, Unicode and UTF-* are an **insanely** new invention. The first version of Unicode was released in 1991, at which point "1 char = 8 bits" had already been firmly cemented, as is visible in that table. Anything before the 8-bit era tended to use some proprietary encoding for that computer. A good chunk of this table is even pre-ASCII (1963). – Joachim Sauer Feb 28 '22 at 16:53
  • In modern C, a `char` is guaranteed to be independently modifiable, without disturbing surrounding data. So on Alpha or word-addressable CPUs, a `char` had to be the word size, or else every `char` store would have to compile to an atomic RMW on the containing word. (Rather than a much cheaper *non*-atomic RMW like some early compilers actually used.) See [Can modern x86 hardware not store a single byte to memory?](https://stackoverflow.com/q/46721075) / [C++ memory model and race conditions on char arrays](https://stackoverflow.com/q/19903338). – Peter Cordes Mar 01 '22 at 02:53

1 Answer

In modern C, a char is guaranteed to be independently modifiable, without disturbing surrounding data. It's usually chosen to be the width of the narrowest load/store instruction. So on Alpha or on word-addressable CPUs, a char had to be the word size, or else every char store would have to compile to an atomic RMW on the containing word (rather than the much cheaper non-atomic RMW some early compilers actually used, before C11 introduced a thread-aware memory model to the language). See Can modern x86 hardware not store a single byte to memory? (which covers modern ISAs in general) and C++ memory model and race conditions on char arrays for the requirements C++11 and C11 place on char.
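To make that concrete, here is a minimal sketch of my own (not from the linked Q&As; the 32-bit word size is just an assumption for illustration) of what "store one char" has to turn into on a machine whose narrowest store is a full word: a non-atomic read-modify-write of the containing word, which can silently lose a concurrent write to a neighbouring byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lowering of "store one 8-bit char" on a word-addressable
 * machine whose narrowest store is a 32-bit word: load the containing
 * word, clear the target byte, merge in the new value, store the word
 * back.  The read-modify-write is not atomic, so a concurrent store to
 * a neighbouring byte in the same word can be lost -- exactly what the
 * C11 memory model forbids for plain char stores. */
void store_char(uint32_t *mem, size_t byte_index, uint8_t c)
{
    size_t   word  = byte_index / 4;        /* which 32-bit word       */
    unsigned shift = (byte_index % 4) * 8;  /* byte's position in word */

    uint32_t w = mem[word];                 /* load the whole word     */
    w &= ~((uint32_t)0xFF << shift);        /* clear the target byte   */
    w |= (uint32_t)c << shift;              /* merge in the new byte   */
    mem[word] = w;                          /* store the whole word    */
}
```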

But that Wikipedia table of word and char sizes in historical machines is clearly not about that, given the sizes (e.g. smaller than a word on some word-addressable machines, I'm pretty sure).

It's about how software (and character I/O hardware like terminals) packed multiple characters of the machine's native character encoding (e.g. a subset of ASCII, EBCDIC, or something earlier) into machine words.

Unicode, and variable-length character encodings like UTF-8 and UTF-16, are recent inventions compared to that history (https://en.wikipedia.org/wiki/Character_encoding#History). Many systems used fewer than 8 bits per character; e.g. 6 bits (64 unique codes) is enough for the upper- and lower-case Latin alphabet plus some special characters and control codes.

These historical character sets motivated some programming languages' choices to use (or avoid) certain special characters, because those languages were developed on systems limited to a particular character set.

Historical machines really did do things like pack 3 characters of text into an 18-bit word.
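As a rough illustration (my own sketch, not any particular machine's real format), packing and unpacking three 6-bit character codes in an 18-bit word looks something like this, modelled in the low 18 bits of a uint32_t; each 6-bit character conveniently shows up as two octal digits:

```c
#include <stdint.h>
#include <stdio.h>

/* Pack three 6-bit character codes into one 18-bit word (character 0
 * in the highest 6 bits), modelled in the low bits of a uint32_t. */
static uint32_t pack3(unsigned c0, unsigned c1, unsigned c2)
{
    return ((uint32_t)(c0 & 0x3F) << 12) |
           ((uint32_t)(c1 & 0x3F) <<  6) |
            (uint32_t)(c2 & 0x3F);
}

/* Extract character i (0, 1 or 2) from a packed 18-bit word:
 * shift it down and mask off the other two characters. */
static unsigned unpack(uint32_t word, unsigned i)
{
    return (word >> (12 - 6 * i)) & 0x3F;
}

int main(void)
{
    uint32_t w = pack3(1, 2, 3);
    printf("word = %06o octal, chars = %u %u %u\n",
           (unsigned)w, unpack(w, 0), unpack(w, 1), unpack(w, 2));
    /* prints: word = 010203 octal, chars = 1 2 3 */
}
```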

You might want to search on https://retrocomputing.stackexchange.com/, or even ask a question there after doing some more reading.

Peter Cordes
  • So "char size" is meaningful only when the "word sized" register is storing some text data. Apart from that, you can't do any operations on char size, mainly loading and storing "char size"-ed data? – Sourav Kannantha B Mar 01 '22 at 07:37
  • Also, "word size" is given as the natural size of the processor. What counts as the natural size here? Is it the width of the data lines or the address lines? – Sourav Kannantha B Mar 01 '22 at 07:38
  • @SouravKannanthaB: Usually word size is integer register width. External busses might be wider or narrower. But in ancient machines, often the same width, not using extra logic to serialize a word into multiple separate byte operations or whatever like an 8088 would over its 8-bit bus. (Some of those machines are so old that there isn't really an "external"; the CPU isn't even a single chip. But yeah, there'd still be a memory bus of some sort.) – Peter Cordes Mar 01 '22 at 07:45
  • Posed in a different way, what made '64-bit' the word size in x86-64? And what made '10-dig' the word size in ENIAC? (in Wikipedia) – Sourav Kannantha B Mar 01 '22 at 07:46
  • @SouravKannanthaB: As for doing operations on single char elements, it depends on the machine. If they don't have narrower load/store instructions, then no you'd have to unpack with shifts and AND to work with each char individually. And yes in that case it's only meaningful for actual text data. – Peter Cordes Mar 01 '22 at 07:47
  • @SouravKannanthaB: In x86-64, that's the integer register width, the width of RAX, RCX, etc. I don't know ENIAC; presumably it used decimal instead of binary, and each storage location held 10-digit numbers. – Peter Cordes Mar 01 '22 at 07:48