15

The original MIX architecture features 6-bit bytes and memory is addressed as 31-bit words (5 bytes and a sign bit). As a thought exercise I'm wondering how the C language can function in this environment, given:

  • char has at least 8 bits (annex E of C99 spec)
  • C99 spec section 6.3.2.3 ("Pointers") paragraph 8 says "When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object." My interpretation of this requirement is that it underpins "memcpy(&dst_obj, &src_obj, sizeof(src_obj))".

Approaches that I can think of:

  1. Make char 31 bits, so indirection through "char*" is simple memory access. But this makes strings wasteful (and means it isn't POSIX-compliant as that apparently requires 8 bit chars)
  2. Pack three 8 bit chars into one word, with 7 ignored bits: "char*" might be composed of word address and char index within it. However this seems to violate 6.3.2.3, i.e. memcpy() would necessarily skip the ignored bits (which are probably meaningful for the real object type)
  3. Fully pack chars into words, e.g. the fourth 8 bit char would have 7 bits in word 0 and one bit in word 1. However this seems to require that all objects are sized in 8 bit chars, e.g. a "uint31_t" couldn't be declared to match the word length since this again has the memcpy() problem.

So that seems to leave the first (wasteful) option of using 31-bit chars with all objects sized as multiples of char - am I correct in reading it this way?
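
For concreteness, the 6.3.2.3 guarantee I'm relying on amounts to this portable sketch (nothing MIX-specific; `copy_bytes` is just a hypothetical spelled-out memcpy):

```c
#include <stddef.h>

/* A byte-wise copy: per 6.3.2.3 paragraph 8, every byte of an object's
 * representation is reachable through an unsigned char *, so
 * memcpy(&dst_obj, &src_obj, sizeof src_obj) need be no more than: */
static void copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];   /* touches every byte, whatever CHAR_BIT is */
}
```

Under approach 1 this works because sizeof counts whole 31-bit chars; under approach 2 the seven ignored bits per word would never be visited, which is exactly the violation I'm worried about.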

  • The first C compiler didn't run on a PDP-7 (the linked Wikipedia article doesn't say anything like that). Just like Unix itself, it ran on a PDP-11/20 which has 8-bit bytes and 16-bit words. – cremno Mar 04 '15 at 10:40
  • 11
    Just do what many "C" compilers for oddball/arcane devices have done before you and violate the standard to create something that is actually practically usable as a systems programming language. Recent C standards also requires two's complement and a host of other things which just aren't going to be workable for a machine like this without severely degrading the performance. – doynax Mar 04 '15 at 10:42
  • Ooops - I conflated the Unix & C notes in http://www.linfo.org/pdp-7.html. So C did indeed come from an 8/16 bit background, which seems much more in the spirit of K&R and dilutes that last parenthetical point – Software Tyro Mar 04 '15 at 10:44
  • @doynax I was hoping (at least as a first cut) to leave the language work to something like clang (overkill for simple C) and just write a backend. But although the integer classes support arbitrary bit-widths I didn't find it being used beyond the customary 8/16/32/64/128 series, and that got me to wondering how/if it really could be used. – Software Tyro Mar 04 '15 at 10:52
  • 2
    @Software Tyro: I fear you won't find much in the way of tools to support an architectures like this. Actually C never really was the portable assembly it is made out to be, that is more of an illusion caused by the influence of C on modern architectures rather than C model somehow being universal. For many common pointer/stack-averse MCUs C is actually quite painful. – doynax Mar 04 '15 at 11:05
  • 2
    @doynax - interesting way of flipping the domain around (triumph of C drives modern architecture). On a similar note I think (didn't conclusively prove) that GDB stack walking implicitly assumes that callstack is a slab of ordinary RAM with interleaved return addresses and variables, which is of course true for current mainstream but wouldn't let it support some earlier machines (of course not a requirement, and obsolete machines routinely get culled from GDB support anyway) – Software Tyro Mar 04 '15 at 11:20
  • 3
    I found [cluecc](http://cluecc.sourceforge.net/) a while back, it's a nice little toy that puts a C compiler onto some unusual virtual 'machines'. – user3710044 Mar 07 '15 at 11:04
  • @user3710044 - thanks for that steer; definitely an interesting addition to the headspace! – Software Tyro Mar 09 '15 at 07:46
  • There are some similarities with this question : http://stackoverflow.com/questions/18537747/is-char-bit-4-a-possible-value-authorized-by-the-c-standard . Answers might help. – Cyan Mar 24 '15 at 14:15
  • @Cyan - Thanks! Indeed I hadn't stumbled across that question (or that CPU) when searching. Interesting device, and impressively long-lived. Since the word type (nybble) divides evenly into 8-bit I think the handling of pointers would be fairly straightforward: all pointers can be the same size rather than needing "fat pointers" to subdivide words, char* increments in 2's, etc. Did you work with the hp48gcc compiler at all? (I know this comment really belongs on your CHAR_BIT question but I don't have enough rep) – Software Tyro Mar 25 '15 at 07:59
  • @Tyro Nope, I did not. hp48gcc really wastes a lot of energy trying to "adapt" 8-bit logic onto the Saturn CPU. Since it was suboptimal, I was considering making my own compiler, but the discovery that I could never make it "C standard compliant" due to the restriction of at least 8 bits per char made me think twice. Plus, anyway, it would have been a learning exercise, requiring tons of time to get through. By the way, standard pointers on this platform (size_t) are 20 bits (== 5 nibbles). – Cyan Mar 25 '15 at 13:37
  • You might pack 31 8-bit bytes into 8 31-bit words wasting no bits. By this the compiler could conform to the C standard but performance could be expected to be quite low due to the complex addressing resulting from this. – Meixner Apr 15 '15 at 09:17
  • @Meixner - I think byte-packing is only valid if all memory accesses for all types pass through it, i.e. implementing the memory model in software. Otherwise a native data type (say a 31 bit integer if it was more conventional than MIX) can't be accessed via "char*" since one of the bytes has only 7 bits, i.e. violating 6.3.2.3 [as I understand it; need a C language lawyer to confirm] – Software Tyro Apr 16 '15 at 07:52
  • 2
    Step 1 "Make char 31 bits". `short` and `int` also 31-bits. `long` --> 62 bits. `long long` --> 124. Get compiler working with the "inefficient" model - then consider performance improvements. I suspect step 1 will itself be hard enough. – chux - Reinstate Monica Apr 24 '15 at 01:58

2 Answers

5

I agree that C on a MIX architecture could be a pain to implement and, though not a language lawyer myself, it seems to me that you are right: your approach 1 is the only conforming one.

Anyway, the space wasted on strings is the least of your problems: you can circumvent it with a solution older than C itself, making each char represent multiple letters. For the MIX architecture you could, for example, devise a 7-bit encoding and pack four letters into each char:

char hi[4];
hi[0] = 'hell';
hi[1] = 'o, w';
hi[2] = 'orld';
hi[3] = '\0';

printf("%s", hi);

// Whoops, we forgot the exclamation mark
putchar('!\n');

This implementation seems strange, but according to Wikipedia, it was used in the first "Hello world" program ever. I took a look at the Standard and found nothing preventing it, even in C11. Especially § 6.4.4.4 allows encoding literal characters and strings in implementation-specific ways.
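
The packing itself is simple arithmetic. A sketch (the left-to-right ordering and the plain 7-bit code are assumptions on my part; § 6.4.4.4 makes the value of a multi-character constant implementation-defined):

```c
/* Hypothetical packing of four 7-bit codes into one 31-bit char value,
 * leftmost letter in the most significant position (an assumption --
 * the standard leaves multi-character constants implementation-defined). */
static unsigned long pack4(const char *s)
{
    return ((unsigned long)(s[0] & 0x7F) << 21)
         | ((unsigned long)(s[1] & 0x7F) << 14)
         | ((unsigned long)(s[2] & 0x7F) <<  7)
         |  (unsigned long)(s[3] & 0x7F);
}
```

On such an implementation `'hell'` would evaluate to `pack4("hell")`, and a single char could hold it, since 28 bits fit comfortably in 31.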

EDIT:

This does not help with the other difficulties, the main one being that you cannot use the vast majority of your machine's instructions, since the native C types cannot address individual MIX bytes. You can, however, make use of bit-fields in this way:

typedef struct _bytes {
    unsigned int sign  : 1;
    unsigned int byte1 : 6; // EDIT: bitfields must be 
    unsigned int byte2 : 6; // declared as ints in standard C
    unsigned int byte3 : 6;
    unsigned int byte4 : 6;
    unsigned int byte5 : 6;
} bytes;

typedef union _native_type {
    char as_word;
    int as_int; // int = char; useful for standard library functions, etc.
    bytes as_bytes;
} native_type;

Note that in C++, because of a clause in the strict aliasing rules, you must be careful to always access the char member between an access to int and one to the bytes, since this snippet:

native_type a, b;
a.as_int = 0xC11BABE;
b.as_bytes.byte4 = a.as_bytes.byte4; // Whoops

would yield undefined behavior: see here for details.
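
If the aliasing subtleties are a concern, a portable alternative is to extract the 6-bit bytes with shifts and masks, which is well-defined in both languages (a sketch; the 1-based, most-significant-first byte numbering is an assumption chosen to match MIX's convention):

```c
/* Extract MIX byte i (i = 1..5, byte 1 most significant) from the
 * low 30 bits of a word; the sign bit would be handled separately. */
static unsigned mix_byte(unsigned long word, int i)
{
    return (unsigned)((word >> (6 * (5 - i))) & 0x3F);
}
```

This trades the bit-field syntax for explicit arithmetic, but it sidesteps the union question entirely.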

Alberto M
  • 1,057
  • 8
  • 24
  • Very helpful info - if a little disturbing regarding the C/C++ gulf around unions. I'll leave the question as "unanswered" for a while longer in the hopes that a card-carrying certified language lawyer does arrive – Software Tyro Apr 30 '15 at 08:47
  • @SoftwareTyro I'd also be curious of hearing a C++ guru's opinion. Maybe you could add the "language-lawyer" tag to your question? – Alberto M Apr 30 '15 at 20:04
  • d'oh! I never even thought to look for that tag but of course although jocular it's very much in real use. Thanks again! – Software Tyro May 04 '15 at 07:48
  • 1
    C++ allows it since chars (including signed/unsigned chars, and structs/unions/arrays containing them) can alias anything. – o11c May 04 '15 at 07:55
  • @o11c You are right: I corrected my example. Anyway now the C++ solution has become much more viable, just a little less comfortable than the C one. – Alberto M May 06 '15 at 04:23
2

The most practical approach would probably be to make int 30 bits and char either 10 or 15 bits. Using ten bits for char would allow ASCII text to be packed more tightly, but would increase the cost of indexing into char arrays because of the required divide-by-three. Storage of Unicode text could be fairly efficient with either a 10- or 15-bit char. With a 15-bit char, about 30720 code points would take 15 bits, and the remainder would take 30. With a 10-bit char, 128 code points would take 10 bits, 65408 would take 20, and the remainder would take 30.
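
A sketch of the 15-bit variant (the exact split of the reserved range is an assumption; any UTF-16-like scheme works the same way): 32768 − 2048 = 30720 values encode code points directly, and the 2048 reserved values pair up as 1024 "high" + 1024 "low" units, covering 2^20 further code points in two units:

```c
/* UTF-16-style encoding into 15-bit units (sketch; the choice of
 * 30720..31743 as high units and 31744..32767 as low units is assumed).
 * Returns the number of units written to out. */
static int encode15(unsigned long cp, unsigned out[2])
{
    if (cp < 30720) {            /* direct: one 15-bit unit */
        out[0] = (unsigned)cp;
        return 1;
    }
    cp -= 30720;                 /* offset into the 2^20 paired range */
    out[0] = 30720 + (unsigned)(cp >> 10);     /* high unit */
    out[1] = 31744 + (unsigned)(cp & 0x3FF);   /* low unit */
    return 2;
}
```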

To mitigate the cost of dividing by 3, it may be helpful to have each char* contain two words; one would identify the word containing the character, and the other would identify the offset, in characters, from the start of that word. Adding a constant offset to a pointer known to be normalized could use code like:

p += 5; // Becomes (with three chars per word, offset in 0..2)...
if (p.offset == 0)      { p.offset = 2; p.base += 1; }
else if (p.offset == 1) { p.offset = 0; p.base += 2; }
else                    { p.offset = 1; p.base += 2; }

Not great, but it would avoid any need for a "divide" step.
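
For comparison, the fully general increment (type and function names hypothetical) is the divide that the special-cased code avoids:

```c
typedef struct { unsigned base, offset; } char_ptr;  /* hypothetical fat pointer */

/* General p += n with three 10-bit chars per word: one divide, one modulo. */
static char_ptr char_ptr_add(char_ptr p, unsigned n)
{
    unsigned t = p.offset + n;
    p.base  += t / 3;    /* whole words advanced */
    p.offset = t % 3;    /* remaining character within the word */
    return p;
}
```

A compiler could emit the divide-free form whenever the added offset is a compile-time constant and fall back to this for the general case.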

supercat
  • 77,689
  • 9
  • 166
  • 211