Computer Memory Allocation for Duplicate Inputs

Question

I'm taking Introduction to CS (CS50, Harvard) and we're learning type declaration in C. When we declare a variable and assign a type, the computer allocates a specific number of bits/bytes (1 byte for char, 4 bytes for int, 8 bytes for double, etc.).

For instance, if we declare the string "EMMA", we're using 5 bytes, 1 for each "char" and 1 extra for the \0 null byte.

Well, I was wondering why 2 M's are allocated separate bytes. Can't the computer make use of the chars or integers currently taking space in the memory and refer to that specific slot when it wants to reuse it?

Would love some education on the matter (without getting too deep, as I'm fairly new to the field).

Edit: Fixed some bits into bytes — my bad

_For instance if we declare the string "EMMA", we're using 5 bits_; you mean 5 bytes. Regarding your question, take a look at [Data structure alignment](https://en.wikipedia.org/wiki/Data_structure_alignment) — David Ranieri, Feb 24 '20 at 13:04
Well a reference must at least have an address where to find the real value - and the address would take 4 or 8 bytes (32-bit/64-bit system) — Odysseus, Feb 24 '20 at 13:08
Basically you want to know why it can't just store "EMA" and remember to use the M twice. But how would it remember to use the M twice? — user253751, Feb 24 '20 at 13:09
Think of it this way: why do you write two L when you write "Hallo" on a piece of paper, instead of "Halo"? You'll save paper space if you skip one L. Except now the word has a different meaning. So you have to explain that, by writing the following on top of the paper: "in the text below, replace 'Halo' with 'Hallo'". And that text takes up far more paper space than those L did. — Lundin, Feb 24 '20 at 13:51

Simon Doppler · Accepted Answer · 2020-02-27T07:07:45.220

1 byte for char, 4 bytes for int, 8 bytes for doubles etc...

These are typical values, but they depend on the architecture (per this answer, architectures with 9-bit bytes are even still being sold these days).

Can't the computer make use of the chars or integers currently taking space in the memory and refer to that specific slot when it wants to reuse it?

While this idea is certainly feasible in theory, in practice the overhead is way too big for simple data like characters: one character is usually a single byte.

If we were to set up a system in which each character value is stored once and the string only refers to it, the string would become a series of elements recording which character belongs at each position. In C such an element would be a pointer (you will encounter them at some point in your course), which is usually either 4 or 8 bytes long (32 or 64 bits). Assuming you use 32-bit pointers, you would use 24 bytes of memory to store the string in this complex manner (five 4-byte pointers, plus one byte for each of the four distinct characters E, M, A and \0) instead of 5 bytes using the simpler method (to expand on this answer, you would need even more metadata to be able to properly modify the string during your program's execution).
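The arithmetic above can be sketched in C. This is only an illustration: the function names `flat_size` and `dedup_size` are made up, and the "deduplicated" layout (one byte per distinct character plus one fixed-size reference per position) is an assumption about how such a scheme might look, not something C does.

```c
#include <stddef.h>
#include <string.h>

/* Bytes used by the normal layout: one byte per character plus the '\0'. */
size_t flat_size(const char *s) {
    return strlen(s) + 1;
}

/* Bytes used by a hypothetical layout that stores each distinct character
   once and keeps one reference of ref_size bytes per position.
   The terminating '\0' is counted as a character too. */
size_t dedup_size(const char *s, size_t ref_size) {
    int seen[256] = {0};
    size_t distinct = 0;
    size_t positions = strlen(s) + 1;

    for (size_t i = 0; i < positions; i++) {
        unsigned char c = (unsigned char)s[i];
        if (!seen[c]) {
            seen[c] = 1;
            distinct++;
        }
    }
    return distinct + positions * ref_size;
}
```

Under these assumptions, `flat_size("EMMA")` gives 5, while `dedup_size("EMMA", 4)` gives 24 and `dedup_size("EMMA", 8)` gives 44.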

Your idea of storing a chunk of data and referring to it multiple times does however exist in several cases:

virtual memory (you will encounter this if you go towards OS development), where copy-on-write is used
higher level languages (like C++)
filesystems which implement a copy-on-write feature, like BTRFS
some backup systems (like borg or rsync) which deduplicate the files/chunks they store
Facebook's Zstandard compression algorithm, where a dictionary of small common chunks of data is used to improve compression ratio and speed

In such settings, where lots of data are stored, the information needed to store the data once and refer to it multiple times is small relative to the data itself, so the saved space and faster copies are worth the added complexity.
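A rough way to see where the trade-off flips, as a sketch only: the helper `dedup_savings` below is hypothetical and ignores all bookkeeping other than the references themselves, which real systems cannot do.

```c
/* Bytes saved (positive) or wasted (negative) by keeping one shared copy
   of a chunk plus `copies` references to it, instead of `copies` full
   copies of the chunk. All sizes are in bytes. */
long long dedup_savings(long long chunk_size, long long copies, long long ref_size) {
    long long plain = copies * chunk_size;              /* every copy stored in full */
    long long shared = chunk_size + copies * ref_size;  /* one copy plus references */
    return plain - shared;
}
```

For the two M's with 4-byte references, `dedup_savings(1, 2, 4)` is -7, so sharing costs 7 extra bytes. For two copies of a 4096-byte chunk with 8-byte references, `dedup_savings(4096, 2, 8)` is 4080, which is the kind of case where these systems pay off.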

Oh wow! Correct me if I'm wrong, but basically what you're saying is that for this specific example such memory reference automation would actually tax more than it helps, right? And it's also very cool to know in such large data sets the question I've had in mind is considered. Can't wait to advance in the course and learn more, thanks for your clarification! :) — ahmetalper, Feb 24 '20 at 13:27
You are correct: for small amounts of data, it would. However, there is no absolute measurement of how big the data should be before switching to such a system is worth it (but for characters it is definitely not worth it). — Simon Doppler, Feb 24 '20 at 13:30

score 1 · Answer 2 · answered Feb 24 '20 at 13:20

For instance if we declare the string "EMMA", we're using 5 bits

I am sure you are speaking about 5 bytes instead of 5 bits.

Well, I was wondering why 2 M's are allocated separate bits. Can't the computer make use of the chars or integers currently taking space in the memory and refer to that specific slot when it wants to reuse it?

A pointer to a "slot" usually occupies 4 or 8 bytes. So it makes no sense to spend 8 bytes to point to an object that occupies only one byte.

Moreover, "EMMA" is a character array that consists of adjacent bytes. So all elements of the array have the same type and, correspondingly, the same size.

The compiler can reduce memory usage by avoiding duplicated string literals. For example, it can store identical string literals as a single string literal. This depends on a compiler option.

So if in the program the same string literal occurs for example two times as in these statements

char *s = malloc( sizeof( "EMMA" ) );
strcpy( s, "EMMA" );

then the compiler can store only one copy of the string literal.
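Whether that merging actually happens can be checked by comparing addresses. The sketch below uses made-up function names; the result of the first function is unspecified by the C standard and depends on the compiler and its options, so it may be 0 or 1.

```c
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the compiler stored the two identical literals only once,
   0 if it kept two separate copies. Either result is allowed. */
int literals_share_storage(void) {
    const char *first = "EMMA";
    const char *second = "EMMA";
    return first == second;
}

/* A malloc'ed copy always lives in its own memory, even though its
   contents are equal to the literal. Returns 1 on success, -1 if
   the allocation failed. */
int heap_copy_is_distinct(void) {
    const char *literal = "EMMA";
    char *s = malloc(sizeof("EMMA"));   /* 5 bytes, including the '\0' */
    if (s == NULL) {
        return -1;
    }
    strcpy(s, literal);
    int distinct = (s != literal) && (strcmp(s, literal) == 0);
    free(s);
    return distinct;
}
```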

score 0 · Answer 3 · answered Feb 24 '20 at 13:07

The compiler is not supposed to be the program itself. It should do the minimum, in a way that is easy for programmers to understand and manipulate. In other words, it has to be general.

As a programmer, you can make your program store data in the suggested way, but that won't be general.

For example: I am making a database for my school, I entered a wrong name, and now I want to change the second 'M' in "EMMA". This would be troublesome if the system worked the way you suggest.
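A small sketch of that problem, assuming a hypothetical scheme where each position points into a pool of distinct characters (the function name and the pool layout are made up for illustration):

```c
#include <string.h>

/* Changes the character at index 2 to 'N' in both layouts and writes
   the resulting strings into the two output buffers. */
void edit_second_m(char flat_out[5], char shared_out[5]) {
    /* Normal layout: every position has its own byte. */
    char flat[] = "EMMA";
    flat[2] = 'N';                         /* only the second M changes */
    memcpy(flat_out, flat, sizeof flat);

    /* Hypothetical shared layout: both M positions point at one slot. */
    char pool[] = { 'E', 'M', 'A', '\0' };
    char *shared[5] = { &pool[0], &pool[1], &pool[1], &pool[2], &pool[3] };
    *shared[2] = 'N';                      /* the first M changes as well */
    for (size_t i = 0; i < 5; i++) {
        shared_out[i] = *shared[i];
    }
}
```

The normal layout ends up as "EMNA", while the shared layout ends up as "ENNA": to change just one of the letters, the shared scheme would first have to give that position a slot of its own.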

Would love to clarify further if needed. :)
