
I'm making a dictionary compressor in C with dictionary max size 64000. Because of this, I'm storing my entries as 16-bit integers.

What I'm currently doing: To encode 'a', I get its ASCII value, 97, and then convert this number into a string representation of the 16-bit integer of 97. So I end up encoding '0000000001100001' for 'a', which obviously isn't saving much space in the short run.

I'm aware that more efficient versions of this algorithm would start with smaller integer sizes (fewer bits of storage until more are needed), but I'm wondering if there's a better way to either

  1. Convert my integer '97' into an ASCII string of fixed length that can store 16 bits of data (97 would be x digits, 46347 would also be x digits)

  2. Write to a file that can ONLY store 1s and 0s. Because as it is, it seems like I'm writing 16 ASCII characters to a text file, each of which is 8 bits... so that's not really helping the cause much, is it?

Please let me know if I can be more clear in any way. I'm pretty new to this site. Thank you!

EDIT: How I store my dictionary is entirely up to me as far as I know. I just know that I need to be able to easily read the encoded file back and get the integers from it.

Also, I can only include stdio.h, stdlib.h, string.h, and header files I wrote for the program.

Ethan Roseman
    "Most efficient way to store an unsigned 16-bit Integer to a file" - `write(fd, &the_integer, sizeof(the_integer));` –  Mar 20 '13 at 17:01
  • Just skip the ASCII conversions. A 2 byte integer should occupy 2 bytes in the file. – Drew Dormann Mar 20 '13 at 17:04
  • 3
    I don't suppose you have the slightest desire to make this dictionary platform independent? If so, run the `uint16_t` values you're going to store through [`htons()`](http://linux.die.net/man/3/htons) before storing them, and through [`ntohs()`](http://linux.die.net/man/3/ntohs) when reading them back. Though not part of the standard library, these *are* part of POSIX.1-2001, and likely available on your implementation. – WhozCraig Mar 20 '13 at 17:07
  • Use binary file writing: whatever a variable occupies in memory, it occupies in the file, which also means file size = number of bytes (provided no file formatting exists). – Koushik Shetty Mar 20 '13 at 17:08
  • @Koushik the file format OP needs to write to only takes string 0 and 1s. It's a text file, not a binary one – Ed Rowlett-Barbu Mar 20 '13 at 17:08
  • @Ethan Roseman what is the range of characters that will be written into the dictionary? Are your words made up of ASCII chars only? Or is it an even more limited set of strings? – Ed Rowlett-Barbu Mar 20 '13 at 17:10
  • @Zadirion if storing and retrieving is up to him, then why store 0's and 1's? Why a text file? A binary file will do the job and save space. – Koushik Shetty Mar 20 '13 at 17:22
  • @Zadirion Yes, this sounds exactly like what I want to be doing. I just have no idea how in C. – Ethan Roseman Mar 20 '13 at 17:23
  • @H2CO3 That isn't a very reliable way to store an unsigned 16-bit integer into a file. – autistic Mar 20 '13 at 18:26
  • @modifiablelvalue Nobody said it was correct, it's just the most lightweight approach I know of. –  Mar 20 '13 at 18:31
  • @H2CO3: Creating a file specification that isn't specific is a great way to create files that aren't compatible with other systems, which is a great way to invoke undefined behaviour... – autistic Mar 20 '13 at 19:24

2 Answers


Please, do ignore these people who are suggesting that you "write directly to the file". There are a number of issues with that, which ultimately fall into the category of "integer representation". While there may appear to be compelling reasons to write integers straight to external storage using fwrite or what-not, there are some solid facts in play here.

The bottleneck is the external storage controller. Either that, or the network, if you're writing a network application. Thus, writing two bytes as a single fwrite, or as two distinct fputcs, should be roughly the same speed, provided your memory profile is adequate for your platform. You can adjust the size of the buffer that your FILE *s use with setvbuf, so you can always fine-tune per platform based on what your profiler tells you.

Underlying integer representations are inconsistent between today's computers. If you write unsigned ints directly to a file on system X, which uses 32-bit ints and big endian representation, you'll run into issues reading that file on system Y, which uses 16-bit ints and little endian representation, or on system Z, which uses 64-bit ints with mixed endian representation and 32 padding bits. Nowadays we have everything from computers 15 years old that people torture themselves with, to ARM big.LITTLE SoCs, smartphones and smart TVs, gaming consoles and PCs, all of which have their own quirks that fall outside the realm of standard C, especially with regard to integer representation, padding and so on.

C was developed with abstractions in mind that allow you to express your algorithm portably, so that you don't have to write different code for each OS! Here's an example of reading and converting four hex digits to an unsigned int value, portably:

unsigned int value;
int value_is_valid = fscanf(fd, "%04x", &value) == 1;
assert(value_is_valid); // #include <assert.h>
                        /* NOTE: Actual error correction should occur in place of that
                         *       assertion
                         */

I should point out the reason why I chose %04x and not %08x or something more contemporary... judging by questions asked even today, there are unfortunately students using textbooks and compilers that are over 20 years old... Their int is 16-bit and, technically, their compilers are compliant in that aspect (though they really ought to push gcc and llvm throughout academia). With portability in mind, here's how I'd write that value:

value &= 0xFFFF;
fprintf(fd, "%04x", value);
/* side-note: We often don't check the return value of `fprintf`, but it
 * can become very important, particularly when dealing with streams and
 * large files... */

Supposing your unsigned int values fit in two bytes, here's how I'd read those two bytes, portably, using big endian representation:

int hi = fgetc(fd);
int lo = fgetc(fd);
unsigned int value = 0;
assert(hi >= 0 && lo >= 0); // again, proper error detection & handling logic should be here
value += hi & 0xFF; value <<= 8;
value += lo & 0xFF;

... and here's how I'd write those two bytes, in their big endian order:

fputc((value >> 8) & 0xFF, fd);
fputc(value & 0xFF, fd);
// and you might also want to check this return value (perhaps in a finely tuned end product)

Perhaps you're more interested in little endian. The neat thing is, the code really isn't that different. Here's input:

int lo = fgetc(fd);
int hi = fgetc(fd);
unsigned int value = 0;
assert(hi >= 0 && lo >= 0);
value += hi & 0xFF; value <<= 8;
value += lo & 0xFF;

... and here's output:

fputc(value & 0xFF, fd);
fputc((value >> 8) & 0xFF, fd);

For anything larger than two bytes (i.e. a long unsigned or long signed), you might want to fwrite((char unsigned[]){ value >> 24, value >> 16, value >> 8, value }, 1, 4, fd); or something for example, to reduce boilerplate. With that in mind, it doesn't seem abusive to form a preprocessor macro:

#define write(fd, ...) fwrite((char unsigned[]){ __VA_ARGS__ }, 1, sizeof ((char unsigned[]){ __VA_ARGS__ }), fd)

I suppose one might look at this like choosing the better of two evils: preprocessor abuse or the magic number 4 in the code above, because now we can write(fd, value >> 24, value >> 16, value >> 8, value); without the 4 being hard-coded... but a word for the uninitiated: side-effects might cause headaches, so don't go causing modifications, writes or global state changes of any kind in arguments of write.

Well, that's my update to this post for the day... Socially delayed geek person signing out for now.

autistic

What you are contemplating is to use ASCII characters to save your numbers; this is completely unnecessary and most inefficient.

The most space-efficient way to do this (without utilizing complex algorithms) would be to just dump the bytes of the numbers into the file (the number of bits would have to depend on the largest number you intend to save), or have multiple files for 8-bit, 16-bit, etc.

Then when you read the file, you know that your numbers occur every x bits, so you just read them out one by one, or in one big chunk, and then turn the chunk into an array of a type that matches x bits.

  • The largest number I intend to save is 64000, so how would I dump the bytes of the numbers into the file with this in mind? Thanks for your help so far! – Ethan Roseman Mar 20 '13 at 18:43
  • size_t fwrite(const void *ptr, size_t size_of_elements, size_t number_of_elements, FILE *a_file); –  Mar 20 '13 at 18:46
  • 1
    fwrite(&anything, sizeof(anything), 1, myfile); –  Mar 20 '13 at 18:48
  • Not portable! Suppose one system writes 2 bytes, another system reads 4... Is this how C is written in the workplace? If I were a lecturer I'd fail any students who use this... – autistic Mar 20 '13 at 19:56
  • 2
    Why would another system read 4? It's only for his system and if he knows it will be 2 bytes there is no danger no matter what system you use so long as you tell it it's 2 bytes with is easy to do with integer types, __int16 comes to mind. –  Mar 20 '13 at 20:51
  • @dingrite: This is a university assignment... It could be as simple as an examining lecturer using a different compiler, or different compiler switches that causes different integer size or representation. University is meant to train professionals, and with this attitude I hardly think you are sufficiently qualified to train professionals. Which book are you reading? – autistic Mar 21 '13 at 15:29
  • @modifiablelvalue Please name a compiler that would treat uint16_t as anything other than 16 bit. But I can see you are obsessed with portability (integer representation especially unsigned to my knowledge comes in only one form). Suit yourself with larger code and more overhead. –  Mar 23 '13 at 02:24