79

If, say, a 32-bit integer is overflowing, instead of upgrading int to long, can we make use of some 40-bit type if we need a range only within 2^40, so that we save 24 (64 − 40) bits for every integer?

If so, how?

I have to deal with billions of these integers, and space is the bigger constraint.

Pier-Luc Gendreau
  • 13,553
  • 4
  • 58
  • 69
Aragon
  • 1,551
  • 1
  • 15
  • 25
  • Why would you want to write more code (and hence consume more bytes) trying to save a few bytes, when you can write less code and use long instead? – Aniket Inge Dec 30 '14 at 12:25
  • 5
    Also memory is very cheap compared to CPU cycles to save those precious bytes – Ed Heal Dec 30 '14 at 12:31
  • 9
    @user1810087, Aniket... how do you know it is unnecessary? Or that it consumes more bytes than it saves? Do you know the requirements and constraints? Perhaps he handles TB of data, where those 'few bytes' add up? – Greenflow Dec 30 '14 at 12:32
  • 26
    @Aniket: I see some use cases for that, especially when working with large datasets. I'm currently working with volumetric simulations in a 1024^3 cube. We have implemented a custom 36-bit datatype, since this makes the difference whether our application can be used with 8GB RAM or not. Example: a 1024^3 cube with 64 bit = 8192 MB, with 36 bit = 4608 MB. In this case the little extra code really doesn't matter. – BDL Dec 30 '14 at 12:32
  • Wouldn't someone with a huge dataset who is even considering this just use virtual memory? Whack in some more memory (it is not that expensive) + a fast HDD (or even an SSD). A lot cheaper than writing/testing code, and it would work faster as well – Ed Heal Dec 30 '14 at 12:38
  • 1
    @Greenflow, BDL: you only save these bits if you combine the leftover bits into a new variable. This extra logic/calculation will end up as a runtime performance loss, even if you have enough RAM. But letting the OS do its thing (swapping/paging), if possible and **necessary**, would result in nearly the same performance. You are right if the RAM/OS is limited, though. So I should rather say it's **mostly** unnecessary. – user1810087 Dec 30 '14 at 12:45
  • 5
    There are some processors that implement 40-bit integers in hardware (e.g. some Texas Instruments processors). If you are running on one of these processors I would say yes, go ahead! But if you are on hardware like x86 that only has 32- and 64-bit integers, the costs may well outweigh the benefits of using 40-bit integers. – Trevor Boyd Smith Dec 30 '14 at 12:46
  • 26
    @All: How about letting user1660982 decide if he/she really wants it or not? Nobody here knows the amount of data or if speed is important. – deviantfan Dec 30 '14 at 12:47
  • 4
    @deviantfan, exactly. Of course, it could be a typical x-y problem. Without any context one cannot know. Nevertheless, the question is IMHO not uninteresting. And I pray I never have to implement something like that. :-) – Greenflow Dec 30 '14 at 12:53
  • All, I have to process a lot of data and space is a constraint for me. @user1810087 If we store the extra 8 bits in a new variable, what will the performance benefit be? – Aragon Dec 30 '14 at 13:00
  • To the comments/answers regarding performance: How many of these integers would need to be stored in memory at the same time? If the answer is "not many" then performance could possibly be worse with 40-bit integers due to all the masking and shifting operations. Do you care more about space or performance? – user1354557 Dec 30 '14 at 17:10
  • 1
    Basically, if you have to ask then the answer is "No". It can be done but is messy and you have to understand what you're doing. – Hot Licks Dec 30 '14 at 20:26
  • @user1660982 Are you sure 39 is too little? The code for storing and retrieving the values will not be fast; in this situation, you don't need a multiple of 8. So if you can stuff your values in 39 or maybe 35 bits, all the better. – anatolyg Dec 30 '14 at 21:03
  • 1
    Consider that a product of two 32-bit integers is a 64-bit integer that can easily overflow 40 bits as well. – Michael Dec 31 '14 at 06:14
  • 3
    @EdHeal the performance of a miss could be severe. SSD latency is 0.1 ms (100 us) while memory latency is 0.2 us (data from Wikipedia), so a miss costs 500x the access time. It means that a read of 8 GiB takes 3.5 minutes instead of 419 ms (assuming 4k pages, linear reading and infinite throughput). If you perform 10,000 iterations of the algorithm it adds up to quite a bit - 24 days vs. 70 minutes. There are techniques which employ more data locality than naive iteration over the set; they are more complicated than 40-bit ints, but they might be used nonetheless as L3$ is even faster. – Maciej Piechotka Dec 31 '14 at 07:04
  • @MaciejPiechotka - The question is vague. It depends on the nature of what is being done. I think that just buying extra memory is probably the best bet. – Ed Heal Dec 31 '14 at 07:07
  • 1
    @EdHeal I don't know the OP's situation, but if he has 1 billion integers that's 8 GB just for this array. Therefore if it's more than 4 or 8 billion, (s)he needs a 'server-class' processor. If (s)he is a student, or the program is running on users' computers, upgrading the processor, motherboard and RAM might not be an option, and coding a bit more might simply be the only viable option. Also, I was answering the suggestion to just use virtual memory, and there are many applications where 8 GiB is the size of a toy problem (i.e. just to test perf). – Maciej Piechotka Dec 31 '14 at 07:27
  • @MaciejPiechotka - We simply do not know. I just think that not using 64 bit makes the code more complex, more prone to errors, and reduces CPU performance. I think the other options should be explored, and this used only as a very, very last resort. – Ed Heal Dec 31 '14 at 07:31
  • 1
    I don't know how anyone can assume/claim that this optimization is not worth it. It totally depends on the number of such integers required. – usr Dec 31 '14 at 14:17
  • 1
    @EdHeal - well, you can make an educated guess, and the difference is significant - 24 days vs 70 minutes in the example from my rough calculation (even worse if you have an HDD), assuming that "billions" means one. "Also memory is very cheap compared to CPU cycles to save those precious bytes" - that's not (necessarily) true for many applications. The von Neumann bottleneck is growing, so memory access can be more expensive than CPU cycles (as long as your data doesn't fit nicely into D$), and this area is being researched as far as I know - particularly if you don't have nice access patterns. – Maciej Piechotka Jan 01 '15 at 12:09
  • 1
    Surely the data needs to come from / go to some long-term storage, so what is the point in loading the lot into memory in one go? Surely better effort could be employed in improving the algorithm. – Ed Heal Jan 01 '15 at 12:20
  • @deviantfan, usr That's exactly the problem. OP hasn't given us remotely enough information to advise them sufficiently. In general, trying to split data up like this is a bad idea, and that's what the comments here are saying. If OP does happen to have a valid use case for doing so, s/he has done a very poor job of demonstrating it. – JLRishe Jan 02 '15 at 04:56
  • @EdHeal The data needs to come from storage once, so you pay the 3.5-minute cost once. If you are swapping using an LRU algorithm you pay the 3.5 minutes at each iteration of the algorithm, which means that you pay it as many times as you perform iterations - so in the case of 10,000 iterations the cost is 24 days. "Surely better effort could be employed in improving the algorithm." - I'm not sure what you mean. Many big data algorithms are 'effective' in the sense that they have the best algorithmic complexity (say O(n)). The only remaining part is to optimize the constant factor... – Maciej Piechotka Jan 04 '15 at 21:15
  • And for moving data to and from memory, optimal use of the cache ($) is usually considered the way to think about it, due to the von Neumann bottleneck, as far as I know - there is no point in speeding up the CPU side if you cannot feed it with sufficient speed. We don't know if this is the case with the OP, but I can sympathize with not wanting to attach a "thesis summary" to each optimization question. – Maciej Piechotka Jan 04 '15 at 21:21
  • related: [Which C datatype can represent a 40-bit binary number?](http://stackoverflow.com/questions/9595225/which-c-datatype-can-represent-a-40-bit-binary-number) – phuclv Feb 05 '15 at 09:04
  • type sizes depend on architecture. If you're on some targets with [32-bit int and 40-bit long](http://stackoverflow.com/a/28352310/995714) like many TI DSPs then you can simply use long on that target – phuclv Mar 21 '15 at 07:05
  • Possible duplicate of [Which C datatype can represent a 40-bit binary number?](http://stackoverflow.com/questions/9595225/which-c-datatype-can-represent-a-40-bit-binary-number) – phuclv Apr 01 '17 at 02:43

14 Answers

83

Yes, but...

It is certainly possible, but it is usually nonsensical (for any program that doesn't use billions of these numbers):

#include <stdint.h> // don't want to rely on something like long long
struct bad_idea
{
    uint64_t var : 40;
};

Here, var will indeed have a width of 40 bits at the expense of much less efficient code generated (it turns out that "much" is very much wrong -- the measured overhead is a mere 1-2%, see timings below), and usually to no avail. Unless you have need for another 24-bit value (or an 8-bit and a 16-bit value) which you wish to pack into the same structure, alignment will forfeit anything that you may gain.

In any case, unless you have billions of these, the effective difference in memory consumption will not be noticeable (but the extra code needed to manage the bit field will be noticeable!).

Note:
The question has in the meantime been updated to reflect that indeed billions of numbers are needed, so this may be a viable thing to do, provided that you take measures not to lose the gains to structure alignment and padding, i.e. either by storing something else in the remaining 24 bits or by storing your 40-bit values in structures of 8 each or multiples thereof.
Saving three bytes a billion times over is worthwhile, as it will require noticeably fewer memory pages and thus cause fewer cache and TLB misses, and above all fewer page faults (a single page fault weighing in at tens of millions of instructions).

While the above snippet does not make use of the remaining 24 bits (it merely demonstrates the "use 40 bits" part), something akin to the following will be necessary to really make the approach useful in the sense of preserving memory -- provided that you indeed have other "useful" data to put in the holes:

struct using_gaps
{
    uint64_t var           : 40;
    uint64_t useful_uint16 : 16;
    uint64_t char_or_bool  : 8;  
};

Structure size and alignment will be equal to a 64-bit integer, so nothing is wasted if you make e.g. an array of a billion such structures (even without using compiler-specific extensions). If you don't have use for an 8-bit value, you could also use a 48-bit and a 16-bit value (giving a bigger overflow margin).
Alternatively you could, at the expense of usability, put 8 40-bit values into a structure (least common multiple of 40 and 64 being 320 = 8*40). Of course then your code which accesses elements in the array of structures will become much more complicated (though one could probably implement an operator[] that restores the linear array functionality and hides the structure complexity).
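
By way of illustration, here is a minimal, hedged sketch of such a wrapper (all names are mine, and this is not the benchmarked code linked below): eight 40-bit values packed into five 64-bit words, with a proxy class standing in for the uint40_t& that the language cannot express:

#include <cstdint>

struct Packed40
{
    uint64_t words[5];                        // 5 * 64 = 320 = 8 * 40 bits

    class Ref                                 // proxy standing in for "uint40_t&"
    {
        uint64_t* words;
        unsigned  bitpos;                     // bit offset of the 40-bit slot
    public:
        Ref(uint64_t* w, unsigned i) : words(w), bitpos(i * 40) {}

        operator uint64_t() const             // read; may straddle two words
        {
            unsigned w = bitpos / 64, s = bitpos % 64;
            uint64_t v = words[w] >> s;
            if (s > 24)                       // fewer than 40 bits left in this word
                v |= words[w + 1] << (64 - s);
            return v & ((uint64_t(1) << 40) - 1);
        }

        Ref& operator=(uint64_t v)            // write; mask out old bits, OR in new
        {
            const uint64_t mask = (uint64_t(1) << 40) - 1;
            v &= mask;
            unsigned w = bitpos / 64, s = bitpos % 64;
            words[w] = (words[w] & ~(mask << s)) | (v << s);
            if (s > 24)
            {
                unsigned spill = s - 24;      // bits that land in the next word
                words[w + 1] = (words[w + 1] & ~((uint64_t(1) << spill) - 1))
                             | (v >> (64 - s));
            }
            return *this;
        }
    };

    Ref operator[](unsigned i) { return Ref(words, i); }  // i in [0,8)
};

An array of such structures then stores logical element k at pack[k / 8][k % 8], which an outer operator[] can hide in exactly the same way.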

Update:
Wrote a quick test suite, just to see what overhead the bitfields (and operator overloading with bitfield refs) would have. Posted code (due to length) at gcc.godbolt.org, test output from my Win7-64 machine is:

Running test for array size = 1048576
what       alloc   seq(w)  seq(r)  rand(w)  rand(r)  free
-----------------------------------------------------------
uint32_t    0      2       1       35       35       1
uint64_t    0      3       3       35       35       1
bad40_t     0      5       3       35       35       1
packed40_t  0      7       4       48       49       1


Running test for array size = 16777216
what        alloc  seq(w)  seq(r)  rand(w)  rand(r)  free
-----------------------------------------------------------
uint32_t    0      38      14      560      555      8
uint64_t    0      81      22      565      554      17
bad40_t     0      85      25      565      561      16
packed40_t  0      151     75      765      774      16


Running test for array size = 134217728
what        alloc  seq(w)  seq(r)  rand(w)  rand(r)  free
-----------------------------------------------------------
uint32_t    0      312     100     4480     4441     65
uint64_t    0      648     172     4482     4490     130
bad40_t     0      682     193     4573     4492     130
packed40_t  0      1164    552     6181     6176     130

What one can see is that the extra overhead of bitfields is negligible, but the operator overloading with a bitfield reference as a convenience thing is rather drastic (about a 3x increase) when accessing data linearly in a cache-friendly manner. On the other hand, on random access it barely even matters.

These timings suggest that simply using 64-bit integers would be better since they are still faster overall than bitfields (despite touching more memory), but of course they do not take into account the cost of page faults with much bigger datasets. It might look very different once you run out of physical RAM (I didn't test that).

Michael Kohne
  • 11,888
  • 3
  • 47
  • 79
Damon
  • 67,688
  • 20
  • 135
  • 185
  • 1
    I was thinking the same thing, but bit-field members with more than 32 bits are a gcc extension and not part of the C standard (try compiling your code with `-Wpedantic`). – bitmask Dec 30 '14 at 12:34
  • 2
    Interesting... clang groks it just fine here (even with -Wpedantic). As does my GCC. Was that constraint to 32 bits maybe relaxed with C++11? – Damon Dec 30 '14 at 12:36
  • Ah, indeed. It seems so. I'm too lazy to check the standard right now, but my GCC seems to support your observation. +1 then – bitmask Dec 30 '14 at 12:40
  • As user1810087 suggested in the comments, can we use another variable for storing those 8 bits? What will be the difference in performance and memory against `uint64_t var : 40;`? – Aragon Dec 30 '14 at 13:20
  • 2
    While this answer is not wrong, it does not really answer the question. – user694733 Dec 30 '14 at 13:22
  • @user1660982: Instruction pipelining makes it impossible to predict the exact overhead, but it typically takes 3 operations to read (load, shift, and) and 3 operations to write (and, shift, or) instead of one. Also, depending on the implementation, it might do an unaligned load, which may (or may not) be allowable on the target platform, but will usually be noticeably slower. – Damon Dec 30 '14 at 13:45
  • 11
    Also, structs containing bitfields are padded to the struct alignment, which is based on the bitfield allocation unit. So if this works, the struct will be padded out to 8 bytes anyways and you won't save any space. – Chris Dodd Dec 30 '14 at 21:16
  • @ChrisDodd: Please note the last paragraph which addresses this. – Damon Dec 31 '14 at 10:06
  • I would add that the actual compiler may (and probably will) apply "data structure padding" so it will not lose efficiency... but you will lose bits, as the smallest fast implementation is used, which would probably be 32-bit anyway. – Lesto Dec 31 '14 at 20:29
  • 3
    You can force byte packing in most compilers (it's a pragma that varies from compiler to compiler), which makes an array of the struct suitably reduced. – Joshua Dec 31 '14 at 21:13
  • Maybe you should update your answer so the names in the performance analysis are compatible with earlier declarations. Also, I deduced that the numbers in the performance analysis designate running times (greater=worse); you might want to mention that explicitly. People have expressed concern on padding and packing; you might want to fix your code to reflect this. Also, the link for off-site code doesn't work for me; is it dead already? I'd downvote this answer because of these problems, but I already did it earlier because of padding/packing, so not voting. – anatolyg Jan 01 '15 at 20:57
  • @anatolyg: The concerns about padding/packing are nonexistent. Structures obviously need to be packed correctly if memory usage is of any concern, but that is not particular to this problem. It is addressed in the answer, and respected in the benchmarking code (`sizeof(packed40_t)` is 320 bits, which is the smallest size to fit a number of 40-bit members without wasting space). The link works fine for me, do you have JS enabled? (godbolt.org needs that). – Damon Jan 01 '15 at 21:08
  • Maybe it was not clear from my comment, so I'll say it in other words: you can make your answer better by including relevant code (declarations of `packed40_t` and `bad40_t`). I can guess their declaration, but maybe OP (or any other person) cannot. Also, showing the syntax for `struct` packing is much better than writing that packing is possible. I can guess the syntax, but maybe OP cannot? BTW the link is good - I fixed some access issues at my side. But still, put relevant code in the answer, not in an external site. – anatolyg Jan 01 '15 at 21:42
  • It's odd that bad40_t gives the same performance as uint64_t. UNLESS - bad40_t contains just one bitfield and the compiler was smart enough to translate access to that bitfield into access to the full uint64_t struct. If that is the case, try adding another bitfield to bad40_t and see if the performance equivalence holds. – namey Jan 01 '15 at 21:53
  • @namey: Getting exactly the same timings with or without. What it does is read a 64-bit integer, and mask out 40 bits using bitwise AND, anyway. Plowing through a gigabyte of data (especially in random order when caches don't help much) is bound by memory bandwidth, so it is not at all surprising that the overhead for extracting the bitfield is a mere 1-2% total. Still, 1% slower is 1% slower, so at least according to this benchmark, it would be an advantage to simply use 64-bit integers (someone should maybe try making `N` 10-20 times bigger and see what the figures are when swap kicks in). – Damon Jan 02 '15 at 10:38
  • Can't you just malloc a big byte-array and read the numbers as 64-bit numbers, masking them down to 40 bits and managing the pointer manually? – Falco Jan 07 '15 at 14:49
  • @Falco: Of course you can, but then you'll have to write even more complicated and ugly (and error-prone) code. The reason why one would want to use bitfields and overloading operator[] with a ref class (same thing that `std::bitset` does, too) is to hide the underlying data layout and letting the compiler do the heavy lifting while you write code like `elem[1999]` to get the 2,000th element (instead of figuring whether you need to load one or two 64-bit values and how to and-shift-or them together -- remember 64 is not a multiple of 40, so you don't always just load _one_ 64 bit value). – Damon Jan 07 '15 at 15:20
  • Why didn't you used `#pragma pack(1)` for any of these? – Inverse Jan 23 '15 at 00:19
  • @Inverse: For two reasons: First, pragmas are compiler-dependent, non-portable. Second, the pragma almost certainly wouldn't help here. You can tell the compiler to avoid extra padding such as happens between fundamental types of different sizes, but you can hardly force it to cram together two "half fundamental" types, which is the case here. – Damon Jan 23 '15 at 10:23
  • On an 8-bit architecture such as AVR or PIC, this would probably be *much* more efficient than attempting to perform 64-bit operations. – Caleb Reister Nov 24 '17 at 19:30
  • Why did you only use `gcc -O1` instead of `-O2` or `-O3`? At least `-O2` is standard; `-O1` leaves out lots of optimizations. (And BTW, I had a look at the asm for `struct __attribute__((packed)) bad40_t`, which makes it `good40_t`, except that gcc is pretty braindead and does 5 single-byte loads, even inside the loop where it knows it's safe to access the next byte. I'm not sure what gcc is doing for `packed40_t`. It may be branching every iteration instead of unrolling by 8 so it goes away. It's not using movzx byte loads (movzbl), though. I think it's using mov / and. – Peter Cordes Dec 16 '17 at 20:49
  • @PeterCordes: I didn't use `-O1`, I never compile with anything lower than `-O2`. Not sure where you got the idea from that I did use `-O1`, probably because that is what, by pure coincidence, is shown by the godbolt web interface (which I don't care about)? As stated in the answer, I posted the code which is about 4kB over there as that would be a bit long for being inline in the answer, that's the only reason. – Damon Dec 16 '17 at 21:24
  • Your Godbolt link has `-O`. I assumed (because you didn't say differently) that you included your compile options as well as source, so the asm output reflected what you actually ran. That's half the point of Godbolt instead of Pastebin... – Peter Cordes Dec 16 '17 at 21:27
  • You can make a 5-byte struct that compiles about as efficiently as possible (https://godbolt.org/g/vQGzbq) with gcc for separate structs instead of packed into groups with consistent alignment, using `char values[5]` instead of bitfields. I used memcpy to portably do unaligned load/stores into it. (As I said, gcc's bitfield code was horrible for a packed struct; this is much better). It might be worse than using a larger struct to group into a repeating pattern of alignment, though. – Peter Cordes Dec 17 '17 at 14:42
55

You can quite effectively pack 4 × 40-bit integers into a 160-bit struct like this:

struct Val4 {
    char hi[4];            // top 8 bits of each value (carries the sign)
    unsigned int low[4];   // bottom 32 bits of each value
};

long getLong( const Val4 &pack, int ix ) {
  int hi= pack.hi[ix];   // preserve sign into 32 bit
  return long( (((unsigned long)hi) << 32) + (unsigned long)pack.low[ix]);
}

void setLong( Val4 &pack, int ix, long val ) {
  pack.low[ix]= (unsigned)val;
  pack.hi[ix]= (char)(val>>32);
}

These again can be used like this:

Val4 vals[SIZE];

long getLong( int ix ) {
  return getLong( vals[ix>>2], ix&0x3 );
}

void setLong( int ix, long val ) {
  setLong( vals[ix>>2], ix&0x3, val );
}
runec
  • 1,687
  • 12
  • 22
  • 13
    A code snippet that actually saves memory *after* padding is accounted for! +1 – Ben Voigt Dec 30 '14 at 20:57
  • 1
    Pro: this actually saves space. Con: this code is probably VERY slow due to the indexing. – SamB Dec 31 '14 at 06:09
  • 2
    It might be worth using `signed char hi[4];` explicitly; plain `char` may be signed or unsigned. – Jonathan Leffler Dec 31 '14 at 10:19
  • 4
    It might be better to use `uint_least32_t` and `int_least8_t` here, rather than `unsigned int` and `char`. `unsigned int` is only required to be at least 16 bits. `char` will always be at least 8 bits, so there's not as much of an issue there. Also, I'd use multiplication instead of bit shifting for the `hi` part of the value; that's well defined, and the compiler can substitute bit shifting if that's appropriate. Other than that, good idea! – Pete Becker Dec 31 '14 at 19:22
  • It wouldn't be hard to make a `vector` of these structures and provide a wrapper that makes the indexing transparent, so you can get to each 40-bit quantity individually. – Mark Ransom Jan 01 '15 at 01:02
  • moving `unsigned int low[4];` up before `char hi[4];` will be more portable, esp. to architectures where `sizeof(int) > 4*sizeof(char)` – phuclv Jan 01 '15 at 04:45
  • 11
    @SamB: It's not immediately clear that this will be "VERY" slow. The thing is that (assuming the compiler is set up to optimise aggressively - including inlining - as it should be for anything involving "billions" of operations!) all of the indexing boils down to CPU-internal operations on registers, which can be done in very few cycles (i.e. fast): typically much faster than retrieving a cache line from memory. Since in total we're now accessing 35% less memory than before (due to the space saving) we can end up with a net win. (Obviously this depends on a lot - measurement recommended :) ) – psmears Jan 01 '15 at 18:26
25

You might want to consider Variable-Length Encoding (VLE)

Presumably, you have to store a lot of those numbers somewhere (in RAM, on disk, sent over the network, etc.), and then take them one by one and do some processing.

One approach would be to encode them using VLE. From Google's protobuf documentation (Creative Commons licence):

Varints are a method of serializing integers using one or more bytes. Smaller numbers take a smaller number of bytes.

Each byte in a varint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first.

So, for example, here is the number 1 – it's a single byte, so the msb is not set:

0000 0001

And here is 300 – this is a bit more complicated:

1010 1100 0000 0010

How do you figure out that this is 300? First you drop the msb from each byte, as this is just there to tell us whether we've reached the end of the number (as you can see, it's set in the first byte as there is more than one byte in the varint)
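
To make that concrete, here is a minimal sketch of the scheme in code (the function names are mine; protobuf's real implementation differs in details such as negative-number handling):

#include <cstdint>
#include <cstddef>

// Encode: low 7 bits per byte, msb flags continuation; least significant group first.
std::size_t encode_varint(uint64_t value, unsigned char* out)
{
    std::size_t n = 0;
    do {
        unsigned char b = value & 0x7F;   // next 7-bit group
        value >>= 7;
        if (value != 0) b |= 0x80;        // more bytes follow
        out[n++] = b;
    } while (value != 0);
    return n;                             // 1..10 bytes for a 64-bit value
}

// Decode: accumulate 7-bit groups until a byte without the msb arrives.
std::size_t decode_varint(const unsigned char* in, uint64_t* value)
{
    uint64_t v = 0;
    unsigned shift = 0;
    std::size_t n = 0;
    unsigned char b;
    do {
        b = in[n++];
        v |= uint64_t(b & 0x7F) << shift;
        shift += 7;
    } while (b & 0x80);
    *value = v;
    return n;                             // bytes consumed
}

Encoding 300 with this yields exactly the two bytes shown above (0xAC 0x02).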

Pros

  • If you have lots of small numbers, you'll probably use less than 40 bits per integer on average. Possibly much less.
  • You are able to store bigger numbers (with more than 40 bits) in the future, without having to pay a penalty for the small ones

Cons

  • You pay an extra bit for each 7 significant bits of your numbers. That means a number with 40 significant bits will need 6 bytes. If most of your numbers have 40 significant bits, you are better off with a bit field approach.
  • You will lose the ability to easily jump to a number given its index (you have to at least partially parse all previous elements in an array in order to access the current one).
  • You will need some form of decoding before doing anything useful with the numbers (although that is true for other approaches as well, like bit fields)
glglgl
  • 89,107
  • 13
  • 149
  • 217
Sam
  • 19,708
  • 4
  • 59
  • 82
  • you can change the smallest unit to 16 or 32 bits so you may save a lot of memory if most of the values is more than 1 byte but fit within 15 or 31 bits – phuclv Dec 31 '14 at 08:30
  • 3
    If the numbers OP is trying to store are uniformly distributed, then there are many more large numbers than small ones, and variable-length encoding will be counter-productive. – Russell Borogove Dec 31 '14 at 20:18
21

(Edit: First of all - what you want is possible, and makes sense in some cases; I have had to do similar things when I tried to do something for the Netflix challenge and only had 1GB of memory; Second - it is probably best to use a char array for the 40-bit storage to avoid any alignment issues and the need to mess with struct packing pragmas; Third - this design assumes that you're OK with 64-bit arithmetic for intermediate results, it is only for large array storage that you would use Int40; Fourth: I don't get all the suggestions that this is a bad idea, just read up on what people go through to pack mesh data structures and this looks like child's play by comparison).

What you want is a struct that is only used for storing data as 40-bit ints but implicitly converts to int64_t for arithmetic. The only trick is doing the sign extension from 40 to 64 bits right. If you're fine with unsigned ints, the code can be even simpler. This should be able to get you started.

#include <cstdint>
#include <iostream>

// Only intended for storage, automatically promotes to 64-bit for evaluation
struct Int40
{
     Int40(int64_t x) { set(static_cast<uint64_t>(x)); } // implicit constructor
     operator int64_t() const { return get(); } // implicit conversion to 64-bit
private:
     void set(uint64_t x)
     {
          setb<0>(x); setb<1>(x); setb<2>(x); setb<3>(x); setb<4>(x);
     };
     int64_t get() const
     {
          return static_cast<int64_t>(getb<0>() | getb<1>() | getb<2>() | getb<3>() | getb<4>() | signx());
     };
     uint64_t signx() const
     {
          return (data[4] >> 7) * (uint64_t(((1 << 25) - 1)) << 39);
     };
     template <int idx> uint64_t getb() const
     {
          return static_cast<uint64_t>(data[idx]) << (8 * idx);
     }
     template <int idx> void setb(uint64_t x)
     {
          data[idx] = (x >> (8 * idx)) & 0xFF;
     }

     unsigned char data[5];
};

int main()
{
     Int40 a = -1;
     Int40 b = -2;
     Int40 c = 1 << 16;
     std::cout << "sizeof(Int40) = " << sizeof(Int40) << std::endl;
     std::cout << a << "+" << b << "=" << (a+b) << std::endl;
     std::cout << c << "*" << c << "=" << (c*c) << std::endl;
}

Here is the link to try it live: http://rextester.com/QWKQU25252

Stefan Atev
  • 485
  • 3
  • 11
  • Agreed with @Andreas, this is straightforward with predictable codegen, unlike the answers that use bit fields or rely on compiler-specific packing. [Here](http://coliru.stacked-crooked.com/a/9378428695d88a58) is a `constexpr`-ified C++17 implementation. – ildjarn Jun 26 '16 at 07:31
16

You can use a bit-field structure, but it's not going to save you any memory:

struct my_struct
{
    unsigned long long a : 40;
    unsigned long long b : 24;
};

You can squeeze any multiple of 8 such 40-bit variables into one structure:

struct bits_16_16_8
{
    unsigned short x : 16;
    unsigned short y : 16;
    unsigned short z :  8;
};

struct bits_8_16_16
{
    unsigned short x :  8;
    unsigned short y : 16;
    unsigned short z : 16;
};

struct my_struct
{
    struct bits_16_16_8 a1;
    struct bits_8_16_16 a2;
    struct bits_16_16_8 a3;
    struct bits_8_16_16 a4;
    struct bits_16_16_8 a5;
    struct bits_8_16_16 a6;
    struct bits_16_16_8 a7;
    struct bits_8_16_16 a8;
};

This will save you some memory (in comparison with using 8 "standard" 64-bit variables), but you will have to split every operation (and in particular arithmetic ones) on each of these variables into several operations.

So the memory-optimization will be "traded" for runtime-performance.

barak manos
  • 29,648
  • 10
  • 62
  • 114
  • @barakmanos: Are you certain that your new version does any better? – Ben Voigt Dec 30 '14 at 20:58
  • @BenVoigt: On VC2013 it does. What I'm not 100% certain of is whether it does so according to the language standard, or if it is compiler dependent. If the latter is the case, then a `#pragma pack` should do "the rest of the job". By the way, there are other issues here, such as `CHAR_BIT` which can theoretically be more than 8, or `sizeof(short)` which can theoretically be 1 (for example, if `CHAR_BIT` is 16). I preferred to keep the answer simple and easy to read, rather than pointing out all these corner cases. – barak manos Dec 30 '14 at 21:02
  • 1
    @MarcGlisse and by 64 you mean 8, because `sizeof` counts bytes. – user253751 Dec 31 '14 at 04:45
  • 1
    @Inverse: Thank you, but your edit of the first part made the opening statement of the second part **senseless**. In addition (and even worse), it was **wrong** - `sizeof(my_struct)` is not 5 bytes on every compiler (or possibly on any compiler). And in any case, you cannot instantiate an array of those structures that would reflect 5 bytes per entry. Please verify your changes before you commit them (in particular into answers of other users). – barak manos Dec 31 '14 at 20:36
  • @immibis No, I really meant 64, but that comment was posted before the edit (see the history if you want to see what it was about). – Marc Glisse Jan 05 '15 at 14:16
9

As the comments suggest, this is quite a task.

Probably an unnecessary hassle unless you want to save a lot of RAM - then it makes much more sense. (The RAM saving would be the sum total of bits saved across millions of long values stored in RAM.)

I would consider using an array of 5 bytes/chars (5 * 8 bits = 40 bits). Then you will need to shift the bits of your (overflowing int, hence a long) value into the byte array to store them.

To use a value, shift the bits back out into a long.

Then your RAM and file storage of the value will be 40 bits (5 bytes), BUT you must consider data alignment if you plan to use a struct to hold the 5 bytes. Let me know if you need elaboration on the bit shifting and the data alignment implications.

Similarly, you could use the 64 bit long, and hide other values (3 chars perhaps) in the residual 24 bits that you do not want to use. Again - using bit shifting to add and remove the 24 bit values.
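
For illustration, here is a minimal sketch of that 5-byte pack/unpack (unsigned values only, and the function names are my own):

#include <cstdint>
#include <cstddef>

// Value i occupies bytes [5*i, 5*i + 5) of a plain byte array.
void store40(unsigned char* bytes, std::size_t i, uint64_t value)
{
    unsigned char* p = bytes + 5 * i;
    for (int b = 0; b < 5; ++b)           // little-endian, byte by byte
        p[b] = (unsigned char)(value >> (8 * b));
}

uint64_t load40(const unsigned char* bytes, std::size_t i)
{
    const unsigned char* p = bytes + 5 * i;
    uint64_t value = 0;
    for (int b = 0; b < 5; ++b)
        value |= uint64_t(p[b]) << (8 * b);
    return value;
}

Signed values would additionally need bit 39 sign-extended on the way out, as some of the other answers here show.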

Grantly
  • 2,546
  • 2
  • 21
  • 31
6

If you have to deal with billions of integers, I'd try to encapsulate arrays of 40-bit numbers instead of single 40-bit numbers. That way, you can test different array implementations (e.g. an implementation that compresses data on the fly, or maybe one that stores less-used data to disk) without changing the rest of your code.

Here's a sample implementation (http://rextester.com/SVITH57679):

#include <cstdint>   // int64_t
#include <cstdlib>   // malloc, free
#include <cstring>   // memcpy

class Int64Array
{
    char* buffer;
public:
    static const int BYTE_PER_ITEM = 5;

    Int64Array(size_t s)
    {
        buffer=(char*)malloc(s*BYTE_PER_ITEM);
    }
    ~Int64Array()
    {
        free(buffer);
    }

    class Item
    {
        char* dataPtr;
    public:
        Item(char* dataPtr) : dataPtr(dataPtr){}

        inline operator int64_t()
        {
            int64_t value=0;
            memcpy(&value, dataPtr, BYTE_PER_ITEM); // Assumes little endian byte order!
            return value;
        }

        inline Item& operator = (int64_t value)
        {
            memcpy(dataPtr, &value, BYTE_PER_ITEM); // Assumes little endian byte order!
            return *this;
        }
    };   

    inline Item operator[](size_t index) 
    {
        return Item(buffer+index*BYTE_PER_ITEM);
    }
};

Note: The memcpy-conversion from 40-bit to 64-bit is not portable, as it assumes little-endian byte order. It should work on x86 platforms, though.

Note 2: Obviously, this is proof-of-concept code, not production-ready code. To use it in real projects, you'd have to add (among other things):

  • error handling (malloc can fail!)
  • copy constructor (e.g. by copying data, add reference counting or by making the copy constructor private)
  • move constructor
  • const overloads
  • STL-compatible iterators
  • bounds checks for indices (in debug build)
  • range checks for values (in debug build)
  • asserts for the implicit assumptions (little-endianness)
  • As it is, Item has reference semantics, not value semantics, which is unusual for operator[]; You could probably work around that with some clever C++ type conversion tricks

All of those should be straightforward for a C++ programmer, but they would make the sample code much longer without making it clearer, so I've decided to omit them.

Niki
  • 15,662
  • 5
  • 48
  • 74
  • @anatolyg: I've tried to summarize your points in Note 2. You're welcome to add to that list ;-) – Niki Jan 01 '15 at 13:03
6

I'll assume that

  1. this is C, and
  2. you need a single, large array of 40 bit numbers, and
  3. you are on a machine that is little-endian, and
  4. your machine is smart enough to handle alignment
  5. you have defined size to be the number of 40-bit numbers you need

unsigned char hugearray[5*size+3];  // +3 avoids overfetch of last element

__int64 get_huge(unsigned index)
{
    __int64 t;
    t = *(__int64 *)(&hugearray[index*5]);
    if (t & 0x0000008000000000LL)
        t |= 0xffffff0000000000LL;
    else
        t &= 0x000000ffffffffffLL;
    return t;
}

void set_huge(unsigned index, __int64 value)
{
    unsigned char *p = &hugearray[index*5];
    *(unsigned int *)p = (unsigned int)value;  // low 32 bits; assumes 32-bit unsigned int
    p[4] = (unsigned char)(value >> 32);       // top 8 bits
}

It may be faster to handle the get with two shifts.

__int64 get_huge(unsigned index)
{
    return (((*(__int64 *)(&hugearray[index*5])) << 24) >> 24);
}
phuclv
  • 37,963
  • 15
  • 156
  • 475
cmm
  • 319
  • 3
  • 11
  • 4
    Note that the code contains undefined behaviour, as `unsigned char` is not guaranteed to have correct alignment for `__int64`. On some platforms, such as x86-64, it probably won't have much impact on an unoptimized build (expect a performance hit), but on others, such as ARM, it's problematic. On optimized builds all bets are off, as the compiler is allowed to, for example, produce code using `movaps`. – Maciej Piechotka Dec 31 '14 at 06:44
  • 1
    Probably the simplest solution of all! – anatolyg Dec 31 '14 at 08:51
  • Sure, this looks sorta ugly in C with all the type-casting, but the resulting machine-code will be simple and fast. Your shift-version of get is highly likely faster as it doesn't branch. It could be optimised further by reading from 3 bytes before the number, thus saving the left-shift. – aaaaaaaaaaaa Dec 31 '14 at 22:58
  • 1
    You can leave it to the compiler to do efficient sign extension with [this way](https://graphics.stanford.edu/~seander/bithacks.html#FixedSignExtend). However this should be tested carefully because unaligned access may be very costly. Storing the 5th byte separately like in some other solutions might be better – phuclv Jan 01 '15 at 04:35
  • 1
    **You can use `memcpy` to portably express unaligned loads / stores**, without any strict-aliasing violations like that pointer-cast either. Modern compilers targeting x86 (or other platforms with efficient unaligned loads) will simply use an unaligned load or store. For example, here's (https://godbolt.org/g/3BFhWf) a hacked-up version of Damon's 40-bit integer C++ class which uses a `char value[5]` and compiles to the same asm as this with gcc for x86-64. (If you use the version that over-reads instead of doing separate loads, but that's also fairly good) – Peter Cordes Dec 17 '17 at 14:38
6

Another variation that may be helpful would be to use a structure:

typedef struct TRIPLE_40 {
  uint32_t low[3];
  uint8_t hi[3];
  uint8_t padding;
} TRIPLE_40;

Such a structure would take 16 bytes and, if 16-byte aligned, would fit entirely within a single cache line. While identifying which of the parts of the structure to use may be more expensive than it would be if the structure held four elements instead of three, accessing one cache line may be much cheaper than accessing two. If performance is important, one should use some benchmarks, since some machines may perform a divmod-3 operation cheaply and have a high cost per cache-line fetch, while others might have cheaper memory access and more expensive divmod-3.
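
As a rough sketch of such accessors (the names are mine, and two's complement is assumed for the sign extension), element i lives in structure i/3, slot i%3:

#include <cstdint>
#include <cstddef>

int64_t triple_get(const TRIPLE_40* arr, std::size_t i)
{
    const TRIPLE_40& t = arr[i / 3];      // which 16-byte structure
    const unsigned slot = i % 3;          // which of its three elements
    uint64_t u = (uint64_t(t.hi[slot]) << 32) | t.low[slot];
    if (u & (uint64_t(1) << 39))          // sign-extend bit 39
        u |= ~((uint64_t(1) << 40) - 1);
    return int64_t(u);
}

void triple_set(TRIPLE_40* arr, std::size_t i, int64_t v)
{
    TRIPLE_40& t = arr[i / 3];
    const unsigned slot = i % 3;
    t.low[slot] = uint32_t(v);
    t.hi[slot]  = uint8_t(uint64_t(v) >> 32);
}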

supercat
  • 77,689
  • 9
  • 166
  • 211
  • Note that divmod-3 would probably actually be done by multiplying. – SamB Dec 31 '14 at 06:14
  • @SamB: It would indeed typically be done best with some sort of multiplying, but that could vary between implementations. On something like a Cortex-M0, a divmod3 of an arbitrary 32-bit number would be a somewhat expensive and having do do completely separate fetches for the 32-bit parts and 40-bit parts of a number would be no problem. – supercat Dec 31 '14 at 20:05
5

For the case of storing some billions of 40-bit signed integers, and assuming 8-bit bytes, you can pack 8 40-bit signed integers in a struct (in the code below using an array of bytes to do that), and, since this struct is ordinarily aligned, you can then create a logical array of such packed groups, and provide ordinary sequential indexing of that:

#include <limits.h>     // CHAR_BIT
#include <stdint.h>     // int64_t
#include <stdlib.h>     // div, div_t, ptrdiff_t
#include <vector>       // std::vector

#define STATIC_ASSERT( e ) static_assert( e, #e )

namespace cppx {
    using Byte = unsigned char;
    using Index = ptrdiff_t;
    using Size = Index;

    // For non-negative values:
    auto roundup_div( const int64_t a, const int64_t b )
        -> int64_t
    { return (a + b - 1)/b; }

}  // namespace cppx

namespace int40 {
    using cppx::Byte;
    using cppx::Index;
    using cppx::Size;
    using cppx::roundup_div;
    using std::vector;

    STATIC_ASSERT( CHAR_BIT == 8 );
    STATIC_ASSERT( sizeof( int64_t ) == 8 );

    const int bits_per_value    = 40;
    const int bytes_per_value   = bits_per_value/8;

    struct Packed_values
    {
        enum{ n = sizeof( int64_t ) };
        Byte bytes[n*bytes_per_value];

        auto value( const int i ) const
            -> int64_t
        {
            int64_t result = 0;
            for( int j = bytes_per_value - 1; j >= 0; --j )
            {
                result = (result << 8) | bytes[i*bytes_per_value + j];
            }
            const int64_t first_negative = int64_t( 1 ) << (bits_per_value - 1);
            if( result >= first_negative )
            {
                result = (int64_t( -1 ) << bits_per_value) | result;
            }
            return result;
        }

        void set_value( const int i, int64_t value )
        {
            for( int j = 0; j < bytes_per_value; ++j )
            {
                bytes[i*bytes_per_value + j] = value & 0xFF;
                value >>= 8;
            }
        }
    };

    STATIC_ASSERT( sizeof( Packed_values ) == bytes_per_value*Packed_values::n );

    class Packed_vector
    {
    private:
        Size                    size_;
        vector<Packed_values>   data_;

    public:
        auto size() const -> Size { return size_; }

        auto value( const Index i ) const
            -> int64_t
        {
            const auto where = div( i, Packed_values::n );
            return data_[where.quot].value( where.rem );
        }

        void set_value( const Index i, const int64_t value ) 
        {
            const auto where = div( i, Packed_values::n );
            data_[where.quot].set_value( where.rem, value );
        }

        Packed_vector( const Size size )
            : size_( size )
            , data_( roundup_div( size, Packed_values::n ) )
        {}
    };

}    // namespace int40

#include <iostream>
auto main() -> int
{
    using namespace std;

    cout << "Size of struct is " << sizeof( int40::Packed_values ) << endl;
    int40::Packed_vector values( 25 );
    for( int i = 0; i < values.size(); ++i )
    {
        values.set_value( i, i - 10 );
    }
    for( int i = 0; i < values.size(); ++i )
    {
        cout << values.value( i ) << " ";
    }
    cout << endl;
}
Cheers and hth. - Alf
  • 142,714
  • 15
  • 209
  • 331
  • I think you're assuming 2's complement for sign-extension. I think it breaks with sign/magnitude, but might work with 1's complement. Anyway, for 2's complement it would probably be easier and more efficient to ask the compiler to sign-extend the last byte to 64 bits for you, then OR in the low half. (Then x86 compilers could use a `movsx` byte-load, a shift, and then OR in the low 32 bits. Most other architectures also have sign-extending narrow loads.) You're already depending on the implementation-defined behaviour of shifting negative numbers to do what you want. – Peter Cordes Dec 17 '17 at 14:04
  • @PeterCordes: Thanks, there's an unmentioned assumption of two's complement form in there, yes. Don't know why I relied on that. Puzzling. – Cheers and hth. - Alf Dec 17 '17 at 14:42
  • I wouldn't sacrifice efficiency just to make it portable to platforms nobody will ever use it on. But if possible use `static_assert` to check the semantics you rely on. – Peter Cordes Dec 17 '17 at 14:44
3

Yes, you can do that, and it will save some space for large quantities of numbers.

You need a class that contains a std::vector of an unsigned integer type.

You will need member functions to store and to retrieve an integer. For example, if you want to store 64 integers of 40 bits each, use a vector of 40 integers of 64 bits each. Then you need a method that stores an integer with index in [0,64) and a method to retrieve such an integer.

These methods will execute some shift operations, and also some binary | and & .

I am not adding any more details here yet because your question is not very specific. Do you know how many integers you want to store? Do you know it during compile time? Do you know it when the program starts? How should the integers be organized? Like an array? Like a map? You should know all this before trying to squeeze the integers into less storage.

Hans Klünder
  • 2,176
  • 12
  • 8
  • 40*64 = 2560 bits can be reduced to lcm(40,64) = 320 bits per "block", i.e. 5 64-bit ints – deviantfan Dec 30 '14 at 12:39
  • 3
    `std::vector<>` is definitely not the way to go: it has a footprint of at least three pointers, i.e. 96 or 192 bits depending on architecture. That is *much* worse than the 64 bits of a `long long`. – cmaster - reinstate monica Dec 30 '14 at 13:36
  • 3
    Depends. One std::vector for 100000000 integers is fine. If we are to design small blocks as in another answer, std::vector would be a waste of space. – Hans Klünder Dec 30 '14 at 14:53
3

This begs for streaming in-memory lossless compression. If this is for a Big Data application, dense packing tricks are tactical solutions at best for what seems to require fairly decent middleware or system-level support. They'd need thorough testing to make sure one is able to recover all the bits unharmed. And the performance implications are highly non-trivial and very hardware-dependent because of interference with the CPU caching architecture (e.g. cache lines vs. packing structure). Someone mentioned complex meshing structures: these are often fine-tuned to cooperate with particular caching architectures.

It's not clear from the requirements whether the OP needs random access. Given the size of the data it's more likely one would only need local random access on relatively small chunks, organised hierarchically for retrieval. Even the hardware does this at large memory sizes (NUMA). Like lossless movie formats show, it should be possible to get random access in chunks ('frames') without having to load the whole dataset into hot memory (from the compressed in-memory backing store).

I know of one fast database system (kdb from KX Systems, to name one, but I know there are others) that can handle extremely large datasets by seamlessly memory-mapping large datasets from backing store. It has the option to transparently compress and expand the data on the fly.

GuyB
  • 61
  • 4
3

There are quite a few answers here covering implementation, so I'd like to talk about architecture.

We usually expand 32-bit values to 64-bit values to avoid overflowing because our architectures are designed to handle 64-bit values.

Most architectures are designed to work with integers whose size is a power of 2 because this makes the hardware vastly simpler. Tasks such as caching are much simpler this way: there are a large number of divisions and modulus operations which can be replaced with bit masking and shifts if you stick to powers of 2.
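
For instance (an illustrative sketch of my own, not from any particular hardware manual), with power-of-2 sizes the address arithmetic degenerates into shifts and masks, whereas 5-byte elements would need genuine multiplies and divides:

#include <cstdint>

// 64-byte cache lines: divide and modulo become a shift and a mask.
inline uint64_t line_of(uint64_t addr)   { return addr >> 6; }  // addr / 64
inline uint64_t offset_in(uint64_t addr) { return addr & 63; }  // addr % 64

// 5-byte (40-bit) elements: the index-to-byte math needs a genuine
// multiply (and the reverse a genuine divide), which hardware avoids
// in its own addressing.
inline uint64_t byte_of(uint64_t index)  { return index * 5; }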

As an example of just how much this matters, the C++11 specification defines multithreading race cases based on "memory locations". A memory location is defined in 1.7.3:

A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having non-zero width.

In other words, if you use C++'s bitfields, you have to do all of your multithreading carefully. Two adjacent bitfields must be treated as the same memory location, even if you wish computations across them could be spread across multiple threads. This is very unusual for C++, so likely to cause developer frustration if you have to worry about it.

Most processors have a memory architecture which fetches 32-bit or 64-bit blocks of memory at a time. Thus use of 40-bit values will have a surprising number of extra memory accesses, dramatically affecting run-time. Consider the alignment issues:

40-bit word to access:   32-bit accesses   64bit-accesses
word 0: [0,40)           2                 1
word 1: [40,80)          2                 2
word 2: [80,120)         2                 2
word 3: [120,160)        2                 2
word 4: [160,200)        2                 2
word 5: [200,240)        2                 2
word 6: [240,280)        2                 2
word 7: [280,320)        2                 1

On a 64 bit architecture, one out of every 4 words will be "normal speed." The rest will require fetching twice as much data. If you get a lot of cache misses, this could destroy performance. Even if you get cache hits, you are going to have to unpack the data and repack it into a 64-bit register to use it (which might even involve a difficult to predict branch).

It is entirely possible this is worth the cost

There are situations where these penalties are acceptable. If you have a large amount of memory-resident data which is well indexed, you may find the memory savings worth the performance penalty. If you do a large amount of computation on each value, you may find the costs are minimal. If so, feel free to implement one of the above solutions. However, here are a few recommendations.

  • Do not use bitfields unless you are ready to pay their cost. For example, if you have an array of bitfields, and wish to divide it up for processing across multiple threads, you're stuck. By the rules of C++11, the bitfields all form one memory location, so may only be accessed by one thread at a time (this is because the method of packing the bitfields is implementation defined, so C++11 can't help you distribute them in a non-implementation defined manner)
  • Do not use a structure containing a 32-bit integer and a char to make 40 bits. Most processors will enforce alignment and you won't save a single byte.
  • Do use homogeneous data structures, such as an array of chars or an array of 64-bit integers. It is far easier to get the alignment correct. (And you also retain control of the packing, which means you can divide an array up amongst several threads for computation if you are careful.)
  • Do design separate solutions for 32-bit and 64-bit processors, if you have to support both platforms. Because you are doing something very low level and very ill-supported, you'll need to custom tailor each algorithm to its memory architecture.
  • Do remember that multiplication of 40-bit numbers is different from multiplication of 64-bit expansions of 40-bit numbers reduced back to 40-bits. Just like when dealing with the x87 FPU, you have to remember that marshalling your data between bit-sizes changes your result.
Cort Ammon
  • 10,221
  • 31
  • 45
  • If your numbers are contiguous, (e.g. `struct { char val[5]; };` with memcpy), the multiple loads or stores will be to the *same* cache line. That is cheap (if you weren't bottlenecked on instruction or L1D throughput before) and won't cause extra cache misses, but will defeat auto-vectorization so you might not even keep up with memory for sequential access. (Typically you'd expect it to compile to a 32-bit + an 8-bit load, on targets that support unaligned loads. Modern x86 has low penalties for cache-line splits, although when a word load splits across a 4k page, the penalty is higher). – Peter Cordes Dec 17 '17 at 13:36
  • A pack/unpack strategy that involves a branch is possible but barely worth mentioning unless you manually get super low-level with `uintptr_t` and alignment checks / wide loads ([like you might consider in asm](https://stackoverflow.com/questions/47832367/how-to-move-3-bytes-24bits-from-memory-to-a-register)). Or were you talking about doing this on top of `uint64_t []` and using an `if` to figure out if you need only one load? That sounds like a bad idea vs. just using shifts to split / merge uint64_t to/from `uint32_t` and `uint8_t`, and memcpy or use a struct to group for alignment. – Peter Cordes Dec 17 '17 at 13:47
  • According to ISO C++11, you can delimit "a memory location" with a zero-width bitfield. I'm not sure the standard implies that an array of `struct __attribute__((packed)) { unsigned long long v:40; };` would really be a single giant memory location; but even if struct boundaries aren't memory location boundaries, you could use an `int end:0` to guarantee that ([modulo compiler bugs](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080)!, and https://stackoverflow.com/questions/47008183/race-condition-when-accessing-adjacent-members-in-a-shared-struct-according-to/47297375#47297375) – Peter Cordes Dec 17 '17 at 13:52
2

If what you really want is an array of 40 bit integers (which obviously you can't have), I'd just combine one array of 32 bit and one array of 8 bit integers.

To read a value x at index i:

uint64_t x = (((uint64_t) array8 [i]) << 32) + array32 [i];

To write a value x to index i:

array8 [i] = x >> 32; array32 [i] = x;

Obviously nicely encapsulated into a class using inline functions for maximum speed.

There is one situation where this is suboptimal, and that is when you do truly random access to many items, so that each access to an int array would be a cache miss - here you would get two cache misses every time. To avoid this, define a 32-byte struct containing an array of six uint32_t, an array of six uint8_t, and two unused bytes (42 2/3 bits per number); the code to access an item is slightly more complicated, but both components of the item are in the same cache line.

gnasher729
  • 51,477
  • 5
  • 75
  • 98