
I'm running a simulation with 20000 sweeps, and for every sweep I need to save a very large array of integers, each either 1 or -1, to a data file. The total number of integers written can reach 1e11.

My problem is that the data file becomes huge (up to 50 GB).

I'm currently using `fprintf` with the `int` datatype.

Does anyone know a way to save these values without taking up that much disk space?

Thanks in advance

kaylum
Sara Awwad
  • If the integer can only take on two values, then you can fit each entry into a single bit using techniques such as [this one](https://stackoverflow.com/questions/18722460/correct-way-to-convert-8-bools-into-1-byte) and binary (rather than text) file handling. With this technique, 1e11 bits will require [12.5 GB](https://www.wolframalpha.com/input?i=%281e11%29+bits+to+gigabytes), which is the best you can do without finding specific patterns. Please take a look at these (or other techniques you're considering) and [edit] your question with additional detail if you end up needing help with them. – nanofarad Jun 20 '22 at 21:58
  • Another option that may be viable with that data set is to apply simple [run length encoding](https://en.wikipedia.org/wiki/Run-length_encoding). – kaylum Jun 20 '22 at 22:02
  • Note that your constraint is *disk space*, not *memory*. – Nate Eldredge Jun 20 '22 at 22:08
  • Similar to RLE is if there are relatively few of one value, just save a list of their indexes after the first two entries which says how many, and which value. – Weather Vane Jun 20 '22 at 22:09
  • @WeatherVane RLE doesn't work in my case because I need the indices of the values and keeping a list won't help either because, in the end, I will have to print them to a data folder that will be used in matlab, which will lead me to print the same amount of data, but thanks for the tip :) – Sara Awwad Jun 20 '22 at 22:36
  • @NateEldredge Thanks for the heads up, I just changed it – Sara Awwad Jun 20 '22 at 22:36
  • @nanofarad would this work in C? I think the link you attached answers a different question, could you please check if its the right link? – Sara Awwad Jun 20 '22 at 22:41
  • @SaraAwwad dividing data into chunks (like 1 KB ... 1 MB ... 1 GB depending on your constraints) you can decompress into memory (basically a cache) on demand would allow using RLE encoding on the chunks while also being able to use indexing efficiently. – hyde Jun 20 '22 at 22:43
  • @SaraAwwad Sorry, that is indeed the wrong language (I was inattentive while searching). [this](https://stackoverflow.com/a/8461163/1424875) is for C++, but you should be able to use the first two code samples with some small modifications (and `#include ` since they don't make use of C++-specific constructs. – nanofarad Jun 20 '22 at 22:48
  • "*RLE doesn't work in my case because I need the indices of the values*". It's not clear what you mean by that. RLE does not prevent you from calculating the index of any of the data values. – kaylum Jun 20 '22 at 22:49
  • One bit per number + compression algorithm? zstd or gzip or whatever – Shawn Jun 20 '22 at 23:18
  • @kaylum I'm sorry, I meant something else. RLE requires the data to be repetitive to work (i.e. it works if I have something like [11111-1-1-1-111111]), but it makes the dataset bigger if I have too many singular data points (i.e. something like [1-11-1-11-111]), which is why using it would only make things worse. This is also explained in the wiki page shared with me above. – Sara Awwad Jun 21 '22 at 17:01
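
To put rough numbers on the RLE versus one-bit-per-value question raised in the comments, here is a minimal sketch in C. The function names and the fixed-width run counter (`bytes_per_run`) are assumptions made for illustration; real RLE formats vary, so treat this as an estimate rather than a definitive comparison.

```c
#include <stddef.h>

/* Count runs of equal consecutive values in v[0..n-1]. */
size_t count_runs(const signed char *v, size_t n)
{
    if (n == 0) return 0;
    size_t runs = 1;
    for (size_t i = 1; i < n; i++)
        if (v[i] != v[i - 1]) runs++;
    return runs;
}

/* Hypothetical RLE layout: one fixed-width counter per run. */
size_t rle_bytes(size_t runs, size_t bytes_per_run)
{
    return runs * bytes_per_run;
}

/* One bit per value, rounded up to whole bytes. */
size_t bitpacked_bytes(size_t n)
{
    return (n + 7) / 8;
}
```

Under this assumed layout with 4-byte counters, RLE only beats bit packing when the average run is longer than 32 values, so counting runs on a sample sweep is a cheap way to decide between the two.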

1 Answer

  1. If your numbers are only -1 and 1, each one can be represented in a single bit. If the number 0 can also occur, you will need one additional bit per value to represent the 3 values (2 bits can actually hold 4 distinct values).
  2. 1e11 values at one bit each take 12.5e9 bytes. If two bits are required, 25.0e9 bytes will be needed. That is not a problem for most modern computers. You may even have enough RAM to keep it all in memory, or you can memory-map the file.
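
The arithmetic in the list above can be checked with a small helper. This is only a sketch; `bytes_needed` is a name made up for this example.

```c
#include <stdint.h>

/* Bytes needed to store `count` values at `bits_per_value` bits each,
   rounded up to a whole byte. Assumes count * bits_per_value fits in 64 bits. */
uint64_t bytes_needed(uint64_t count, unsigned bits_per_value)
{
    return (count * bits_per_value + 7) / 8;
}
```

For 1e11 values this gives 12500000000 bytes at one bit per value and 25000000000 bytes at two bits per value.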

For a single bit per value:

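#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* size_t */

/* Read entry `index`: a set bit means 1, a clear bit means -1. */
int getVal1bit(const void *buff, size_t index)
{
    const unsigned char *ucbuff = buff;

    ucbuff += index / CHAR_BIT;

    return (*ucbuff & (1u << (index % CHAR_BIT))) ? 1 : -1;
}

/* Store `val` at entry `index`: a positive value sets the bit, anything else clears it. */
void putVal1bit(void *buff, size_t index, int val)
{
    unsigned char *ucbuff = buff;

    ucbuff += index / CHAR_BIT;

    if(val > 0) *ucbuff |= (1u << (index % CHAR_BIT));
    else *ucbuff &= ~(1u << (index % CHAR_BIT));
}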
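
To connect this to the question's file output, here is a hedged sketch of packing one sweep and writing it as raw bytes instead of text. `pack_sweep` and `write_sweep` are hypothetical helper names, and the convention (set bit means 1, clear bit means -1) is an assumption chosen to match `getVal1bit`. The file is assumed to be opened in binary mode, e.g. with `"wb"` or `"ab"`.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Pack n values (each 1 or -1) into out: a set bit means 1, a clear bit means -1.
   out must hold at least (n + CHAR_BIT - 1) / CHAR_BIT bytes. */
void pack_sweep(const int *vals, size_t n, unsigned char *out)
{
    memset(out, 0, (n + CHAR_BIT - 1) / CHAR_BIT);
    for (size_t i = 0; i < n; i++)
        if (vals[i] > 0)
            out[i / CHAR_BIT] |= (unsigned char)(1u << (i % CHAR_BIT));
}

/* Write the packed bytes for one sweep of n values to fp.
   Returns 0 on success, -1 on a short write. */
int write_sweep(FILE *fp, const unsigned char *packed, size_t n)
{
    size_t nbytes = (n + CHAR_BIT - 1) / CHAR_BIT;
    return fwrite(packed, 1, nbytes, fp) == nbytes ? 0 : -1;
}
```

Each sweep then costs one bit per value on disk, and the reader only needs to know the number of values per sweep to locate any entry by index.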
0___________