
I have a problem creating a very big matrix on a Slurm cluster (the job is killed with an out-of-memory error). How can I fix this? The following code is the part that allocates the matrix:

double  **matrix;
int rows = 30000;
int cols = 39996;
matrix = (double**)malloc(sizeof(double*)*rows);
for (int i = 0; i < rows; i++)
    matrix[i] = (double*)malloc(sizeof(double)*cols);

for(int i=0; i<rows; i++)
    for(int j=0; j<cols; j++)
        matrix[i][j] = 1;

These values (rows, cols) are just an example; I can also have larger values. The following code is the part that deallocates the matrix:

for (int i = 0; i < rows; i++)
    free(matrix[i]);
free(matrix);

This is my output:

Slurmstepd: error: Detected 1 oom-kill event(s) in step 98584.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: lab13p1: task 1: Out Of Memory

Snake91
    `matrix` has to be a double pointer, typo? – anastaciu May 27 '20 at 16:36
  • It's a heavy piece of code, but there is nothing wrong with it apart from the first cast; at worst it should be `(double**)`, at best nothing, and include `stdlib.h`. It seems that you don't have enough memory. – anastaciu May 27 '20 at 17:03
  • You're trying to create a matrix with about 1.2 billion doubles in it (the row pointers are lost in the noise). That requires about 10 GiB of heap memory; the arithmetic is sketched after these comments. Do you have that much memory available? A 16 GiB machine is not very unusual, but 8 GiB machines are probably more common. – Jonathan Leffler May 27 '20 at 17:10
  • Aaah, the [out-of-fuel killer](https://lwn.net/Articles/104185/) abomination. Disable the OOM killer, disable memory overcommit, make sure you have enough swap space on disk, and never deal with something that indiscriminately kills processes again. – Andrew Henle May 27 '20 at 17:36
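The arithmetic can be checked in a few lines (a sketch; the sizes are the ones from the question, and per-allocation `malloc` bookkeeping overhead is ignored):

#include <stdio.h>

int main(void)
{
    size_t rows = 30000;
    size_t cols = 39996;

    size_t payload  = rows * cols * sizeof(double); /* 9,599,040,000 bytes */
    size_t pointers = rows * sizeof(double *);      /* 240,000 bytes on a 64-bit system */

    printf("payload : %zu bytes (%.1f GiB)\n",
           payload, payload / (1024.0 * 1024.0 * 1024.0)); /* ~8.9 GiB */
    printf("pointers: %zu bytes\n", pointers);
    return 0;
}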

2 Answers

  1. Change the declaration of `matrix` to a double pointer (maybe it's a typo):

double **matrix;

  2. You should verify the return value of `malloc`, especially with such a big matrix.

  3. Do not cast the result of `malloc`. See: Do I cast the result of malloc?

matrix = malloc(sizeof(double*)*rows);
if(!matrix) {
   // handle the error
}

for (int i = 0; i < rows; i++) {
    matrix[i] = malloc(sizeof(double)*cols);
    if(!matrix[i]) {
       // handle the error
    }
}
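
For a matrix this size the error handling matters: if one of the row allocations fails partway through, the rows already allocated should be freed before giving up, otherwise the failure path leaks memory. A minimal sketch (the helper name `alloc_matrix` is just an example):

#include <stdlib.h>

/* Allocate a rows x cols matrix; returns NULL on failure,
 * freeing any rows that were already allocated. */
double **alloc_matrix(int rows, int cols)
{
    double **m = malloc(sizeof(double *) * rows);
    if (!m)
        return NULL;

    for (int i = 0; i < rows; i++) {
        m[i] = malloc(sizeof(double) * cols);
        if (!m[i]) {
            while (i-- > 0)   /* roll back the rows allocated so far */
                free(m[i]);
            free(m);
            return NULL;
        }
    }
    return m;
}

The caller then has a single NULL check and can fail gracefully; note, though, that with memory overcommit enabled the kernel may still OOM-kill the process later, when the pages are first written.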

Hitokiri
  • Sorry, I copied my code wrong; the declaration was right. I added your check, but I receive the same output. – Snake91 May 27 '20 at 16:53
  • @Snake91 The first `malloc` (`matrix = (double)malloc(sizeof(double*)*rows);`) is a typo, no? – Hitokiri May 27 '20 at 16:57
  • @Hitokiri Generally a *cast* of the `malloc` result is just useless, but the one with `(double)` can modify the address; in fact it is strange that the compiler does not produce an error. All of that to say: maybe you can insist more in your answer on that potentially destructive cast. – bruno May 27 '20 at 17:12
  • @bruno Many thanks for your comment, it's helpful for me also. But we have to wait for the OP to verify whether the cast in the allocation of `matrix` is a typo or not (I asked him in the comment above). If it is not, I will add your point to the answer. – Hitokiri May 27 '20 at 17:20
  • Yes, sorry, it is another typo. I have corrected the code, but the out-of-memory error still occurs. I can't run the simulation when the matrix is too big. Is there an alternative way to fix the problem? – Snake91 May 27 '20 at 17:41
  • @Snake91: you need to fix the code so it will work correctly at smaller memory sizes. To get bigger sizes, you'll need more memory, or you'll have to quell the virulent OOM Killer (Out of Memory Killer). But be wary of too much over-commitment of virtual memory. That's a big array at 30k x 40k. You may have to redesign your code to use a smaller matrix somehow (one allocation variant is sketched after these comments). – Jonathan Leffler May 27 '20 at 17:45
  • @JonathanLeffler With a small matrix it works. Is it possible to disable this OOM killer? – Snake91 May 27 '20 at 17:54
  • See @AndrewHenle's [comment](https://stackoverflow.com/questions/62048131/out-of-memory-kill/62048275?noredirect=1#comment109743744_62048131) – Jonathan Leffler May 27 '20 at 17:56
  • @Snake91, I'm curious: why do you need such a big matrix? – Hitokiri May 27 '20 at 18:01
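
One allocation variant worth sketching (assuming a C99 compiler with variable-length array types): allocate the matrix as a single contiguous block. It does not shrink the roughly 9.6 GB payload, so by itself it won't avoid the OOM kill, but it replaces 30001 `malloc` calls with one, gives a single point of failure to check, and frees with one call:

#include <stdlib.h>

int main(void)
{
    size_t rows = 30000;
    size_t cols = 39996;

    /* one contiguous block; the pointer-to-VLA type keeps the m[i][j] syntax */
    double (*m)[cols] = malloc(sizeof(double[rows][cols]));
    if (!m)
        return 1;   /* single point of failure to handle */

    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            m[i][j] = 1;

    free(m);        /* one free instead of rows + 1 */
    return 0;
}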

Looks like you're running on Slurm, so probably on a cluster with shared access.

It's possible that the cluster management limits the amount of memory per job and per CPU.

Check the memory limits in the docs for your cluster. You can also see some limits in the config with `scontrol show config`. Look for settings like `MaxMemPerCPU`, `MaxMemPerNode`, and `DefMemPerCPU`.

Maybe it just uses the last one (`DefMemPerCPU`) to set a default per job, and you can change it in your launch commands (`srun`, `sbatch`) with `--mem-per-cpu=8G`.

You can also see what settings your jobs are running with using the `squeue` command. Look at the man page for the `-o` option; there you can add output fields (`-o "%i %m"` shows the job ID and the amount of memory it can run with).
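
For example, a submission script that raises the per-CPU request might look like this (a sketch; the program name and the 8G figure are placeholders, and whether the request is granted depends on the cluster's limits):

#!/bin/bash
#SBATCH --job-name=big-matrix
#SBATCH --ntasks=2
#SBATCH --mem-per-cpu=8G    # request 8 GiB per CPU instead of the cluster default

srun ./my_matrix_program

The same option works for a direct launch: `srun --mem-per-cpu=8G ./my_matrix_program`.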

nab