
I have a C program that takes a very large file (anywhere from 5GB to 65GB), transposes the data in the file, and then writes the transposed data out to other files. In total, the result files are approximately 30 times larger due to the transformation. I am using Open MPI, so each processor writes to its own file.

Each processor writes the first ~18 GB of data to its own results file very quickly. At that point, however, the program slows to a crawl and the %CPU reported by top drops drastically from ~100% to 0.3%.

Can anyone suggest a reason for this? Am I reaching some system limit?

Code:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>


unsigned long long impute_len=0;

void write_results(unsigned long long, unsigned long long, int);



int main(int argc, char **argv){

    // the impute output
    FILE *impute_fp = fopen("infile.txt", "r");

    // find input file length
    fseek(impute_fp, 0, SEEK_END);
    impute_len=ftell(impute_fp);


    //mpi magic - hopefully!
    MPI_Status status;
    int proc_id, num_procs, ierr;
    unsigned long long tot_recs, recs_per_proc,
        root_recs, start_byte, end_byte, start_recv, end_recv;


    // Now replicate this process to create parallel processes.
    ierr = MPI_Init(&argc, &argv);


    //find out process ID, and how many processes were started. 
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &proc_id);
    ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    if(proc_id == 0){

        tot_recs = impute_len/54577;    //54577 is length of each line
        recs_per_proc = tot_recs/num_procs;

        if(tot_recs % num_procs != 0){
            recs_per_proc=recs_per_proc+1;
            root_recs = tot_recs-(recs_per_proc*(num_procs-1));

        }else{
            root_recs = recs_per_proc;
        }


        //distribute a portion to each child process 
        int z=0;
        for(int x=1; x<num_procs; x++){
            start_byte = ((root_recs*54577))+(z*(recs_per_proc*54577));
            end_byte = ((root_recs*54577))+((z+1)*(recs_per_proc*54577));

            ierr = MPI_Send(&start_byte, 1 , MPI_UNSIGNED_LONG_LONG, x, 0, MPI_COMM_WORLD);

            ierr = MPI_Send(&end_byte, 1 , MPI_UNSIGNED_LONG_LONG, x, 0, MPI_COMM_WORLD);

            z++;
        }


        //root proc bit of work
        write_results(0, (root_recs*54577), proc_id);


    }else{
        //must be a slave process

        ierr = MPI_Recv(&start_recv, 1, MPI_UNSIGNED_LONG_LONG, 0, 0, MPI_COMM_WORLD, &status);

        ierr = MPI_Recv(&end_recv, 1, MPI_UNSIGNED_LONG_LONG, 0, 0, MPI_COMM_WORLD, &status);

        //Write my portion of file
        write_results(start_recv, end_recv, proc_id);

    }

    ierr = MPI_Finalize();
    fclose(impute_fp);

    return 0;
}


void write_results(unsigned long long start, unsigned long long end, int proc_id){  

    // ...logic to transpose the assigned byte range and write it out
    // to this process's own results file (results_fp) goes here...

    fclose(results_fp);
}
  • First, the file is written to the memory cache (fast) and then the cache gets full and the file is written to disk (slow). – Gilles Gouaillardet Mar 26 '18 at 15:18
  • @GillesGouaillardet On my machine, each CPU has a cache size of 20480 KB. The slow down happens when the results files reach approx 18 or 19 GB in size. Surely the massive reduction in %CPU would occur a lot sooner if this was the reason? – daragh Mar 26 '18 at 15:42
  • The slowdown [probably] isn't at cache size/limit but at RAM size/limit. And, for this amount of data, you may want to use `read(2)` [vs `fread`] and possibly `mmap` the file. Also, how many processes are you using and how many cores do you have? – Craig Estey Mar 26 '18 at 15:49
  • Hi @CraigEstey, I think you are correct. It seems to be a limit in RAM. I am using 4 processes and I have 32 cores. I am using fread, but I am also reading in the file in "chunks" of 0.5GB at a time. When I run the program without writing to the output files it runs fast, so it seems to be related to writing rather than reading? – daragh Mar 26 '18 at 16:16
  • Optimal block/transfer size for FS is (e.g.) 64KB. If you use a large read to memory, during that time, you can't be writing (i.e. that (first) read time is "dead" time when you could be writing--particularly if you write to a _different_ physical disk). See my answer: https://stackoverflow.com/questions/37172740/how-does-mmap-improve-file-reading-speed/37173063#37173063 (and follow the links to my other two answers as well) for ways to speed up file copying/transfer. – Craig Estey Mar 26 '18 at 16:51
  • BTW, if all your CPUs/cores are on the same motherboard, you could just do a `mmap` of the input file, and use different _threads_ (i.e. dispense with `mpi` altogether). Also, to get file size, consider `stat(2)` vs the `fseek/ftell`. Further, number of processes/threads is a tuning parameter. Eventually, you'll reach disk transfer saturation or memory bus saturation (the latter [per a cppcon talk] at about 4 threads, depending on memory bus topology). – Craig Estey Mar 26 '18 at 16:58
  • @daragh I meant [Disk buffer](https://en.wikipedia.org/wiki/Disk_buffer), sorry for the confusion – Gilles Gouaillardet Mar 26 '18 at 23:43
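
To illustrate Gilles' point about the page cache, here is a minimal, hypothetical sketch (Linux/POSIX assumed; the file name, chunk size, and flush interval are all made up, not taken from the question) of one way to keep the amount of dirty cached data bounded while writing a large results file: force writeback every few hundred megabytes with fdatasync and then drop those pages with posix_fadvise(POSIX_FADV_DONTNEED), instead of letting ~18 GB of dirty pages accumulate before the kernel stalls the writer.

/* Hypothetical sketch: keep dirty page-cache data bounded while writing
 * a large results file.  Assumes Linux/POSIX; file name, chunk size and
 * flush interval are illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK       (4UL * 1024 * 1024)     /* write in 4 MB chunks         */
#define FLUSH_EVERY (256UL * 1024 * 1024)   /* force writeback every 256 MB */

int main(void){
    int fd = open("results_rank0.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if(fd < 0){ perror("open"); return 1; }

    char *buf = malloc(CHUNK);
    if(!buf){ perror("malloc"); return 1; }
    memset(buf, 'A', CHUNK);                /* stand-in for transposed data */

    unsigned long long written = 0, last_flush = 0;
    for(int i = 0; i < 1024; i++){          /* ~4 GB total in this toy loop */
        if(write(fd, buf, CHUNK) != (ssize_t)CHUNK){ perror("write"); return 1; }
        written += CHUNK;

        if(written - last_flush >= FLUSH_EVERY){
            /* push the dirty pages to disk now ...                        */
            fdatasync(fd);
            /* ... and tell the kernel we will not re-read them, so the
             * page cache does not fill up with this file's data           */
            posix_fadvise(fd, last_flush, written - last_flush, POSIX_FADV_DONTNEED);
            last_flush = written;
        }
    }

    fdatasync(fd);
    close(fd);
    free(buf);
    return 0;
}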
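
And a minimal, hypothetical sketch of the stat/mmap approach Craig suggests for the input side: get the file size with fstat() instead of fseek/ftell, and mmap() the input read-only so each process can index its byte range directly. The file name and the 54577-byte record length come from the question; everything else is illustrative, not the questioner's code.

/* Hypothetical sketch: size the input with fstat() and map it read-only
 * with mmap() so pages are faulted in on demand, with no explicit 0.5 GB
 * fread "chunks". */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void){
    int fd = open("infile.txt", O_RDONLY);
    if(fd < 0){ perror("open"); return 1; }

    struct stat st;
    if(fstat(fd, &st) != 0){ perror("fstat"); return 1; }
    unsigned long long impute_len = (unsigned long long)st.st_size;

    /* map the whole input read-only                                      */
    const char *data = mmap(NULL, impute_len, PROT_READ, MAP_PRIVATE, fd, 0);
    if(data == MAP_FAILED){ perror("mmap"); return 1; }

    /* hint that access is sequential so the kernel can read ahead        */
    madvise((void *)data, impute_len, MADV_SEQUENTIAL);

    unsigned long long tot_recs = impute_len / 54577;   /* line length from the question */
    printf("records: %llu, first byte of record 0: %c\n",
           tot_recs, tot_recs ? data[0] : '?');

    munmap((void *)data, impute_len);
    close(fd);
    return 0;
}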

0 Answers