In my usual role of MPI-IO advocate, please consider MPI-IO for this kind of problem. You may be able to skip the gather entirely by having each process write to the file. Furthermore, if N is large, you do not want N file operations. Let MPI (either directly or through a library) make your life easier.
First off, is a text-based output format really what you (and your collaborators) want? If universe.out gets large enough that you want to read it in parallel, you are going to have a challenge decomposing the file across processors. Consider parallel HDF5 (phdf5), Parallel-NetCDF (pnetcdf), or any other of the higher-level libraries with a self-describing portable file format.
Here's an example of how you would write out all the x values then all the y values.
#include <stdio.h>
#include <mpi.h>
#include <unistd.h> //getpid
#include <stdlib.h> //srandom, random, malloc, free
#define MAX_PARTS 10
int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Offset nparts; // more than 2 billion particles possible in total
    int i;
    double *x, *y;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Offset start=0, x_end;
    MPI_Info info;
    MPI_File fh;
    MPI_Status status;

    srandom(getpid());
    /* demonstrate this works even when particles are not evenly
     * distributed over the processes: each rank gets 0..MAX_PARTS
     * particles (random() is not bounded by RAND_MAX on every platform,
     * so take a remainder instead of dividing by RAND_MAX) */
    nparts = random() % (MAX_PARTS + 1);
    x = malloc(nparts*sizeof(*x));
    y = malloc(nparts*sizeof(*y));
    for (i = 0; i < nparts; i++) {
        /* just some bogus data to see what happens */
        x[i] = rank*100+i;
        y[i] = rank*200+i;
    }

    /* not using this now. might tune later if needed */
    MPI_Info_create(&info);
    MPI_File_open(MPI_COMM_WORLD, "universe.out",
                  MPI_MODE_CREATE|MPI_MODE_WRONLY, info, &fh);
    /* etype is MPI_DOUBLE, so file offsets below count doubles, not bytes */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "native", info);

    MPI_Scan(&nparts, &start, 1, MPI_OFFSET, MPI_SUM, MPI_COMM_WORLD);
    /* MPI_Scan is an inclusive prefix reduction. On the last rank the
     * result is the total particle count, which the bcast below shares
     * with everyone as the end of the x region */
    x_end = start;
    /* remove our own contribution to get this rank's starting offset */
    start -= nparts;
    MPI_Bcast(&x_end, 1, MPI_OFFSET, nprocs-1, MPI_COMM_WORLD);

    /* the count argument is an int, so the per-process count must fit in one */
    MPI_File_write_at_all(fh, start, x, (int)nparts, MPI_DOUBLE, &status);
    MPI_File_write_at_all(fh, start+x_end, y, (int)nparts, MPI_DOUBLE, &status);

    MPI_Info_free(&info);
    MPI_File_close(&fh);
    free(x);
    free(y);
    MPI_Finalize();
    return 0;
}
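To see where each rank's data ends up, here is a serial sketch of the offset arithmetic that the MPI_Scan / MPI_Bcast pair performs. It makes no MPI calls. The helper name layout_offsets and the use of long long in place of MPI_Offset are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of MPI: given the particle count on each
 * rank, compute where each rank's x block and y block start in the file.
 * Offsets are measured in doubles, matching a file view whose etype is
 * MPI_DOUBLE. Returns the total particle count, i.e. the value the
 * example calls x_end. */
long long layout_offsets(const long long *counts, int nprocs,
                         long long *x_start, long long *y_start)
{
    long long total = 0;
    int r;

    assert(nprocs >= 0);
    for (r = 0; r < nprocs; r++) {
        /* exclusive prefix sum: the inclusive scan minus our own count */
        x_start[r] = total;
        total += counts[r];
    }
    for (r = 0; r < nprocs; r++) {
        /* the y region begins right after all of the x values */
        y_start[r] = total + x_start[r];
    }
    return total;
}
```

With counts of 3, 0 and 5 particles on three ranks, the x blocks start at doubles 0, 3 and 3, the total is 8, and the y blocks start at 8, 11 and 11. A rank with no particles still takes part in the collective write, just with a count of zero. MPI_Exscan would give the exclusive prefix directly, but its result on rank 0 is undefined, so you would have to set start to 0 there yourself.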
Writing out the x array and y array as pairs of (x,y) values is possible, but
a bit more complicated (You would have to make an MPI datatype).
Parallel-NetCDF has an "operation combining" optimization that does this for
you. Parallel-HDF5 has a "multi-dataset i/o" optimization in the works for
the next release. With those optimizations you could define a 3d array, with
one dimension for x and y and the third for "particle identifier". Then you
could post the operation for the x values, the operation for the y values, and
let the library stitch all that together into one call.
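
If you want (x,y) pairs on disk but would rather not build an MPI datatype, one simpler alternative is to pack the pairs into a temporary buffer before writing. This is a sketch of a different technique, not the datatype approach mentioned above. The helper name interleave_xy is made up for illustration, and it assumes you can afford a second copy of the data in memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: pack separate x and y arrays into one buffer of
 * (x,y) pairs. The caller must provide room for 2*n doubles in out. */
void interleave_xy(const double *x, const double *y, size_t n, double *out)
{
    size_t i;

    assert(n == 0 || (x != NULL && y != NULL && out != NULL));
    for (i = 0; i < n; i++) {
        out[2*i]     = x[i];   /* even slots hold x */
        out[2*i + 1] = y[i];   /* odd slots hold y */
    }
}
```

With the same file view as in the example (etype MPI_DOUBLE), each rank would then write 2*nparts doubles from the packed buffer at offset 2*start in a single MPI_File_write_at_all call, and x_end is no longer needed. The price is the extra copy. A file datatype avoids it but takes more setup.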