This is in C++. I timed the program with the Unix time command and found that copyij() was significantly faster.
void copyij(long int src[MSize][MSize], long int dst[MSize][MSize])
{
    long int i, j;
    for (i = 0; i < MSize; i++)
        for (j = 0; j < MSize; j++)
            dst[i][j] = src[i][j];
}
copyji() is similar, except that the j loop is the outer loop and the i loop is nested inside it.
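For anyone who wants to reproduce the comparison, here is a minimal sketch of both functions side by side with a small timing helper. It makes several assumptions that are not in the post: MSize is set to 1024 as a placeholder (the real value is not given), copyji() is reconstructed from the description above, and the buffers src_buf and dst_buf and the helpers fill(), equal() and time_ms() are hypothetical names added only for illustration.

```cpp
#include <cassert>
#include <chrono>

// Assumption: the post does not state MSize; 1024 is a placeholder.
constexpr long int MSize = 1024;

// As posted: i is the outer loop, j is the inner loop.
void copyij(long int src[MSize][MSize], long int dst[MSize][MSize])
{
    for (long int i = 0; i < MSize; i++)
        for (long int j = 0; j < MSize; j++)
            dst[i][j] = src[i][j];
}

// Reconstructed from the description: same body, loop order swapped.
void copyji(long int src[MSize][MSize], long int dst[MSize][MSize])
{
    for (long int j = 0; j < MSize; j++)
        for (long int i = 0; i < MSize; i++)
            dst[i][j] = src[i][j];
}

// Hypothetical buffers with static storage, so the large arrays
// do not live on the stack.
static long int src_buf[MSize][MSize];
static long int dst_buf[MSize][MSize];

// Hypothetical helper: gives every element of m a distinct value.
void fill(long int m[MSize][MSize])
{
    for (long int i = 0; i < MSize; i++)
        for (long int j = 0; j < MSize; j++)
            m[i][j] = i * MSize + j;
}

// Hypothetical helper: true if x and y hold the same values.
bool equal(long int x[MSize][MSize], long int y[MSize][MSize])
{
    for (long int i = 0; i < MSize; i++)
        for (long int j = 0; j < MSize; j++)
            if (x[i][j] != y[i][j])
                return false;
    return true;
}

// Hypothetical helper: runs one copy from src_buf to dst_buf and
// returns the elapsed wall-clock time in milliseconds.
template <typename Copy>
double time_ms(Copy copy)
{
    auto start = std::chrono::steady_clock::now();
    copy(src_buf, dst_buf);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A main() could call fill(src_buf), then print time_ms(copyij) and time_ms(copyji). Timing inside the program this way measures only the copy itself, whereas the time command also counts program startup and initialization. The numbers will depend on the compiler, optimization flags and machine, and an optimizing compiler may transform or remove a copy whose result is never read, so checking the destination afterwards (for example with equal()) is a reasonable precaution.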