What you want to achieve is essentially a matrix transpose followed by writing the result to a file.
The most efficient way to transpose a matrix is a complex question in its own right, and really depends on your architecture.
If you're not concerned with squeezing the last bit of performance out of your processor (or accelerator), I would go for a simple nested for loop, accumulating the data in an intermediate in-memory representation:
string[] lines = new string[NN[0].Count]; // assume all rows have equal length
for (int i = 0; i < NN.Count; ++i)
{
    for (int j = 0; j < NN[i].Count; ++j)
    {
        lines[j] += NN[i][j] + ((i == NN.Count - 1) ? "" : ",");
    }
}
File.WriteAllLines("path.csv", lines);
As a first optimization pass, I wouldn't recommend a list of lists, since element access through it is relatively expensive. A two-dimensional array does a better job:
int[,] NN = new int[3,6] {{1, 2, 3, 4, 5, 6 }, {2, 5, 6, 3, 1, 0}, {0, 9, 2, 6, 7, 8}};
string[] lines = new string[NN.GetLength(1)];
for (int i = 0; i < NN.GetLength(0); ++i)
{
    for (int j = 0; j < NN.GetLength(1); ++j)
    {
        lines[j] += NN[i, j] + ((i == NN.GetLength(0) - 1) ? "" : ",");
    }
}
File.WriteAllLines("path.csv", lines);
Here is a performance test for 500x500 elements (not counting the write to the file):

To improve this solution, I would first do the transpose entirely in memory (without touching files or strings), then join each row with "," and write everything to the file as a single byte array.
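A minimal sketch of that idea (class and variable names are mine, and StringBuilder stands in for the "single byte array" step):

```csharp
using System;
using System.IO;
using System.Text;

class TransposeCsv
{
    static void Main()
    {
        int[,] NN = new int[3, 6] { { 1, 2, 3, 4, 5, 6 }, { 2, 5, 6, 3, 1, 0 }, { 0, 9, 2, 6, 7, 8 } };
        int rows = NN.GetLength(0), cols = NN.GetLength(1);

        // 1. Transpose purely in memory; no string work yet.
        int[,] T = new int[cols, rows];
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j)
                T[j, i] = NN[i, j];

        // 2. Build the whole CSV once, then write a single byte array.
        var sb = new StringBuilder();
        for (int j = 0; j < cols; ++j)
        {
            for (int i = 0; i < rows; ++i)
            {
                if (i > 0) sb.Append(',');
                sb.Append(T[j, i]);
            }
            sb.AppendLine();
        }
        File.WriteAllBytes("path.csv", Encoding.UTF8.GetBytes(sb.ToString()));
    }
}
```

This avoids the repeated string concatenations of the versions above (each `+=` on a string allocates a new string), which is usually the dominant cost once the matrix gets large.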
If you want to optimize further, there is always room for it :)
For example, on x86, depending on the instruction set available to you, you can read this article. On a CUDA-enabled device, you can read that.
In any case, a good solution will always involve aligned memory, sub-blocking, and close-to-the-metal code (intrinsics or assembly).
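To illustrate the sub-blocking idea in portable code (a cache-blocking sketch only, no intrinsics; the tile size is an assumption you would tune for your cache):

```csharp
using System;

class BlockedTranspose
{
    const int Block = 32; // tile size; tune to your L1 cache / line size

    // Transpose src (rows x cols) into dst (cols x rows) tile by tile,
    // so each tile of src and dst stays cache-resident while being touched.
    static void Transpose(int[,] src, int[,] dst)
    {
        int rows = src.GetLength(0), cols = src.GetLength(1);
        for (int ib = 0; ib < rows; ib += Block)
            for (int jb = 0; jb < cols; jb += Block)
            {
                int iMax = Math.Min(ib + Block, rows);
                int jMax = Math.Min(jb + Block, cols);
                for (int i = ib; i < iMax; ++i)
                    for (int j = jb; j < jMax; ++j)
                        dst[j, i] = src[i, j];
            }
    }

    static void Main()
    {
        var src = new int[500, 500];
        for (int i = 0; i < 500; ++i)
            for (int j = 0; j < 500; ++j)
                src[i, j] = i * 500 + j;

        var dst = new int[500, 500];
        Transpose(src, dst);
        Console.WriteLine(dst[3, 2] == src[2, 3]); // True
    }
}
```

The naive loop walks one of the two arrays with a large stride, missing cache on nearly every access; the blocked version trades that for two small tiles that both fit in cache.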