
I have a function that generates numbers and stores them in a List<int>.
Now I need to store those results in files as fast as possible.

Here is my code so far:

private void Save_Click(object sender, EventArgs e)
{
    //this is just for tests
    List<int> myResults = Enumerable.Range(1, 50000000).ToList();
    const string dir = @"D:\TESTS";

    int fileCount = 1;
    var file = Path.Combine(dir, string.Format("{0}.csv", fileCount));
    var sw = new StreamWriter(file, false);
    int i = 0;

    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();

    foreach (int res in myResults.Shuffle())
    {
        sw.WriteLine(res);
        i++;
        if (i % 200000 != 0) continue;
        fileCount++;
        sw.Close();
        file = Path.Combine(dir, string.Format("{0}.csv", fileCount));
        sw = new StreamWriter(file, false);
    }

    sw.Close();
    stopwatch.Stop();

    label3.Text = string.Format("Save time(s): {0:0.##}", stopwatch.Elapsed.TotalSeconds);
}

Shuffle is an extension method taken from this answer.

public static class Extensions
{
    public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source, Random rng = null)
    {
        if (rng == null)
            rng = new Random();

        T[] elements = source.ToArray();
        for (int i = elements.Length - 1; i > 0; i--)
        {
            int swapIndex = rng.Next(i + 1);
            yield return elements[swapIndex];
            elements[swapIndex] = elements[i];
        }
        yield return elements[0];
    }
}

My problem is that saving takes about 5-7 minutes on my PC, and when I increase the number of results to 100 million I get an OutOfMemoryException.

How can I speed things up and eliminate that error?

Misiu
  • Are you getting the memory exception during the shuffle or in writing out to file? – Jason W Oct 13 '15 at 12:29
  • @Misiu what is the range you are passing into `Shuffle()` method in your code? Also do you see any change if instead of newing up the `StreamWriter` inside foreach, just call `dispose()` at the end of `foreach` and new it up at the start of loop? – Siva Gopal Oct 13 '15 at 12:34
  • @SivaGopal I need to shuffle all results, so I'm passing 50 million, but Shuffle uses yield. I'll add that function to my question – Misiu Oct 13 '15 at 12:41
  • @JasonW now that Shuffle uses `yield`, I don't get the exception for now. I'd like to optimize the saving to file by putting the StreamWriter inside a using statement, but I don't know how. – Misiu Oct 13 '15 at 12:44
  • 3
    Even though `Shuffle` uses `yield`, it still materializes the entire data set (see `source.ToArray()`). – Codo Oct 13 '15 at 12:48

2 Answers


The most problematic lines in your code are:

List<int> myResults = Enumerable.Range(1, 50000000).ToList();

and:

foreach (int res in myResults.Shuffle())

Avoid at all costs creating 50-100 million objects on the heap. Instead, generate the data continuously and write it to disk immediately, without keeping it in memory. Otherwise memory management and garbage collection become the bottleneck.
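To illustrate, here is a minimal sketch of that approach, keeping the 200,000-rows-per-file split from the question. The `Generate` method is a hypothetical stand-in for the real number source, and this sketch deliberately skips the shuffle, which is discussed separately in the comments:

```csharp
using System.Collections.Generic;
using System.IO;

static class StreamingSave
{
    // Hypothetical stand-in for the real number source: values are
    // produced one at a time and never collected into a List<int>.
    static IEnumerable<int> Generate(int count)
    {
        for (int i = 1; i <= count; i++)
            yield return i;
    }

    // Writes each value to disk as soon as it is produced, starting a
    // new CSV file every rowsPerFile rows, as in the question.
    public static void SaveInChunks(string dir, int total, int rowsPerFile)
    {
        int fileCount = 1, row = 0;
        var sw = new StreamWriter(Path.Combine(dir, fileCount + ".csv"), false);
        try
        {
            foreach (int res in Generate(total))
            {
                sw.WriteLine(res);
                if (++row % rowsPerFile != 0) continue;
                sw.Dispose();
                fileCount++;
                sw = new StreamWriter(Path.Combine(dir, fileCount + ".csv"), false);
            }
        }
        finally
        {
            sw.Dispose();
        }
    }
}
```

Memory use stays flat regardless of how many values are generated, because at no point does the full sequence exist in memory.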

And move the shuffling outside of the timed code. I'm pretty sure a considerable amount of the time is eaten by the shuffling.

So currently you are measuring the efficiency of .NET garbage collection and of the shuffling algorithm instead of what you really want to measure, namely how long it takes to write the CSV files.

Codo
  • I'm using C# :) myResults is created by another function; I added this line here to show runnable code that someone can copy and run. I use Shuffle to, well, shuffle the results. I don't need the timing; all I want is to cut the results into parts of 0.2 million each and save the parts to CSV as fast as possible. – Misiu Oct 13 '15 at 12:38
  • Sorry for the confusion regarding C# and Java. But my answer still holds. Your entire data set is materialized on the heap (by the shuffling extension). Therefore, you get the out of memory exception. And the shuffling takes a lot of time and is within the part timed by your stopwatch. – Codo Oct 13 '15 at 12:44
  • I've added the `Shuffle` code to my question. I'm using yield there, so I guess this will work fast, but I could be wrong. I must randomize my list, and only this method didn't throw an OutOfMemoryException for me. Any alternative is welcome. – Misiu Oct 13 '15 at 12:48
  • Your guess is wrong. Even though Shuffle uses yield, it still materializes the entire data set (see source.ToArray()). Also read the first paragraph of this [answer](http://stackoverflow.com/a/1665080/413337) to the same question. – Codo Oct 13 '15 at 12:49
  • Sorry for that, I didn't read that answer carefully enough. Can I fix Shuffle so that it works better? – Misiu Oct 13 '15 at 12:52
  • The shuffle function is very likely to return a row from the last 10% of the result set within the first 10 rows it returns. So how can it access that far into the result set without materializing it first? I don't think the shuffle can be rewritten so that it doesn't materialize the result set. Is the shuffling an essential part of your solution or just required for testing? – Codo Oct 13 '15 at 13:05
  • This is required by my solution. – Misiu Oct 13 '15 at 13:06
  • Then I propose you start a new question and ask for efficiently shuffling of huge data sets. Because that's what your problem is mainly about. – Codo Oct 13 '15 at 13:09
  • I've added this line: `var newResults = myResults.Shuffle();` and I iterate over `newResults`; this speeds things up. What about the large number of IO operations? Can I batch them somehow and put them in a using statement? – Misiu Oct 13 '15 at 13:10
  • Thanks for the suggestion. I'll ask a separate question about shuffling. – Misiu Oct 13 '15 at 13:11

I ran this code on my notebook without the shuffle method, and it took 22 seconds. So I think most of the time is probably going into that method.

I'd also suggest not creating the data before you use it, because that uses up a lot of memory. Create an enumerating method and yield return the data row by row.
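A minimal sketch of such a generator; the name, the count parameter, and the use of `Random` are placeholders for however the numbers are actually produced:

```csharp
using System;
using System.Collections.Generic;

static class ResultGenerator
{
    // Hypothetical generator: yields one value at a time instead of
    // building a List<int> up front, so memory use stays constant
    // no matter how large count is.
    public static IEnumerable<int> GenerateResults(int count)
    {
        var rng = new Random();
        for (int i = 0; i < count; i++)
            yield return rng.Next();
    }
}
```

A consumer can then `foreach` over `GenerateResults(...)` and write each row out as it arrives, without the full set ever existing in memory.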

You are also doing a lot of very small IO operations. Do fewer, larger writes instead: batch the writes to disk, using a StringBuilder or something similar to build larger chunks of data before writing. You could also construct the StreamWriter with a larger buffer size, or look into the BufferedStream class.
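For example, one way to batch the writes, assuming the same 200,000-rows-per-file layout as in the question: accumulate each file's rows in a StringBuilder and write the whole chunk to disk with a single call. The method name and parameters here are illustrative, not part of the question's code:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class BatchedSave
{
    // Sketch: build one file's worth of rows in memory as text, then
    // write it in a single IO operation instead of one write per row.
    public static void SaveBatched(IEnumerable<int> results, string dir, int rowsPerFile)
    {
        var sb = new StringBuilder();
        int fileCount = 1, row = 0;
        foreach (int res in results)
        {
            sb.Append(res).Append('\n');
            if (++row % rowsPerFile != 0) continue;
            File.WriteAllText(Path.Combine(dir, fileCount + ".csv"), sb.ToString());
            sb.Clear();
            fileCount++;
        }
        // Flush any remaining partial chunk.
        if (sb.Length > 0)
            File.WriteAllText(Path.Combine(dir, fileCount + ".csv"), sb.ToString());
    }
}
```

The trade-off is holding one chunk (here 200,000 rows of text) in memory at a time, which is far smaller than the full data set but still turns millions of tiny writes into a handful of large ones.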

Gordon Rudman
  • I can't write data on the fly as it is generated, because I also need to shuffle it. I'll search for an alternative. Thanks for the StringBuilder and buffering proposals :) – Misiu Oct 13 '15 at 12:55
  • Ok. I think the big thing then is keep the number of IO operations as low as possible. With traditional hard drives basically every IO operation will cause a HDD seek which could be up to about 10-15ms each. This isn't such a problem with SSDs, but it'll still slow things down a bit. – Gordon Rudman Oct 13 '15 at 13:34
  • Could you please show how I can fix my code to remove those unnecessary IO operations? – Misiu Oct 13 '15 at 13:45