
I have a tricky situation here. I am trying to avoid out-of-memory exceptions when writing a large CSV dataset to an H5 file via the HDFDotNet API. The first pass through my loop works, but the second pass over a chunk of the same size throws an out-of-memory exception, even though the amount of memory in use should be well under the ~1.2 GB ceiling. I've determined the size of the chunks I want to read at a time and the size of the chunks I need to write at a time due to limitations with the API. The CSV file is about 105k lines long by 500 columns wide.

private void WriteDataToH5(H5Writer h5WriterUtil)
{
    int startRow = 0;
    int skipHeaders = csv.HasColumnHeaders ? 1 : 0;
    int readIntervals = (-8 * csv.NumColumns) + 55000;
    int numTaken = readIntervals;

    while (numTaken == readIntervals)
    {
        int timeStampCol = HasTimestamps ? 1 : 0;

        var readLines = File.ReadLines(this.Filepath)
            .Skip(startRow + skipHeaders).Take(readIntervals)
            .Select(s => s.Split(new char[] { ',' }).Skip(timeStampCol)
            .Select(x => Convert.ToSingle(x)).ToList()).ToList();

        //175k is the max number of cells that can be written at one time
        //(unconfirmed via the API docs; tested and it's definitely less than 200k, and 175k works)

        int writeIntervals = Convert.ToInt32(175000/csv.NumColumns);

        for (int i = 0; i < readIntervals; i += writeIntervals)
        {
            long[] startAt = new long[] { startRow, 0 };
            h5WriterUtil.WriteTwoDSingleChunk(readLines.Skip(i).Take(writeIntervals).ToList()
                , DatasetsByNamePair[Tuple.Create(groupName, dataset)], startAt);

            startRow += writeIntervals;
        }

        numTaken = readLines.Count;
        GC.Collect();
    }
}

I end up hitting my out-of-memory exception on the second pass through the ReadLines section:

var readLines = File.ReadLines(this.Filepath)
            .Skip(rowStartAt).Take(numToTake)
            .Select(s => s.Split(new char[] { ',' }).Skip(timeStampCol)
            .Select(x => Convert.ToSingle(x)).ToList()).ToList();

In this case, my readIntervals var comes out to 50992 and writeIntervals comes out to about 350. Thanks!
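For what it's worth, those numbers check out if csv.NumColumns is 501 (the 500 data columns plus the timestamp column, which is an assumption on my part):

```csharp
using System;

class IntervalCheck
{
    static void Main()
    {
        int numColumns = 501; // assumed: 500 data columns + 1 timestamp column
        int readIntervals = (-8 * numColumns) + 55000; // 50992, as quoted above
        int writeIntervals = 175000 / numColumns;      // 349 (integer division), i.e. "about 350"
        Console.WriteLine($"{readIntervals} {writeIntervals}");
    }
}
```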

DDushaj
  • Avoid parsing tasks whenever possible. See https://stackoverflow.com/questions/9642055/csv-parsing-options-with-net You'll have issues with escaped data, multiline data etc. Use someone else's work – Sten Petrov Jul 14 '16 at 17:56
  • I ended up figuring out the problem, it had to do with the way I was writing to the H5 file via the API. I had previously set the chunk size on it to {1,1} default and trying to write a much larger chunk was causing it to crash. Also, setting the compression level to between 6-8 helped alot too. – DDushaj Aug 01 '16 at 13:26

1 Answer


You do a lot of unnecessary allocations:

var readLines = File.ReadLines(this.Filepath)
            .Skip(rowStartAt).Take(numToTake)
            .Select(s => s.Split(new char[] { ',' }) //why do you need to split here?
             .Skip(timeStampCol)
            .Select(x => Convert.ToSingle(x)).ToList()).ToList(); //why call ToList() twice?

File.ReadLines returns an enumerator, so simply iterate over it: split each line as you go, skip the required column, and pull out only the values you need for saving.
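A minimal sketch of that streaming shape (the chunker below is illustrative, not part of the asker's API; each yielded chunk can then be passed to WriteTwoDSingleChunk with a running row offset, so only one write-sized chunk is ever held in memory):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CsvChunker
{
    // Lazily parse lines and yield write-sized chunks of rows.
    // skipCols drops leading columns (e.g. a timestamp column).
    public static IEnumerable<List<List<float>>> ReadChunks(
        IEnumerable<string> lines, int skipCols, int chunkRows)
    {
        var buffer = new List<List<float>>(chunkRows);
        foreach (var line in lines)
        {
            buffer.Add(line.Split(',')
                           .Skip(skipCols)
                           .Select(float.Parse)
                           .ToList());
            if (buffer.Count == chunkRows)
            {
                yield return buffer;
                buffer = new List<List<float>>(chunkRows);
            }
        }
        if (buffer.Count > 0)
            yield return buffer; // final partial chunk
    }
}
```

Usage would look like iterating `CsvChunker.ReadChunks(File.ReadLines(path).Skip(headerRows), timeStampCol, writeIntervals)` and writing each chunk as it arrives, incrementing startRow by the chunk's row count.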

As for the out-of-memory exception while still using less than 1.2 GB of memory, consider the following:

  1. You may try to compile for x64 (but still re-architect your code first!)
  2. Regardless of what you do, there is still a limit on the size of a single collection, which is 2 GB.
  3. You may be allocating more than the stack can offer you, which is 1 MB for 32-bit processes and 4 MB for 64-bit processes. See: Why is stack size in C# exactly 1 MB?
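Regarding point 2: on .NET 4.5+ running as x64, the 2 GB single-object limit can be raised via app.config, though streaming the file instead of materializing it remains the better fix:

```xml
<!-- app.config: allows objects larger than 2 GB on x64 (.NET 4.5+) -->
<configuration>
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>
```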
Tigran
  • I need to split each value from the comma to break up the csv into "cells" in a 2D structure to write the H5 file, which stores datasets and takes in the structure to write as 2D. – DDushaj Jul 14 '16 at 17:50
  • Can't you split just before writing into hdf5 ? – Tigran Jul 14 '16 at 17:52
  • Yes, but would that provide any advantages? It still has to be done at one point or another – DDushaj Jul 14 '16 at 17:58
  • It will do it for the current line only, as it is split and saved, not for all lines in the file at once. – Tigran Jul 14 '16 at 18:02