
I have a data file in CSV (Comma-Separated-Value) format that has about 50 million lines in it.

Each line is read into a string, parsed, and then used to fill in the fields of an object of type FOO. The object then gets added to a List(of FOO) that ultimately has 50 million items.

That all works and fits in memory (at least on an x64 machine), but it's SLOW. It takes about 5 minutes every time to load and parse the file into the list. How can I make it faster?

The important parts of the code are shown below.

Public Sub LoadCsvFile(ByVal FilePath As String)
    Dim s As IO.StreamReader = My.Computer.FileSystem.OpenTextFileReader(FilePath)

    'Find header line
    Dim L As String
    While Not s.EndOfStream
        L = s.ReadLine()
        If L = "" Then Continue While 'discard blank line
        Exit While
    End While
    'Parse data lines
    While Not s.EndOfStream
        L = s.ReadLine()
        If L = "" Then Continue While 'discard blank line          
        Dim T As FOO = FOO.FromCSV(L)
        Add(T)
    End While
    s.Close()
End Sub


Public Class FOO
    Public time As Date
    Public ID As UInt64
    Public A As Double
    Public B As Double
    Public C As Double

    Public Shared Function FromCSV(ByVal X As String) As FOO
        Dim T As New FOO
        Dim tokens As String() = X.Split(","c)
        If Not DateTime.TryParse(tokens(0), T.time) Then
            Throw New Exception("Could not convert CSV to FOO:  Invalid ISO 8601 timestamp")
        End If
        If Not UInt64.TryParse(tokens(1), T.ID) Then
            Throw New Exception("Could not convert CSV to FOO:  Invalid ID")
        End If
        If Not Double.TryParse(tokens(2), T.A) Then
            Throw New Exception("Could not convert CSV to FOO:  Invalid Format for A")
        End If
        If Not Double.TryParse(tokens(3), T.B) Then
            Throw New Exception("Could not convert CSV to FOO:  Invalid Format for B")
        End If
        If Not Double.TryParse(tokens(4), T.C) Then
            Throw New Exception("Could not convert CSV to FOO:  Invalid Format for C")
        End If
        Return T
    End Function
End Class

I did some benchmarking and here are the results.

  • The complete algorithm above took 314 seconds to load the whole file and put the objects into the list.
  • With the body of FromCSV() reduced to just returning a new object of type FOO with default field values, the whole process took 84 seconds. Therefore it appears that processing the line of text into the object fields is taking 230 seconds (73% of the total time).
  • Doing everything but parsing the ISO 8601 date string takes 175 seconds. Therefore it appears that processing the date string takes 139 seconds, which is 60% of the text processing time, just for that one field.
  • Just reading the lines in the file without any processing or object creation takes 41 seconds.
  • Using StreamReader.ReadBlock to read the whole file in chunks of about 1KB takes 24s, but it's a minor improvement in the grand scheme of things and probably not worth the added complexity. In order to use TryParse I would now need to manually create the temporary strings rather than using String.Split().

At this point the only path I see is to just display a status update to the user every few seconds so they don't wonder if the program is frozen.

UPDATE
I created two new functions. One saves the dataset from memory into a binary file using System.IO.BinaryWriter. The other loads that binary file back into memory using System.IO.BinaryReader. The binary versions were considerably faster than the CSV versions, and the binary files take up much less space.

Here are the benchmark results (same dataset for all tests):

  • LOAD CSV: 340s
  • SAVE CSV: 312s
  • SAVE BIN: 29s
  • LOAD BIN: 41s
  • CSV FILE SIZE: 3.86GB
  • BIN FILE SIZE: 1.63GB
user4574

  • Have you looked at [Microsoft Docs on Processing CSVs](https://learn.microsoft.com/en-us/dotnet/visual-basic/developing-apps/programming/drives-directories-files/how-to-read-from-comma-delimited-text-files)? – Jeff Zeitlin May 13 '21 at 14:18
  • [CsvHelper](https://joshclose.github.io/CsvHelper/) – Jimi May 13 '21 at 14:33
  • Maybe you can try parsing each line in a [Parallel.For Loop](https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/how-to-write-a-simple-parallel-for-loop). You might need to load all the lines into a List or array first though. – Étienne Laneville May 13 '21 at 15:59
  • Here's a similar question: [C# - Parallelizing While Loop with StreamReader causing High CPU](https://stackoverflow.com/questions/49217299/c-sharp-parallelizing-while-loop-with-streamreader-causing-high-cpu) – Étienne Laneville May 13 '21 at 16:05
  • 1) The code is doing the `.Split(",")` twice. 2) I agree with Jimi that using CsvHelper could help. 3) If you're going to have 50 million items, telling the user the line number that an exception happened on might be useful. – Andrew Morton May 13 '21 at 17:21
  • @AndrewMorton The extra String.Split() was a copy-paste error. I removed it from the question. Indicating the line number for errors is a good idea also, so I will do that. Manually opening the file and finding the line is another story since it's too big for Excel or most text editors. – user4574 May 13 '21 at 17:27
  • @user4574 I guessed that could be the case as you'd been experimenting with it. Most text editors will be able to open the file: EditPad and Notepad++ certainly can (and they both have a go to line number function). – Andrew Morton May 13 '21 at 17:36
  • @AndrewMorton I thought Notepad++ would be able to open it also, but the file is 3.8GB and Notepad++ v7.9.1 - 64-bit said it was too large to open when I tried. The only editor I have that was able to open it was emacs. And while emacs could open it, it was sort of laggy whenever I tried to edit or scroll the view. – user4574 May 13 '21 at 19:59
  • @user4574 I created a 50 million-line CSV file of data, with the fields you showed, for investigation. I noticed it was a little smaller than 4 GB - is 4 GB a limit for your program? – Andrew Morton May 13 '21 at 20:11
  • @user4574 EditPad Pro can open the file. Maybe there is a 64-bit version of Notepad++. You mention emacs - are you working under Linux? – Andrew Morton May 13 '21 at 20:13
  • @AndrewMorton I am not working under Linux, but I got a windows version of emacs specifically because it supports really huge files. I don't intend for my program to be limited to 4GB of data. I have 50 million lines now, but up to a few hundred million are likely use cases in the future. – user4574 May 13 '21 at 20:36
  • @user4574 A database would be a better option for querying that amount of data: is that an option for you? – Andrew Morton May 13 '21 at 20:39
  • @AndrewMorton adding database support in the future is worth looking into, but I am not going to attempt it at this point. – user4574 May 13 '21 at 20:45
  • @user4574 Are you using an HDD or SSD? The former often has completion times on the order of the next religious holiday compared to SSDs. – Andrew Morton May 13 '21 at 20:45
  • @AndrewMorton Right now I am working off of a reasonably good SSD. – user4574 May 13 '21 at 20:46
  • @user4574 Where does the data come from? Is using [Protocol Buffers](https://developers.google.com/protocol-buffers/docs/csharptutorial) to serialize/deserialize the data a possibility? – Andrew Morton May 13 '21 at 20:47
  • @JeffZeitlin Well I looked at the article you linked to and it had some good info. The TextFieldParser was really easy to use. The run time was 1100 seconds which is about 4X slower than my original implementation. – user4574 May 13 '21 at 20:47
  • @user4574 Do you have to read that amount of data every time your program starts, or is the data added to in increments? (Realistically, for that amount of data, you are already in the part of the future where a database is needed ;) – Andrew Morton May 13 '21 at 20:59
  • @AndrewMorton In this case I am using all 50 million lines every time I process the data. I could load it all at the beginning and process it, or I could load it incrementally as it processes. The second approach is a little bit better on memory and avoids making the user wait for a while up front (in exchange for making the processing apparently take longer). – user4574 May 13 '21 at 21:10
  • @user4574 Have you tried with `TryParseExact` and giving it the format that the date is in? It might make it slightly faster. Also, if the data is machine-generated, using `Parse` instead of `TryParse` could help, again only slightly. – Andrew Morton May 14 '21 at 07:52

1 Answer


I have a lot of experience with CSV, and the bad news is that you aren't going to be able to make this a whole lot faster. CSV libraries aren't going to be of much assistance here. The difficult problem with CSV, which libraries attempt to handle, is dealing with fields that contain embedded commas or newlines and therefore require quoting and escaping. Your dataset doesn't have this issue, since none of the columns are strings.

As you have discovered, the bulk of the time is spent in the parse methods. Andrew Morton had a good suggestion: using TryParseExact for DateTime values can be quite a bit faster than TryParse. My own CSV library, Sylvan.Data.Csv (which is the fastest available for .NET), uses an optimization where it parses primitive values directly out of the stream read buffer without converting them to string first (only when running on .NET Core), which can also speed things up a bit. However, I wouldn't expect it to be possible to cut the processing time in half while sticking with CSV.
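For illustration, here is a minimal sketch of the TryParseExact idea that could replace the DateTime.TryParse call in your FromCSV. The format string is an assumption on my part; it needs to match the exact layout of the timestamps in your file. (Requires System and System.Globalization.)

static DateTime ParseTimestamp(string token)
{
    // TryParseExact skips the format detection that TryParse performs for every value,
    // which is where a large share of the date-parsing time goes.
    if (!DateTime.TryParseExact(token, "yyyy-MM-ddTHH:mm:ss",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out var value))
    {
        throw new FormatException("Invalid ISO 8601 timestamp: " + token);
    }
    return value;
}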

Here is an example of using my library, Sylvan.Data.Csv, to process the CSV in C#.

static List<Foo> Read(string file)
{
    // estimate of the average row length based on Andrew Morton's 4GB/50m
    const int AverageRowLength = 80;

    var textReader = File.OpenText(file);
    // specifying the DateFormat will cause TryParseExact to be used.
    var csvOpts = new CsvDataReaderOptions { DateFormat = "yyyy-MM-ddTHH:mm:ss" };
    var csvReader = CsvDataReader.Create(textReader, csvOpts);

    // estimate number of rows to avoid growing the list.
    var estimatedRows = (int)(textReader.BaseStream.Length / AverageRowLength);
    var data = new List<Foo>(estimatedRows);

    while (csvReader.Read())
    {
        if (csvReader.RowFieldCount < 5) continue;
        var item = new Foo()
        {
            time = csvReader.GetDateTime(0),
            ID = csvReader.GetInt64(1),
            A = csvReader.GetDouble(2),
            B = csvReader.GetDouble(3),
            C = csvReader.GetDouble(4)
        };
        data.Add(item);
    }
    return data;
}

I'd expect this to be somewhat faster than your current implementation, so long as you are running on .NET Core. On .NET Framework the difference, if any, wouldn't be significant. However, I don't expect this to be acceptably fast for your users; it will still likely take tens of seconds, or minutes, to read the whole file.

Given that, my advice would be to abandon CSV altogether, which means you can abandon the parsing that is slowing things down. Instead, read and write the data in binary form. Your data records have a nice property, in that they are fixed width: each record contains 5 fields that are 8 bytes (64 bits) wide, so each record requires exactly 40 bytes in binary form. 50m x 40 = 2GB. So, assuming Andrew Morton's estimate of 4GB for the CSV is correct, moving to binary will halve the storage needs. Immediately, that means there is half as much disk IO needed to read the same data. But beyond that, you won't need to parse anything; the binary representation of each value is essentially copied directly into memory.

Here are some examples of how to do this in C# (I don't know VB very well, sorry).


static List<Foo> Read(string file)
{
    var stream = File.OpenRead(file);
    // the exact number of records can be determined by looking at the length of the file.
    var recordCount = (int)(stream.Length / 40);
    var data = new List<Foo>(recordCount);
    var br = new BinaryReader(stream);
    for (int i = 0; i < recordCount; i++)
    {
        var ticks = br.ReadInt64();
        var id = br.ReadInt64();
        var a = br.ReadDouble();
        var b = br.ReadDouble();
        var c = br.ReadDouble();
        var f = new Foo()
        {
            time = new DateTime(ticks),
            ID = id,
            A = a,
            B = b,
            C = c,
        };
        data.Add(f);
    }
    return data;
}

static void Write(List<Foo> data, string file)
{
    var stream = File.Create(file);
    var bw = new BinaryWriter(stream);
    foreach(var item in data)
    {
        bw.Write(item.time.Ticks);
        bw.Write(item.ID);
        bw.Write(item.A);
        bw.Write(item.B);
        bw.Write(item.C);
    }
    // flush and close so that all buffered data actually reaches the file
    bw.Flush();
    stream.Close();
}

This should almost certainly be an order of magnitude faster than a CSV-based solution. The question then becomes: is there some reason that you must use CSV? If the source of the data is out of your control and you must use CSV, I would then ask: will the data file change every time, or will it only be appended to with new data? If it is appended to, I would investigate a solution where, each time the app starts, you convert only the newly appended section of CSV data and add it to a binary file that you then load everything from. Then you only pay the cost of processing the new CSV data each time, and load everything quickly from the binary form.
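As a rough sketch of that incremental idea (the method name and file names are mine, and it assumes the CSV is strictly append-only; it reuses the 40-byte record layout and the binary Read method above):

// Requires System, System.IO and System.Collections.Generic.
static List<Foo> LoadWithBinaryCache(string csvFile, string binFile)
{
    // number of records already converted to binary on previous runs
    long cachedRecords = File.Exists(binFile) ? new FileInfo(binFile).Length / 40 : 0;

    using (var bin = new FileStream(binFile, FileMode.Append, FileAccess.Write))
    using (var bw = new BinaryWriter(bin))
    using (var reader = File.OpenText(csvFile))
    {
        reader.ReadLine(); // skip the header line
        long row = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0) continue;      // skip blank lines
            if (row++ < cachedRecords) continue; // already in the binary cache, skip parsing

            // parse and append only the new rows
            var t = line.Split(',');
            bw.Write(DateTime.Parse(t[0]).Ticks);
            bw.Write(long.Parse(t[1]));
            bw.Write(double.Parse(t[2]));
            bw.Write(double.Parse(t[3]));
            bw.Write(double.Parse(t[4]));
        }
    }
    // load everything, old and new, from the binary cache
    return Read(binFile); // the binary Read method shown above
}

Storing the number of already-converted records (or a byte offset into the CSV) somewhere explicit would avoid re-reading the old lines entirely, but the idea is the same.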

This could be made even faster by creating a fixed-layout struct (Foo), allocating an array of them, and using span-based trickery to read the array data directly from the FileStream. This can be done because all of your data elements are "blittable". It would be the absolute fastest way to load this data into your program. Start with the BinaryReader/BinaryWriter approach, and if you find that still isn't fast enough, then investigate this.
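Purely as an illustration of that last idea, here is what the span-based version might look like on .NET Core. The FooRecord struct is a hypothetical fixed-layout stand-in for Foo (storing the DateTime as raw ticks keeps it blittable), and it assumes the same 40-byte little-endian layout that the BinaryWriter code above produces.

// Requires System, System.IO and System.Runtime.InteropServices;
// Stream.Read(Span<byte>) needs .NET Core 2.1 or later.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct FooRecord
{
    public long Ticks;   // DateTime.Ticks
    public long ID;
    public double A;
    public double B;
    public double C;
}

static FooRecord[] ReadAll(string file)
{
    using (var stream = File.OpenRead(file))
    {
        var records = new FooRecord[stream.Length / 40];
        // Reinterpret the record array as a span of bytes and fill it directly from the file,
        // so no per-field parsing or copying is needed.
        var bytes = MemoryMarshal.AsBytes(records.AsSpan());
        while (bytes.Length > 0)
        {
            int read = stream.Read(bytes);
            if (read == 0) break;
            bytes = bytes.Slice(read);
        }
        return records;
    }
}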

If you find this solution to work, I'd love to hear the results.

MarkPflug
  • There is a large dataset that grows over time. The data may be received either as a CSV file or as JSON from a web service. The same growing dataset is analyzed over and over. So converting the data to binary and then just adding to it makes a lot of sense. I will probably take that approach. – user4574 May 14 '21 at 15:49