
I have the following code to read in a large file, say one with over a million rows, using Parallel and LINQ approaches. Is there a better way to do it? If so, how?

        private static void ReadFile()
        {
            float floatTester = 0;
            List<float[]> result = File.ReadLines(@"largedata.csv")
                .Where(l => !string.IsNullOrWhiteSpace(l))
                .Select(l => new { Line = l, Fields = l.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries) })
                .Select(x => x.Fields
                              .Where(f => Single.TryParse(f, out floatTester))
                              .Select(f => floatTester).ToArray())
                .ToList();

            // now get your totals
            int numberOfLinesWithData = result.Count;
            int numberOfAllFloats = result.Sum(fa => fa.Length);
            MessageBox.Show(numberOfAllFloats.ToString());
        }

        private static readonly char[] Separators = { ',', ' ' };

        private static void ProcessFile()
        {
            var lines = File.ReadAllLines("largedata.csv");
            var numbers = ProcessRawNumbers(lines);

            var rowTotal = new List<double>();
            var totalElements = 0;

            foreach (var values in numbers)
            {
                var sumOfRow = values.Sum();
                rowTotal.Add(sumOfRow);
                totalElements += values.Count;
            }
            MessageBox.Show(totalElements.ToString());
        }

        private static List<List<double>> ProcessRawNumbers(IEnumerable<string> lines)
        {
            var numbers = new List<List<double>>();
            /*System.Threading.Tasks.*/
            Parallel.ForEach(lines, line =>
            {
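                // NOTE: ProcessLine runs inside the lock here, so the parsing
                // itself is serialized and Parallel.ForEach gains little;
                // row order is also not preserved.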
                lock (numbers)
                {
                    numbers.Add(ProcessLine(line));
                }
            });
            return numbers;
        }

        private static List<double> ProcessLine(string line)
        {
            var list = new List<double>();
            foreach (var s in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            {
                double i;
                if (Double.TryParse(s, out i))
                {
                    list.Add(i);
                }
            }
            return list;
        }

        private void button1_Click(object sender, EventArgs e)
        {
            Stopwatch stopWatchParallel = new Stopwatch();
            stopWatchParallel.Start();
            ProcessFile();
            stopWatchParallel.Stop();
            // Get the elapsed time as a TimeSpan value.
            TimeSpan ts = stopWatchParallel.Elapsed;

            // Format and display the TimeSpan value.
            string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
                ts.Hours, ts.Minutes, ts.Seconds,
                ts.Milliseconds / 10);
            MessageBox.Show(elapsedTime);

            Stopwatch stopWatchLinQ = new Stopwatch();
            stopWatchLinQ.Start();
            ReadFile();
            stopWatchLinQ.Stop();
            // Get the elapsed time as a TimeSpan value.
            TimeSpan ts2 = stopWatchLinQ.Elapsed;

            // Format and display the TimeSpan value.
            string elapsedTimeLinQ = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
                ts2.Hours, ts2.Minutes, ts2.Seconds,
                ts2.Milliseconds / 10);
            MessageBox.Show(elapsedTimeLinQ);
        }
– Next Door Engineer

4 Answers


Recently I faced the problem of parsing large CSV files as fast as possible, for the same purpose: data aggregation and metrics calculation (in my case, the final goal was pivot table generation). I tested the most popular CSV readers but found that they are just not designed for parsing CSV files with a million rows or more; JoshClose's CsvHelper is fast, but in the end I was able to process the CSV as a stream 2x-4x faster!

My approach is based on three principles:

  • avoid creating strings whenever possible, as this wastes memory and CPU (and increases GC pressure). Instead, the parser result can be represented as a set of 'field value' descriptors that hold only the start and end positions in the buffer plus some metadata (a quoted-value flag, the number of double-quotes inside the value); a string value is constructed only when needed (see the sketch after this list).
  • use a circular char[] buffer for reading CSV lines, to avoid excessive data copying
  • no abstractions and minimal method calls; this enables effective JIT optimizations (say, eliding array length checks). No LINQ, no iterators (foreach), as a plain for loop is much more efficient.
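
To make the descriptor idea concrete, here is a simplified illustrative sketch (not the actual library code; the names FieldDescriptor and ParseLine are made up, and quote handling is omitted): fields are recorded as offsets into a shared char[] buffer, and a string is only allocated when the value is actually requested.

// Illustrative only: track field boundaries in a shared char buffer
// and defer string creation until the value is requested.
struct FieldDescriptor
{
    public int Start;   // index of the field's first char in the buffer
    public int End;     // index one past the field's last char

    // The only place a string is allocated, and only on demand.
    public string GetString(char[] buffer)
        => new string(buffer, Start, End - Start);
}

static int ParseLine(char[] buffer, int lineStart, int lineEnd, FieldDescriptor[] fields)
{
    int fieldCount = 0;
    int start = lineStart;
    for (int i = lineStart; i <= lineEnd; i++)
    {
        // a comma (or the end of the line) closes the current field
        if (i == lineEnd || buffer[i] == ',')
        {
            fields[fieldCount++] = new FieldDescriptor { Start = start, End = i };
            start = i + 1;
        }
    }
    return fieldCount;
}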

Real-life usage numbers (pivot table over a 200MB CSV file, 17 columns, only 3 columns used to build the crosstab):

  • my custom CSV reader: ~1.9s
  • CsvHelper: ~6.1s

--- update ---

I've published my library, which works as described above, on GitHub: https://github.com/nreco/csv

Nuget package: https://www.nuget.org/packages/NReco.Csv/
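
For reference, a minimal usage sketch based on the README at the GitHub link above (fields are exposed by index, and a string is created only when a field is accessed):

using (var streamRdr = new StreamReader("largedata.csv"))
{
    var csvReader = new NReco.Csv.CsvReader(streamRdr, ",");
    while (csvReader.Read())
    {
        for (int i = 0; i < csvReader.FieldsCount; i++)
        {
            string val = csvReader[i]; // copied out of the internal buffer on access
        }
    }
}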

– Vitaliy Fedorchenko
  • Did you publish your library on GitHub? – rburte Nov 23 '18 at 21:22
  • @rburte not yet, are you interested in this lib? Code is stable and works fine in the prod env of my product (SeekTable), so I can publish it on github / nuget if someone else also needs this ultra-fast / memory-efficient CSV parser. – Vitaliy Fedorchenko Nov 24 '18 at 08:32
  • Yes, I would appreciate that. One part of my workflow requires a load and scan, so... this is a logjam at the moment. – rburte Nov 27 '18 at 03:00
  • Do you have any plans to update your library to use the new Span and Memory classes? – Akmal Salikhov Jun 02 '20 at 12:34
  • @AkmalSalikhov an issue for this already exists: https://github.com/nreco/csv/issues/1 In fact this is just an alternative API for accessing chars in the internal char[] buffer without copying (the result is a ReadOnlySpan); I'll do it on occasion, or someone may propose the required changes with a PR. I don't expect a significant performance gain from this, as it can only help avoid unnecessary allocations. Internally the parser works directly with a char[] buffer + indexes, which is pretty similar to what Span does under the hood. – Vitaliy Fedorchenko Jun 02 '20 at 15:04

You can use the built-in OleDb provider for that:

public void ImportCsvFile(string filename)
{
    FileInfo file = new FileInfo(filename);

    using (OleDbConnection con =
            new OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0;Data Source=\"" +
            file.DirectoryName + "\";" +
            "Extended Properties='text;HDR=Yes;FMT=Delimited(,)';"))
    {
        using (OleDbCommand cmd = new OleDbCommand(string.Format
                                  ("SELECT * FROM [{0}]", file.Name), con))
        {
            con.Open();

            // Using a DataTable to process the data
            using (OleDbDataAdapter adp = new OleDbDataAdapter(cmd))
            {
                DataTable tbl = new DataTable("MyTable");
                adp.Fill(tbl);

                // Process row by row:
                //foreach (DataRow row in tbl.Rows) { ... }

                // Or directly make a list:
                List<DataRow> list = tbl.AsEnumerable().ToList();
            }
        }
    }
} 

See this and this for further reference.
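
Note that the Jet provider is 32-bit only and missing on many modern systems; an untested variant (assuming the Microsoft Access Database Engine redistributable is installed, as the comment below suggests) swaps in the newer ACE provider:

// Assumes the Microsoft Access Database Engine (ACE) is installed.
string connStr = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=\"" +
                 file.DirectoryName + "\";" +
                 "Extended Properties='text;HDR=Yes;FMT=Delimited(,)';";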

– Robin Maben
  • Looks good and it may save my bacon, but there is one update: Microsoft.Jet.OLEDB.4.0 was deprecated in 2002 and later broken by a KB update in 2017. Use "Microsoft.ACE.OLEDB.12.0" instead; I have used it with Excel 8.0 successfully. – John Pittaway Sep 19 '18 at 19:51

Check out the Fast CSV Reader (Sébastien Lorion's LumenWorks CsvReader on CodeProject).
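
A minimal usage sketch, assuming the LumenWorks package is referenced (the second constructor argument says the first row is a header):

using System.IO;
using LumenWorks.Framework.IO.Csv;

using (var csv = new CsvReader(new StreamReader("largedata.csv"), true))
{
    int fieldCount = csv.FieldCount;
    while (csv.ReadNextRecord())
    {
        for (int i = 0; i < fieldCount; i++)
        {
            string value = csv[i]; // fields are exposed by index
        }
    }
}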


You should take a look at CsvHelper => https://github.com/JoshClose/CsvHelper/

It allows you to map your .csv file to a class, so you can use your .csv file as an object. Try it, and then try applying your parallel operations to see whether you get better performance.

Here is some sample code I have from a project:

using (var csv = new CsvReader(new StreamReader(filePath, Encoding.Default)))
{
    csv.Configuration.Delimiter = ';';
    csv.Configuration.ClassMapping<LogHeaderMap, LogHeader>();

    var data = csv.GetRecords<LogHeader>();

    foreach (var entry in data.OrderByDescending(x => x.Date))
    {
        // process
    }
}
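
For context, LogHeaderMap is a CsvHelper class map that tells the reader how to bind CSV columns to LogHeader properties (see the comment below asking about it). A hypothetical sketch; the property names are made up, and the exact base class and registration call vary by CsvHelper version (older versions use CsvClassMap<T>, newer ones ClassMap<T>):

public class LogHeader
{
    public DateTime Date { get; set; }
    public string Message { get; set; }
}

// Binds CSV columns to LogHeader properties by column name.
public sealed class LogHeaderMap : CsvClassMap<LogHeader>
{
    public LogHeaderMap()
    {
        Map(m => m.Date).Name("Date");
        Map(m => m.Message).Name("Message");
    }
}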
– h1ghfive
  • `csv.Configuration.ClassMapping();` => confused about `LogHeaderMap`. What is it? Is it related to the `LogHeader` class? If so, what's the relation? – MD TAREQ HASSAN Nov 16 '17 at 01:28