
I have the following code to read in a large file, say one with over a million rows, using Parallel and LINQ approaches. Is there a better way to do it? If so, how?

        private static void ReadFile()
        {
            float floatTester = 0;
            List<float[]> result = File.ReadLines(@"largedata.csv")
                .Where(l => !string.IsNullOrWhiteSpace(l))
                .Select(l => new { Line = l, Fields = l.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries) })
                .Select(x => x.Fields
                              .Where(f => Single.TryParse(f, out floatTester))
                              .Select(f => floatTester).ToArray())
                .ToList();

            // now get your totals
            int numberOfLinesWithData = result.Count;
            int numberOfAllFloats = result.Sum(fa => fa.Length);
            MessageBox.Show(numberOfAllFloats.ToString());
        }

        private static readonly char[] Separators = { ',', ' ' };

        private static void ProcessFile()
        {
            var lines = File.ReadAllLines("largedata.csv");
            var numbers = ProcessRawNumbers(lines);

            var rowTotal = new List<double>();
            var totalElements = 0;

            foreach (var values in numbers)
            {
                var sumOfRow = values.Sum();
                rowTotal.Add(sumOfRow);
                totalElements += values.Count;
            }
            MessageBox.Show(totalElements.ToString());
        }

        private static List<List<double>> ProcessRawNumbers(IEnumerable<string> lines)
        {
            var numbers = new List<List<double>>();
            /*System.Threading.Tasks.*/
            Parallel.ForEach(lines, line =>
            {
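                // NOTE: ProcessLine runs inside the lock here, so the parsing
                // itself is serialized and Parallel.ForEach gains little;
                // row order is also not preserved.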
                lock (numbers)
                {
                    numbers.Add(ProcessLine(line));
                }
            });
            return numbers;
        }

        private static List<double> ProcessLine(string line)
        {
            var list = new List<double>();
            foreach (var s in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            {
                double i;
                if (Double.TryParse(s, out i))
                {
                    list.Add(i);
                }
            }
            return list;
        }

        private void button1_Click(object sender, EventArgs e)
        {
            Stopwatch stopWatchParallel = new Stopwatch();
            stopWatchParallel.Start();
            ProcessFile();
            stopWatchParallel.Stop();
            // Get the elapsed time as a TimeSpan value.
            TimeSpan ts = stopWatchParallel.Elapsed;

            // Format and display the TimeSpan value.
            string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
                ts.Hours, ts.Minutes, ts.Seconds,
                ts.Milliseconds / 10);
            MessageBox.Show(elapsedTime);

            Stopwatch stopWatchLinQ = new Stopwatch();
            stopWatchLinQ.Start();
            ReadFile();
            stopWatchLinQ.Stop();
            // Get the elapsed time as a TimeSpan value.
            TimeSpan ts2 = stopWatchLinQ.Elapsed;

            // Format and display the TimeSpan value.
            string elapsedTimeLinQ = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
                ts2.Hours, ts2.Minutes, ts2.Seconds,
                ts2.Milliseconds / 10);
            MessageBox.Show(elapsedTimeLinQ);
        }
– Next Door Engineer

4 Answers


Recently I faced the problem of parsing large CSV files as fast as possible, for the same purpose: data aggregation and metrics calculation (in my case, the final goal was pivot table generation). I tested the most popular CSV readers but found that they are just not designed for parsing CSV files with a million rows or more; JoshClose's CsvHelper is fast, but in the end I was able to process the CSV as a stream 2x-4x faster!

My approach is based on three principles:

  • avoid creating strings whenever possible, as this wastes memory and CPU (and increases GC pressure). Instead, the parser result can be represented as a set of 'field value' descriptors that hold only the start and end positions in the buffer plus some metadata (a quoted-value flag, the number of double-quotes inside the value); a string value is constructed only when needed (see the sketch after this list).
  • use a circular char[] buffer for reading CSV lines, to avoid excessive data copying
  • no abstractions and minimal method calls; this enables effective JIT optimizations (say, eliding array length checks). No LINQ, no iterators (foreach), as a plain for loop is much more efficient.
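
To make the descriptor idea concrete, here is a simplified illustrative sketch (not the actual library code; the names FieldDescriptor and ParseLine are made up, and quote handling is omitted): fields are recorded as offsets into a shared char[] buffer, and a string is only allocated when the value is actually requested.

// Illustrative only: track field boundaries in a shared char buffer
// and defer string creation until the value is requested.
struct FieldDescriptor
{
    public int Start;   // index of the field's first char in the buffer
    public int End;     // index one past the field's last char

    // The only place a string is allocated, and only on demand.
    public string GetString(char[] buffer)
        => new string(buffer, Start, End - Start);
}

static int ParseLine(char[] buffer, int lineStart, int lineEnd, FieldDescriptor[] fields)
{
    int fieldCount = 0;
    int start = lineStart;
    for (int i = lineStart; i <= lineEnd; i++)
    {
        // a comma (or the end of the line) closes the current field
        if (i == lineEnd || buffer[i] == ',')
        {
            fields[fieldCount++] = new FieldDescriptor { Start = start, End = i };
            start = i + 1;
        }
    }
    return fieldCount;
}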

Real-life usage numbers (pivot table over a 200MB CSV file, 17 columns, only 3 columns used to build the crosstab):

  • my custom CSV reader: ~1.9s
  • CsvHelper: ~6.1s

--- update ---

I've published my library, which works as described above, on GitHub: https://github.com/nreco/csv

Nuget package: https://www.nuget.org/packages/NReco.Csv/
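
For reference, a minimal usage sketch based on the README at the GitHub link above (fields are exposed by index, and a string is created only when a field is accessed):

using (var streamRdr = new StreamReader("largedata.csv"))
{
    var csvReader = new NReco.Csv.CsvReader(streamRdr, ",");
    while (csvReader.Read())
    {
        for (int i = 0; i < csvReader.FieldsCount; i++)
        {
            string val = csvReader[i]; // copied out of the internal buffer on access
        }
    }
}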

– Vitaliy Fedorchenko
  • Did you publish your library on GitHub? – rburte Nov 23 '18 at 21:22
  • @rburte not yet, are you interested in this lib? Code is stable and works fine in the prod env of my product (SeekTable), so I can publish it on github / nuget if someone else also needs this ultra-fast / memory-efficient CSV parser. – Vitaliy Fedorchenko Nov 24 '18 at 08:32
  • Yes, I would appreciate that. One part of my workflow requires a load and scan, so... this is a logjam at the moment. – rburte Nov 27 '18 at 03:00
  • Do you have any plans to update your library to use the new Span and Memory classes? – Akmal Salikhov Jun 02 '20 at 12:34
  • @AkmalSalikhov an issue for this already exists: https://github.com/nreco/csv/issues/1 In fact this is just an alternative API for accessing chars in the internal char[] buffer without copying (the result is a ReadOnlySpan); I'll do it on occasion, or someone may propose the required changes with a PR. I don't expect a significant performance gain from this, as it can only help avoid unnecessary allocations. Internally the parser works directly with a char[] buffer + indexes, which is pretty similar to what Span does under the hood. – Vitaliy Fedorchenko Jun 02 '20 at 15:04

You can use the built-in OleDb provider for that:

public void ImportCsvFile(string filename)
{
    FileInfo file = new FileInfo(filename);

    using (OleDbConnection con =
            new OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0;Data Source=\"" +
            file.DirectoryName + "\";" +
            "Extended Properties='text;HDR=Yes;FMT=Delimited(,)';"))
    {
        using (OleDbCommand cmd = new OleDbCommand(string.Format
                                  ("SELECT * FROM [{0}]", file.Name), con))
        {
            con.Open();

            // Using a DataTable to process the data
            using (OleDbDataAdapter adp = new OleDbDataAdapter(cmd))
            {
                DataTable tbl = new DataTable("MyTable");
                adp.Fill(tbl);

                // Process row by row:
                //foreach (DataRow row in tbl.Rows) { ... }

                // Or directly make a list:
                List<DataRow> list = tbl.AsEnumerable().ToList();
            }
        }
    }
} 

See this and this for further reference.
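
Note that the Jet provider is 32-bit only and missing on many modern systems; an untested variant (assuming the Microsoft Access Database Engine redistributable is installed, as the comment below suggests) swaps in the newer ACE provider:

// Assumes the Microsoft Access Database Engine (ACE) is installed.
string connStr = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=\"" +
                 file.DirectoryName + "\";" +
                 "Extended Properties='text;HDR=Yes;FMT=Delimited(,)';";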

– Robin Maben
  • Looks good and it may save my bacon, but there is one update: Microsoft.Jet.OLEDB.4.0 was deprecated in 2002 and later broken by a KB update in 2017. Use "Microsoft.ACE.OLEDB.12.0" instead; I have used it with Excel 8.0 successfully. – John Pittaway Sep 19 '18 at 19:51

Check out the Fast CSV Reader (Sébastien Lorion's LumenWorks CsvReader on CodeProject).
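
A minimal usage sketch, assuming the LumenWorks package is referenced (the second constructor argument says the first row is a header):

using System.IO;
using LumenWorks.Framework.IO.Csv;

using (var csv = new CsvReader(new StreamReader("largedata.csv"), true))
{
    int fieldCount = csv.FieldCount;
    while (csv.ReadNextRecord())
    {
        for (int i = 0; i < fieldCount; i++)
        {
            string value = csv[i]; // fields are exposed by index
        }
    }
}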


You should take a look at CsvHelper => https://github.com/JoshClose/CsvHelper/

It allows you to map your .csv file to a class, so you can use your .csv file as an object. Try it, and then try applying your parallel operations to see whether you get better performance.

Here is some sample code I have from a project:

using (var csv = new CsvReader(new StreamReader(filePath, Encoding.Default)))
{
    csv.Configuration.Delimiter = ';';
    csv.Configuration.ClassMapping<LogHeaderMap, LogHeader>();

    var data = csv.GetRecords<LogHeader>();

    foreach (var entry in data.OrderByDescending(x => x.Date))
    {
        // process
    }
}
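
For context, LogHeaderMap is a CsvHelper class map that tells the reader how to bind CSV columns to LogHeader properties (see the comment below asking about it). A hypothetical sketch; the property names are made up, and the exact base class and registration call vary by CsvHelper version (older versions use CsvClassMap<T>, newer ones ClassMap<T>):

public class LogHeader
{
    public DateTime Date { get; set; }
    public string Message { get; set; }
}

// Binds CSV columns to LogHeader properties by column name.
public sealed class LogHeaderMap : CsvClassMap<LogHeader>
{
    public LogHeaderMap()
    {
        Map(m => m.Date).Name("Date");
        Map(m => m.Message).Name("Message");
    }
}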
– h1ghfive
  • `csv.Configuration.ClassMapping();` => confused about `LogHeaderMap`. What is it? Is it related to the `LogHeader` class? If so, what's the relation? – MD TAREQ HASSAN Nov 16 '17 at 01:28