
I have a huge CSV file, to be specific a .TAB file with 29 million rows, and the file size is around 600 MB. I need to read this into an IEnumerable collection.

I have tried CsvHelper, GenericParser, and a few other solutions, but I always end up with an OutOfMemoryException.

Please suggest a way to do this.

I have tried:

var deliveryPoints = new List<Point>();

using (TextReader csvreader1 = File.OpenText(@"C:\testfile\Prod\PCDP1705.TAB"))
using (var csvR1 = new CsvReader(csvreader1, csvconfig))
{
    csvR1.Configuration.RegisterClassMap<DeliveryMap>();
    deliveryPoints = csvR1.GetRecords<Point>().ToList();
}

using (GenericParser parser = new GenericParser())
{
    parser.SetDataSource(@"C:\testfile\Prod\PCDP1705.TAB");

    parser.ColumnDelimiter = '\t';
    parser.FirstRowHasHeader = false;
    //parser.SkipStartingDataRows = 10;
    //parser.MaxBufferSize = 4096;
    //parser.MaxRows = 500;
    parser.TextQualifier = '\"';

    while (parser.Read())
    {
        var address = new Point();
        address.PostCodeID = int.Parse(parser[0]);
        address.DPS = parser[1];
        address.OrganisationFlag = parser[2];
        deliveryPoints.Add(address);
    }
}

and

var deliveryPoints = new List<Point>();
using (var csvreader = new StreamReader(@"C:\testfile\Prod\PCDP1705.TAB"))
using (var csv = new CsvReader(csvreader, csvconfig))
{
    while (csv.Read())
    {
        var address = new Point();
        address.PostCodeID = int.Parse(csv.GetField(0));
        address.DPS = csv.GetField(1);
        deliveryPoints.Add(address);
    }
}
Lee
  • In all of those cases, are you sure you're streaming results rather than putting them all into a list or some similar in-memory data structure? Please show the CsvHelper code you're using, for example. – Jon Skeet Jun 05 '17 at 05:47
  • var deliveryPoints = new List<Point>(); using (TextReader csvreader1 = File.OpenText(@"C:\testfile\Prod\PCDP1705.TAB")) using (var csvR1 = new CsvReader(csvreader1, csvconfig)) { csvR1.Configuration.RegisterClassMap<DeliveryMap>(); deliveryPoints = csvR1.GetRecords<Point>().ToList(); } – Lee Jun 05 '17 at 05:58
  • Please *edit the question* rather than putting code in a comment. But as I suspected, you're trying to load all the data in one go. That's not the same as "I need to read this into an IEnumerable collection", which implies you're able to stream. – Jon Skeet Jun 05 '17 at 05:59
  • Possible duplicate of [How to read a large (1 GB) txt file in .NET?](https://stackoverflow.com/questions/4273699/how-to-read-a-large-1-gb-txt-file-in-net) – pcdev Jun 05 '17 at 06:07
  • Do you really need 29 million rows in memory all at the same time? How about using SQL bulk insert to load the data into a properly indexed table and then smartly querying the table for the actual rows you want/need? That is a much more efficient plan. – Mark Kram Mar 29 '18 at 23:54
  • When I worked on the Postoffice Address File (which coincidentally has 29 million rows!), I found it easier to reduce the number of columns to start with so the overall memory footprint was reduced. One way to do this is to use a StreamReader and StreamWriter in concert, so you read in one line at a time, reduce the number of columns, then write the line out. I eventually squeezed the PAF into 100 MB of data (compressed but searchable) in the memory of an iPhone, and that included postcodes and street names and numbers. – David Bolton Aug 26 '20 at 18:34
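
A minimal sketch of the column-reduction approach described in the last comment above, assuming a tab-delimited input and keeping only the first three columns (the output path and column indexes are illustrative, not from the question):

using (var reader = new StreamReader(@"C:\testfile\Prod\PCDP1705.TAB"))
using (var writer = new StreamWriter(@"C:\testfile\Prod\PCDP1705_trimmed.TAB"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Keep only the columns actually needed before any further processing.
        string[] columns = line.Split('\t');
        writer.WriteLine(string.Join("\t", columns[0], columns[1], columns[2]));
    }
}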

3 Answers


The problem is that you are loading the entire file into memory. You can compile your code for x64, which will significantly raise the memory limit for your process, but that is not recommended if you can avoid loading the entire file into memory in the first place.

Notice that calling ToList() forces CsvReader to load the entire file into memory at once:

csvR1.GetRecords<Point>().ToList();

But this loads only one record at a time:

foreach(var record in csvR1.GetRecords<Point>())
{
    //do whatever with the single record
}

This way you can process files of practically unlimited size.
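
If the calling code really needs an IEnumerable<Point>, here is a minimal sketch that keeps the streaming behaviour by deferring with yield return. It reuses the question's Point, DeliveryMap and csvconfig, and assumes csvconfig is a CsvHelper CsvConfiguration; those names come from the question rather than being defined here:

public static IEnumerable<Point> ReadDeliveryPoints(string path, CsvConfiguration csvconfig)
{
    using (var reader = File.OpenText(path))
    using (var csv = new CsvReader(reader, csvconfig))
    {
        csv.Configuration.RegisterClassMap<DeliveryMap>();
        foreach (var record in csv.GetRecords<Point>())
        {
            // Only one record is materialised at a time.
            yield return record;
        }
    }
}

Callers can then foreach over ReadDeliveryPoints(...) and decide what to keep; the whole file is only pulled into memory if they call ToList() on the result.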

Liero

No need to use third-party software. Use the .NET library methods:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Data;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string inputLine;
            using (StreamReader csvreader = new StreamReader(@"C:\testfile\Prod\PCDP1705.TAB"))
            {
                while ((inputLine = csvreader.ReadLine()) != null)
                {
                    var address = new Point();
                    // The .TAB file is tab-delimited, so split on '\t' rather than ','
                    string[] csvArray = inputLine.Split(new char[] { '\t' });
                    address.postCodeID = int.Parse(csvArray[0]);
                    address.DPS = csvArray[1];
                    Point.deliveryPoints.Add(address);
                }
            }

            //add data to datatable
            DataTable dt = new DataTable();
            dt.Columns.Add("Post Code", typeof(int));
            dt.Columns.Add("DPS", typeof(string));

            foreach (Point point in Point.deliveryPoints)
            {
                dt.Rows.Add(new object[] { point.postCodeID, point.DPS });
            }

        }
    }
    public class Point
    {
        public static List<Point> deliveryPoints = new List<Point>();
        public int postCodeID { get; set; }
        public string DPS { get; set; }

    }
}
jdweng
  • Thanks jdweng. I have tried the above solution, but it crashed with an out-of-memory exception at 16 million records. – Lee Jun 05 '17 at 23:36
  • Never seen this happen before. This is very basic code. How old is your PC? Try running with Task Manager open to see if it is this application that is hogging memory or something else on your PC. – jdweng Jun 06 '17 at 02:43
  • What component in the Point class would be dumped to a data table? Ultimately, I need to get it into a DataGridView. –  Mar 29 '18 at 23:26
  • This solution solved my problem too. I have about 2.6 million rows of data, around 300 MB in CSV file size. I also tried CsvHelper, and it crashed after processing about 900K rows. No need to set – Lei Shi Apr 13 '18 at 17:29

It worked by running in x64 mode and adding <gcAllowVeryLargeObjects enabled="true" /> to app.config.
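
For reference, a sketch of where that setting goes: gcAllowVeryLargeObjects sits under the runtime element of app.config and only takes effect in 64-bit processes:

<configuration>
  <runtime>
    <!-- Allows objects larger than 2 GB in 64-bit processes -->
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>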

Lee
  • You didn't remedy the cause of the problem, you just steered around it - for the time being! You should really consider the answer by @jdweng - it scales well and doesn't chew up your resources getting there! – Monza Jun 09 '17 at 02:46
  • You need to specify: ` ` –  Mar 30 '18 at 23:02