3

I have been assigned a project that requires a C# console application to manipulate a text file. The text file is a bcp table dump. The program should be able to:

  1. Split the file into multiple files based on a column given by the user
  2. Include or exclude the split column from the output

Currently, I am reading in the file as such:

var groupQuery = from name in File.ReadAllLines(fileName)
                                  .Skip(skipHeaderRow)
                 let n = name.Split(delimiterChars)
                 group name by n[index] into g
                 // orderby g.Key
                 select g;

I am afraid I might run into memory issues, since some of the files can have over 2 million records and each row is about 2617 bytes (roughly 5 GB per file).

user990423
  • `ReadAllLines` will load the entire file into memory (alarm bells!) and will fail if the file is larger than 2GiB. Not to mention how long it will take. – Dai Sep 21 '15 at 17:47
  • Consult [Extremely Large File parse](http://stackoverflow.com/questions/26247952/). – Dour High Arch Sep 21 '15 at 17:49
  • I will upload this file to a DB and then handle the records from there (you can even use LINQ at that point, .Take(1000)/.Skip(1000), and process records in chunks of 1000 at a time). – ProgrammerV5 Sep 21 '15 at 18:08

3 Answers

2

If you are confident that your program will only need sequential access to the bcp dump file, use the StreamReader class to read it. This class is optimized for sequential access and opens the file as a stream, so memory should not be a concern. You can also increase the stream's buffer size by using one of the other constructor overloads of this class, giving you a larger chunk in memory to work with.
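
For example, a rough sketch of sequential reading with a larger buffer (the file name, tab delimiter and 1 MiB buffer size are assumptions, not taken from the question); the StreamReader(path, encoding, detectEncodingFromByteOrderMarks, bufferSize) overload lets you choose the buffer size:

using System;
using System.IO;
using System.Text;

class SequentialSplit
{
    static void Main()
    {
        // 1 MiB buffer instead of the default; tune to your hardware.
        using (var reader = new StreamReader("dump.bcp", Encoding.UTF8, true, 1 << 20))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Each line is one bcp row; bcp's default field terminator
                // is a tab, but match whatever your dump actually uses.
                var fields = line.Split('\t');
                // ... decide which output file this row belongs to here ...
            }
        }
    }
}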


If you want random access to your file in pieces, go for Memory Mapped Files. Make sure to create a view accessor over a limited section of the file; the example code at the MMF link shows how to create a small view over a large file.


Edit: I had code for using MMFs in my answer, but I have removed it because, even though group by is lazily evaluated, it is a non-streaming LINQ operator. It therefore has to read your entire bcp dump before it can give you any results. This implies:

  1. StreamReader is clearly the better approach for you. Make sure you increase the buffer size as much as is practical;
  2. Your LINQ query will stall when it hits the group by operator and will only come back to life after the entire file has been read (see the sketch below).
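
As an illustration of point 2, reusing the variables from the question (fileName, skipHeaderRow, delimiterChars, index): swapping File.ReadAllLines for the lazy File.ReadLines stops the whole file from being materialized as an array up front, but the group by still has to consume every line before it yields its first group.

// File.ReadLines streams lines lazily instead of building a string[] in memory.
var groupQuery = from name in File.ReadLines(fileName).Skip(skipHeaderRow)
                 let n = name.Split(delimiterChars)
                 group name by n[index] into g
                 select g;

// GroupBy is non-streaming: this loop only starts producing output after
// every line of the file has been read and bucketed in memory.
foreach (var g in groupQuery)
{
    Console.WriteLine(g.Key + ": " + g.Count() + " rows");
}
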
displayName
1

Try using Buffered Streams to read/write files without completely loading them into memory.

using (FileStream fs = File.Open(inputFile, FileMode.Open, FileAccess.Read, FileShare.ReadWrite)) {
    using (StreamReader sr = new StreamReader(fs)) {
        string line = sr.ReadLine();
        string lineA = null;
        string lineB = null;
        while (line != null) {
            // Split your line here into lineA and lineB
            // and write using a buffered writer.
            line = sr.ReadLine();
        }
    }
}

(from here)

The idea is to read the file line by line without loading the entire thing into memory, split each line however you want, and then write the split lines, line by line, into your output files.
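
Applied to the splitting requirement, a rough sketch might keep one StreamWriter per distinct value of the chosen column and route each line to the matching writer. (inputFile, splitColumnIndex, includeSplitColumn and the tab delimiter are assumed names, not from the question; the usual System.IO, System.Linq and System.Collections.Generic usings are assumed.)

var writers = new Dictionary<string, StreamWriter>();
try
{
    using (var reader = new StreamReader(inputFile))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            var fields = line.Split('\t');          // assumed bcp field terminator
            string key = fields[splitColumnIndex];  // column chosen by the user

            StreamWriter writer;
            if (!writers.TryGetValue(key, out writer))
            {
                writer = new StreamWriter(key + ".txt");
                writers[key] = writer;
            }

            // Requirement 2: optionally drop the split column from the output.
            string output = includeSplitColumn
                ? line
                : string.Join("\t", fields.Where((f, i) => i != splitColumnIndex));

            writer.WriteLine(output);
        }
    }
}
finally
{
    foreach (var w in writers.Values)
        w.Dispose();
}

If the split column has a very large number of distinct values you can run out of file handles, in which case you would need to close idle writers or sort the input first.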

Shreyas Kapur
  • FYI - explicitly adding buffered streams around file and network streams went out years ago: http://stackoverflow.com/questions/492283/when-to-use-net-bufferedstream-class - and you can build a StreamReader directly off a file, so there is some potential for simplifying this code. – The other other Alan Sep 21 '15 at 19:03
0

Don't reinvent the wheel. Consider using a library like FileHelpers.

http://www.filehelpers.net/example/QuickStart/ReadWriteRecordByRecord/

var engine = new FileHelperAsyncEngine<Customer>();

using (engine.BeginReadFile(fileName))
{
    var groupQuery =
        from o in engine
        group o by o.CustomerId into g
        // orderby g.Key
        select g;

    foreach (var g in groupQuery)
    {
        foreach (Customer cust in g)
        {
            Console.WriteLine(cust.Name);
        }
    }
}
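
The snippet above assumes a Customer class mapped to the dump's layout with FileHelpers attributes; a minimal sketch (the tab delimiter and field names are assumptions, not from the question) could look like this:

using FileHelpers;

[DelimitedRecord("\t")]
public class Customer
{
    public int CustomerId;

    public string Name;

    // One public field per column of the bcp dump, declared in column order.
}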

You will still run into memory problems with your group and order functions because all records need to be in memory to be grouped and ordered.

joelnet