
We are trying to split a large text file into smaller chunks with the code below.

    public static int SplitFile(string fileName, string tmpFolder, List<string> queue, int splitSize = 100000)
    {
        int chunk = 0;
        if (!Directory.Exists(tmpFolder))
            Directory.CreateDirectory(tmpFolder);
        using (var lineIterator = File.ReadLines(fileName).GetEnumerator())
        {
            bool stillGoing = true;
            for (chunk = 0; stillGoing; chunk++)
            {
                stillGoing = WriteChunk(lineIterator, splitSize, chunk, tmpFolder, queue);
            }
        }
        return chunk;
    }

    private static bool WriteChunk(IEnumerator<string> lineIterator,
                                   int splitSize, int chunk, string tmpFolder, List<string> queue)
    {
        string splitFile = Path.Combine(tmpFolder, "file" + chunk + ".txt");

        using (var writer = File.CreateText(splitFile))
        {
            queue.Add(splitFile);
            for (int i = 0; i < splitSize; i++)
            {
                if (!lineIterator.MoveNext())
                {
                    return false;
                }
                writer.WriteLine(lineIterator.Current);
            }
        }

        return true;
    }

It creates around 36 text files (around 800 MB), but then starts throwing an "Out of memory" exception while creating the 37th file, at lineIterator.MoveNext().

Meanwhile, lineIterator.Current still shows a value in the debugger.

Anil
  • Have you tried with some arrays and removing items as you read/write them? – lcssanches Jun 03 '13 at 13:52
  • Here's an alternative using an iterator to read line by line that doesn't try to pull the whole file in memory: http://stackoverflow.com/questions/1271225/c-sharp-reading-a-file-line-by-line – neontapir Jun 03 '13 at 13:57
  • 2
    Depending on how long the lines are, you'll probably run into large object heap fragmentation problems with this method – Earlz Jun 03 '13 at 14:02
  • @Earlz That does seem most likely, but these would be some **very** long lines. – Joel Coehoorn Jun 03 '13 at 14:19
  • What are you reading that you have 20GB of text? Are these binary files you are using `ReadLines` on? – Scott Chamberlain Jun 03 '13 at 14:35
  • @JoelCoehoorn if he is using the default `splitSize` value (**100000** ) it is not necessary to have very long lines to reach the LOH – polkduran Jun 03 '13 at 14:50
  • @akfkmupiwu, what is the value of your `splitSize` variable? Have you tried with a lower value than 100000? – polkduran Jun 03 '13 at 14:52
  • @JoelCoehoorn depending on the file format, 8000 characters isn't "huge" – Earlz Jun 03 '13 at 15:15
  • @Earlz 42500 (85000/2 bytes per char) is though. – Joel Coehoorn Jun 03 '13 at 15:16
  • @polkduran Splitsize is how many lines per file, but each line is read/written to the new stream one at a time, such that only one line is in ram at a time. That will only hit the LOH if individual lines are large enough – Joel Coehoorn Jun 03 '13 at 15:17
  • @JoelCoehoorn er. Yea, forgot. 85K not 8K. Although, it *maybe* actually only has to be 42500 characters, since characters are actually 2 bytes. I was assuming this would be a machine readable format though, I wouldn't say such a line size is unreasonable for a CSV file, for instance – Earlz Jun 03 '13 at 15:19
  • Hi polkduran, I have tried with splitSize = 20000. – Anil Jun 03 '13 at 15:30
  • Isn't this a candidate to use Task parallel library? And, if this is in asp.net, what would the user do while the splitting is on? – shahkalpesh Jun 03 '13 at 15:37
  • @JoelCoehoorn I was thinking about an internal `buffer` used by the `StreamWriter` but the default buffer size is 1024 I think. – polkduran Jun 03 '13 at 15:39
  • @akfkmupiwu, just to be sure the writer is not having a `buffer` size problem, can you try to make a `Flush` after the `WriteLine` call. I know this will affect performance but just for check if the `StreamWriter` isn't having problems. – polkduran Jun 03 '13 at 15:42
  • All, I have tried this code on two separate machines with the same config, and it worked on the other one. Then I published this on the problematic system and it worked perfectly. Even now I am not sure why this does not work with Visual Studio 2010 in debug mode. Another piece of info: we are using MVC 3. I am logged in to the system as an Admin user. – Anil Jun 03 '13 at 15:57

1 Answer


As it's a huge file, you should read it using the Seek and ReadBytes methods of BinaryReader.

You can see a simple example here. After you use ReadBytes, check for the last newline, and write out the processed chunk after a certain number of lines read. Don't write every line individually as you read it, and also don't keep all the data in memory.

The rest is in your hands.
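A minimal sketch of that idea (names like `Split` and `chunkBytes` are illustrative, not from the thread; I also use `BinaryReader.Read` into a reusable buffer rather than `ReadBytes`, which would allocate a fresh array per call): read a fixed-size byte buffer, scan backwards for the last newline, write only up to that point, then `Seek` the stream back so the partial line is re-read with the next chunk.

```csharp
using System;
using System.IO;

static class ChunkSplitter
{
    // Illustrative sketch: split 'fileName' into roughly chunkBytes-sized
    // pieces, cutting only at line boundaries, holding one buffer in memory.
    public static int Split(string fileName, string tmpFolder, int chunkBytes = 64 * 1024 * 1024)
    {
        Directory.CreateDirectory(tmpFolder);
        int chunk = 0;
        using (var reader = new BinaryReader(File.OpenRead(fileName)))
        {
            var buffer = new byte[chunkBytes];
            while (true)
            {
                int read = reader.Read(buffer, 0, buffer.Length);
                if (read == 0) break;

                // Find the last newline so we only cut between lines.
                int cut = read;
                if (read == buffer.Length)            // probably not the final chunk
                {
                    cut = Array.LastIndexOf(buffer, (byte)'\n', read - 1) + 1;
                    if (cut == 0) cut = read;         // a single line longer than the buffer
                }

                string splitFile = Path.Combine(tmpFolder, "file" + chunk + ".txt");
                using (var output = File.Create(splitFile))
                    output.Write(buffer, 0, cut);

                // Rewind past the partial line so the next chunk re-reads it.
                reader.BaseStream.Seek(cut - read, SeekOrigin.Current);
                chunk++;
            }
        }
        return chunk;
    }
}
```

This keeps memory flat at one buffer regardless of file size, at the cost of assuming a single-byte newline encoding.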

Maybe it is related to this one: When does File.ReadLines free resources

IEnumerable doesn't inherit from IDisposable because typically, the class that implements it only gives you the promise of being enumerable, it hasn't actually done anything yet that warrants disposal.
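The distinction matters in practice: File.ReadLines returns an IEnumerable<string>, and it is the *enumerator* that holds the file handle. A small sketch of both streaming shapes (ReadBothWays is an illustrative name, not from the thread):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ReadLinesDisposal
{
    // Both shapes stream one line at a time and release the file handle.
    public static List<string> ReadBothWays(string path)
    {
        var lines = new List<string>();

        // foreach disposes the enumerator (and the underlying file handle)
        // automatically, even if an exception is thrown mid-loop:
        foreach (var line in File.ReadLines(path))
            lines.Add(line);

        // Enumerating by hand needs the explicit using, as in the question:
        using (var e = File.ReadLines(path).GetEnumerator())
            while (e.MoveNext())
                lines.Add(e.Current);

        return lines;
    }
}
```

Either way, only the current line lives in memory; the enumerable itself holds nothing until enumeration starts.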

Onur Topal
  • 4
    The whole point of [File.ReadLines](http://msdn.microsoft.com/en-us/library/dd383503.aspx) is specifically that it doesn't read the whole file into memory. There's something else going wrong here. – Lasse V. Karlsen Jun 03 '13 at 14:01
  • Based on your quote it seems you stopped reading the answer after that first paragraph. I suggest you keep reading. Also note that when the file handle is released is different from how long lines of text are stored in memory. You can read a line of text and then release that memory before releasing the file handle. – Servy Jun 03 '13 at 15:32