2

I'm working on a project that will load 100+ GB text files, and one of the steps is to count the rows in a specified file. I have to do it the following way to avoid an out-of-memory exception. Is there a faster way, or what is the most efficient way to multitask this? (I know that you can do something like run it on 4 threads and combine the per-thread counts, but I don't know the most efficient way.)

uint loadCount2 = 0;
foreach (var line in File.ReadLines(currentPath))
{
    loadCount2++;
}

I planned on running the program on a server with 4 dual-core CPUs and 40 GB RAM once I have settled on a final location for it. Currently it runs on a temporary small server with 4 cores and 8 GB RAM. (I don't know how threading would behave across multiple CPUs.)
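
For reference, a minimal sketch of that multi-threaded idea, assuming ASCII or UTF-8 files (so '\n' is a single byte); CountLinesChunked and its chunk math are illustrative, not code from the project. Each worker scans a disjoint byte range of the file for '\n' and the partial counts are summed:

// Requires System, System.IO, and System.Linq.
static long CountLinesChunked(string path, int chunkCount)
{
    long fileLength = new FileInfo(path).Length;
    long chunkSize = fileLength / chunkCount + 1; // ranges jointly cover the whole file

    return Enumerable.Range(0, chunkCount)
        .AsParallel()
        .Select(chunk =>
        {
            long count = 0;
            var buffer = new byte[1024 * 1024];
            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
            {
                fs.Seek(chunk * chunkSize, SeekOrigin.Begin);
                long remaining = Math.Min(chunkSize, fileLength - chunk * chunkSize);
                while (remaining > 0)
                {
                    int read = fs.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                    if (read == 0) break;
                    for (int i = 0; i < read; i++)
                        if (buffer[i] == (byte)'\n') count++;
                    remaining -= read;
                }
            }
            return count;
        })
        .Sum();
}

// Usage (hypothetical): long total = CountLinesChunked(currentPath, Environment.ProcessorCount);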


I tested a lot of your suggestions.

Stopwatch sw2 = Stopwatch.StartNew();
using (FileStream fs = File.Open(json, FileMode.Open))
    CountLinesMaybe(fs);

TimeSpan t = TimeSpan.FromMilliseconds(sw2.ElapsedMilliseconds);
string answer = string.Format("{0:D2}h:{1:D2}m:{2:D2}s:{3:D3}ms", t.Hours, t.Minutes, t.Seconds, t.Milliseconds);
Console.WriteLine(answer);
sw2.Restart();

// loadCount2++ from multiple threads is a data race and gives an unreliable
// count; Interlocked makes the increment atomic (it needs an int/long).
long parallelCount = 0;
Parallel.ForEach(File.ReadLines(json), (line) =>
{
    Interlocked.Increment(ref parallelCount);
});

t = TimeSpan.FromMilliseconds(sw2.ElapsedMilliseconds);
answer = string.Format("{0:D2}h:{1:D2}m:{2:D2}s:{3:D3}ms", t.Hours, t.Minutes, t.Seconds, t.Milliseconds);
Console.WriteLine(answer);
sw2.Restart();
loadCount2 = 0;

foreach (var line in File.ReadLines(json))
{
    loadCount2++;
}

t = TimeSpan.FromMilliseconds(sw2.ElapsedMilliseconds);
answer = string.Format("{0:D2}h:{1:D2}m:{2:D2}s:{3:D3}ms", t.Hours, t.Minutes, t.Seconds, t.Milliseconds);
Console.WriteLine(answer);
sw2.Restart();
loadCount2 = 0;

// Count raw '\n' bytes, one ReadByte() call at a time.
int query = (int)Convert.ToByte('\n');
using (var stream = File.OpenRead(json))
{
    int current;
    do
    {
        current = stream.ReadByte();
        if (current == query)
        {
            loadCount2++;
        }
    } while (current != -1);
}

t = TimeSpan.FromMilliseconds(sw2.ElapsedMilliseconds);
answer = string.Format("{0:D2}h:{1:D2}m:{2:D2}s:{3:D3}ms", t.Hours, t.Minutes, t.Seconds, t.Milliseconds);
Console.WriteLine(answer);
Console.ReadKey();

    private const char CR = '\r';
    private const char LF = '\n';
    private const char NULL = (char)0;

    public static long CountLinesMaybe(Stream stream)
    {
        //Ensure.NotNull(stream, nameof(stream));

        var lineCount = 0L;

        // 1 MB read buffer; the hot loop below is unrolled 4 bytes per pass.
        var byteBuffer = new byte[1024 * 1024];
        const int BytesAtTheTime = 4;
        var detectedEOL = NULL;   // the first CR or LF seen becomes the line terminator
        var currentChar = NULL;

        int bytesRead;
        while ((bytesRead = stream.Read(byteBuffer, 0, byteBuffer.Length)) > 0)
        {
            var i = 0;
            // Unrolled pass: inspect 4 bytes per iteration once the EOL is known.
            for (; i <= bytesRead - BytesAtTheTime; i += BytesAtTheTime)
            {
                currentChar = (char)byteBuffer[i];

                if (detectedEOL != NULL)
                {
                    if (currentChar == detectedEOL) { lineCount++; }

                    currentChar = (char)byteBuffer[i + 1];
                    if (currentChar == detectedEOL) { lineCount++; }

                    currentChar = (char)byteBuffer[i + 2];
                    if (currentChar == detectedEOL) { lineCount++; }

                    currentChar = (char)byteBuffer[i + 3];
                    if (currentChar == detectedEOL) { lineCount++; }
                }
                else
                {
                    if (currentChar == LF || currentChar == CR)
                    {
                        detectedEOL = currentChar;
                        lineCount++;
                    }
                    // Terminator not known yet: step back so the next unrolled
                    // pass effectively advances by only one byte.
                    i -= BytesAtTheTime - 1;
                }
            }

            // Handle the remaining tail bytes that don't fill a full unrolled step.
            for (; i < bytesRead; i++)
            {
                currentChar = (char)byteBuffer[i];

                if (detectedEOL != NULL)
                {
                    if (currentChar == detectedEOL) { lineCount++; }
                }
                else
                {
                    if (currentChar == LF || currentChar == CR)
                    {
                        detectedEOL = currentChar;
                        lineCount++;
                    }
                }
            }
        }

        // Count a final line that has no trailing newline terminator.
        if (currentChar != LF && currentChar != CR && currentChar != NULL)
        {
            lineCount++;
        }
        return lineCount;
    }

[screenshot: timings of the four methods from the first run]

The results show great progress, but I had hoped to get down to 20 minutes. I would like to test these on my stronger server to see the effect of having more CPUs.

The second run returned: 23 min, 25 min, 22 min, 29 min,

meaning that the methods don't really make any difference. (I was not able to take a screenshot because I removed the pause and the program continued by clearing the screen.)

  • What do you consider to be a "row" in a text file? – addohm Dec 02 '18 at 02:30
  • If a row is a line, you could count the `\n`s with LINQ I think. – Furkan Kambay Dec 02 '18 at 02:32
  • Also, if you create those files, how about a separate file to store the line count, as it's not going to change between method calls? (Caching the number, basically.) – Furkan Kambay Dec 02 '18 at 02:34
  • Calling File.ReadLines("filename").Count() may result in an overflow if the row count is higher than int.MaxValue. – Mateus Schneiders Dec 02 '18 at 02:36
  • As for threading: get the thread count, split the enumerable accordingly (or cap the thread count if it's too high) and run on each. But of course, this should be done just once; every time you add to that file, you should increment the line count kept in a separate file (see the sketch after these comments). – Furkan Kambay Dec 02 '18 at 02:38
  • Unfortunately, this question was marked as a duplicate of something it is not really a duplicate of. But you can find the solution here: https://stackoverflow.com/questions/119559/determine-the-number-of-lines-within-a-text-file; skip the accepted answer and scroll down to the answer that points to the Nima Ara article. – Antonín Lejsek Dec 02 '18 at 04:08
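
A tiny sketch of the caching idea from the comments above; the .linecount sidecar file and the OnLinesAppended helper are made up for illustration:

string countPath = currentPath + ".linecount";

long cachedCount;
if (File.Exists(countPath))
{
    // Reuse the cached count instead of rescanning 100+ GB.
    cachedCount = long.Parse(File.ReadAllText(countPath));
}
else
{
    // Pay the full scan cost once, then persist the result.
    using (var fs = File.OpenRead(currentPath))
        cachedCount = CountLinesMaybe(fs);
    File.WriteAllText(countPath, cachedCount.ToString());
}

// Call this whenever n lines are appended to the data file.
void OnLinesAppended(long n)
{
    cachedCount += n;
    File.WriteAllText(countPath, cachedCount.ToString());
}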

2 Answers

2

A ReadByte-based approach (comparing each byte with the newline character) might be faster than ReadLine. For example, for a file that is close to a GB:

stopwatch = System.Diagnostics.Stopwatch.StartNew();
uint count = 0;
int query = (int)Convert.ToByte('\n');
using (var stream = File.OpenRead(filepath))
{
    int current;
    do
    {
        current = stream.ReadByte();
        if (current == query)
        {
            count++;
            continue;
        }
    } while (current != -1);
}
Console.WriteLine($"Using ReadByte,Time : {stopwatch.Elapsed.TotalMilliseconds},Count: {r}");

Using ReadByte,Time : 8174.5661,Count: 7555107

stopwatch = System.Diagnostics.Stopwatch.StartNew();
uint loadCount2 = 0;
foreach (var line in File.ReadLines(filepath))
{
    loadCount2++;
}
Console.WriteLine($"Using ReadLines, Time : {stopwatch.Elapsed.TotalMilliseconds},Count: {r}");

Using ReadLines, Time : 27303.835,Count: 7555107

Anu Viswan
  • Ignoring the completely wrong comment by @theWongfonSemicolon, this is actually a reasonable solution as long as the OP can guarantee that the files are not saved as UTF-16 (Unicode); ASCII or UTF-8 are fine. Similar to the one shown in https://stackoverflow.com/a/50508830/477420 (linked as duplicate), which is more elaborate with the same drawbacks. – Alexei Levenkov Dec 02 '18 at 04:53
1

When you start working with big data, you need a more powerful computing system to make things run faster. If you want speed, increase RAM so it can hold the entire data set in memory, and add an NVMe SSD and store the data file on it for faster read performance.

Software-wise, just read the file in large chunks and loop over the buffer, checking each byte and counting the newline characters. You aren't doing any processing on a text line (adding or removing characters, checking for patterns, etc.), and ReadLine has too much overhead in the creation of the data structures it builds to hold the lines on the fly.

You don't need that overhead, just one large fixed-size buffer that is allocated once, filled with data, and iterated over looking for newlines. Write it in C for faster processing, too.
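
A minimal C# sketch of that chunked-buffer approach, assuming ASCII or UTF-8 text so '\n' is one byte (the 4 MB buffer size and the path variable are arbitrary choices):

long lineCount = 0;
var buffer = new byte[4 * 1024 * 1024]; // one fixed-size buffer, allocated once

// bufferSize of 1 effectively disables FileStream's own internal buffering,
// and SequentialScan hints the OS to read ahead aggressively.
using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                               FileShare.Read, 1, FileOptions.SequentialScan))
{
    int read;
    while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
        for (int i = 0; i < read; i++)
            if (buffer[i] == (byte)'\n')
                lineCount++;
}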

Kerry Kobashi