10

I am trying to count all the lines in a txt file, using a StreamReader:

public int countLines(string path)
{
    var watch = System.Diagnostics.Stopwatch.StartNew();
    int nlines = 0;
    string line;
    using (StreamReader file = new StreamReader(path))
    {
        while ((line = file.ReadLine()) != null)
        {
            nlines++;
        }
    }
    watch.Stop();
    var elapsedMs = watch.ElapsedMilliseconds;
    Console.Write(elapsedMs);
    // elapsedMs = 3520  --- tested with a 1.2 million-line txt
    return nlines;
}

Is there a more efficient way to count the number of lines?

MaYaN
Brayan Henao
  • This already is the best method. How long does it take? – Camo Apr 01 '16 at 23:30
  • Around 5 sec counting 1.2 Million lines – Brayan Henao Apr 01 '16 at 23:31
  • It *might* be faster (by a constant) to perform a per-byte search (loop on Read); this could avoid intermediate string creation. – user2864740 Apr 01 '16 at 23:31
  • To avoid allocating and then throwing away a whole bunch of strings, it might be more efficient to call `file.Read()` and count the number of carriage-return and/or linefeed characters (see the sketch after these comments). – Michael Liu Apr 01 '16 at 23:31
  • If you don't need the file contents (other than the number of lines) you could remove the `string line` variable and just do `while (file.ReadLine() != null) nlines++;` – derpirscher Apr 01 '16 at 23:32
  • @derpirscher While perhaps clearer in intent, it will have absolutely no bearing on the final speed. – user2864740 Apr 01 '16 at 23:35
  • Your code essentially counts the number of times that a pointer advances to the next 0x0A (since that also covers 0x0D and 0x0A combinations). Run it as a raw pointer advancement while incrementing count and see if that improves efficiency over the StreamReader overhead. I'm rusty on this so I'm inviting review. –  Apr 01 '16 at 23:47
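As a rough illustration of the comment suggestion above (this sketch is not from the question; the class and method names and the buffer size are arbitrary), one can count `'\n'` characters read through a StreamReader so that the file's encoding is still honoured but no per-line strings are allocated:

using System.IO;

static class LineCounterSketch
{
    // Counts '\n' characters read through a StreamReader, avoiding the
    // per-line string allocations of ReadLine(). A final line without a
    // trailing newline is not counted by this approach.
    public static long CountLinesByChar(string path)
    {
        long count = 0;
        char[] buffer = new char[64 * 1024];

        using (var reader = new StreamReader(path))
        {
            int charsRead;
            while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < charsRead; i++)
                    if (buffer[i] == '\n')
                        count++;
            }
        }
        return count;
    }
}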

2 Answers

10

I'm just thinking out loud here, but chances are performance is I/O bound and not CPU bound. In any case, I'm wondering if interpreting the file as text may be slowing things down as it will have to convert between the file's encoding and string's native encoding. If you know the encoding is ASCII or compatible with ASCII, you might be able to get away with just counting the number of times a byte with the value 10 appears (which is the character code for a linefeed).

What if you had the following:

long lineCount = 0;
byte[] buffer = new byte[1024 * 1024];
int bytesRead;

using (var fs = new FileStream("path.txt", FileMode.Open, FileAccess.Read, FileShare.None, 1024 * 1024))
{
    do
    {
        bytesRead = fs.Read(buffer, 0, buffer.Length);
        for (int i = 0; i < bytesRead; i++)
            if (buffer[i] == '\n')   // 0x0A, the linefeed byte
                lineCount++;
    }
    while (bytesRead > 0);
}

My benchmark results for a 1.5 GB text file, timed 10 times and averaged:

  • StreamReader approach, 4.69 seconds
  • File.ReadLines().Count() approach, 4.54 seconds
  • FileStream approach, 1.46 seconds
dreamlax
  • I'm not sure what you did to be downvoted, to me that looks like it would count lines without loading the whole file in memory. (Though it might be more efficient to read a buffer of bytes instead of one byte at a time.) – zneak Apr 01 '16 at 23:44
  • @zneak: I thought wrapping the `FileStream` inside a `BufferedStream` would help with buffering, but I honestly don't know enough about `BufferedStream` objects to know whether it is helping or hindering in this case. EDIT: it turns out, `FileStream` objects are already buffered, so using a `BufferedStream` is unnecessary in this case. – dreamlax Apr 01 '16 at 23:50
  • You can't use `char` as a variable name as it is a C# reserved keyword; perhaps `@byte` or `currentByte` would be better? You may also find reading a buffer of bytes to be more performant. Either way, +1 for avoiding those unnecessary `string` allocations. – Lukazoid Apr 01 '16 at 23:59
  • @Lukazoid: I just typed the code in without even testing it, but yes, `char` is likely an unusable variable name, I'll fix that up! :) – dreamlax Apr 02 '16 at 00:03
  • @dreamlax you are assuming the new line is represented as `LF (\n) (10)` or `CRLF (\r\n)`. What about in the case where the new line is defined as carriage return `\r (13)`? –  Apr 02 '16 at 00:09
  • @dreamlax I suppose you could also replace the `10` with `\n` for clarity but that's just a personal preference. – Lukazoid Apr 02 '16 at 00:09
  • @Lukazoid: I was on the fence about whether to use `10` or `'\n'` but I think `'\n'` looks nicer now. – dreamlax Apr 02 '16 at 00:12
  • @NimaAra: Yes, I suppose this might not work if someone is still running MacOS 9 or older. – dreamlax Apr 02 '16 at 00:13
  • I generated a file with 1M lines that all read "This is line #n". With the OP's method, it counted 1M in 124ms. With the ReadByte method, it took 220ms. I stored the comparison value with `int newline = (int)Encoding.ASCII.GetBytes(Environment.NewLine)[0];` Just wanted to toss my results out there. – Chris Fannin Apr 02 '16 at 00:27
  • @ChrisFannin: Might need to increase the size of the input. When I benchmarked with 1.5GB files the results were different for me. – dreamlax Apr 02 '16 at 00:31
  • @dreamlax - I'll give that a shot to see what happens. – Chris Fannin Apr 02 '16 at 00:32
  • @ChrisFannin: Try the updated code, and also make sure to build Release rather than Debug – dreamlax Apr 02 '16 at 00:56
  • @dreamlax - That is so odd. I generated a 1.86GB file (1M lines of 2K characters), and `StreamReader` took ~9s while `FileStream` took ~18s to count all 1M. That was with multiple runs. *shrug* – Chris Fannin Apr 02 '16 at 00:57
  • @ChrisFannin: Did you run Release build (and with the updated code)? – dreamlax Apr 02 '16 at 00:58
  • @dreamlax - I just saw your recent comment and update. That buffer might help. I'm going to give it a go! – Chris Fannin Apr 02 '16 at 00:59
  • I realised I was testing input over a network drive. I thought 30+ seconds to read 1.5GB was a bit odd. I copied the file over to a local drive and that sped things up considerably, however, the `FileStream` approach was by far still the fastest. – dreamlax Apr 02 '16 at 00:59
  • This is the txt that I'm testing: https://drive.google.com/file/d/0Bwy19LIX4H2RcUx2c1BSWUlMcTA/view and it looks like FileStream is slower than StreamReader by far; it took ~5s with StreamReader and ~9s with FileStream – Brayan Henao Apr 02 '16 at 01:04
  • I think you have introduced a bug in your last edit; have you tried outputting the line count? I am getting 0 on a 150,000,000-line file (~6 GB) – MaYaN Apr 02 '16 at 01:07
  • That comparison should be `if (buffer[i] == '\n')`, but yes, it was the fastest. OP = ~9s. FS1 = ~18s. FS2 = ~7s. – Chris Fannin Apr 02 '16 at 01:11
  • @MaYaN: Sorry, I'm typing this on a Mac and writing the code on a Windows computer, so I'm back and forth all the time (Mac has a bigger screen), it was a typo when copying the code back. I tested with `citiesTour_400.txt` and on my machine, SR is roughly 2.2sec and FS is roughly 0.70. – dreamlax Apr 02 '16 at 01:13
  • I have updated my post, I am getting a very different result :-) – MaYaN Apr 02 '16 at 01:25
  • Upvoted. With `citiesTour_400.txt`, I got SR 3.754s and FS 2.751s. I'm certain it would go faster if I built an optimized release and possibly disabled my antivirus. :-) – Chris Fannin Apr 02 '16 at 01:46
5

You already have the appropriate solution but you can simplify all your code to:

var lineCount = File.ReadLines(@"C:\MyHugeFile.txt").Count();

Benchmarks

I am not sure how dreamlax achieved his benchmark results, but here is something anyone can reproduce on their machine; just copy-paste it into LINQPad.

First let us prepare our input file:

var filePath = @"c:\MyHugeFile.txt";

for (int counter = 0; counter < 5; counter++)
{
    var lines = new string[30000000];

    for (int i = 0; i < lines.Length; i++)
    {
        lines[i] = $"This is a line with a value of: {i}";
    }

    File.AppendAllLines(filePath, lines);
}

This should produce a 150-million-line file which is roughly 6 GB.
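(Side note: as dreamlax mentions in the comments below, the same file can also be generated without holding 30,000,000 strings in memory at once by writing line by line with a `StreamWriter`. A minimal sketch, reusing the path and line text from the snippet above:)

using (var writer = new StreamWriter(@"c:\MyHugeFile.txt"))
{
    // Write 150,000,000 lines one at a time instead of building large arrays.
    for (long i = 0; i < 150000000; i++)
    {
        writer.WriteLine($"This is a line with a value of: {i}");
    }
}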

Now let us run each method:

void Main()
{
    var filePath = @"c:\MyHugeFile.txt";
    // Make sure you clear windows cache!
    UsingFileStream(filePath);

    // Make sure you clear windows cache!
    UsingStreamReaderLinq(filePath);

    // Make sure you clear windows cache!
    UsingStreamReader(filePath);
}

private void UsingFileStream(string path)
{
    var sw = Stopwatch.StartNew();
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        long lineCount = 0;
        byte[] buffer = new byte[1024 * 1024];
        int bytesRead;

        do
        {
            bytesRead = fs.Read(buffer, 0, buffer.Length);
            for (int i = 0; i < bytesRead; i++)
                if (buffer[i] == '\n')
                    lineCount++;
        }
        while (bytesRead > 0);       
        Console.WriteLine("[FileStream] - Read: {0:n0} in {1}", lineCount, sw.Elapsed);
    }
}

private void UsingStreamReaderLinq(string path)
{
    var sw = Stopwatch.StartNew();
    var lineCount = File.ReadLines(path).Count();
    Console.WriteLine("[StreamReader+LINQ] - Read: {0:n0} in {1}", lineCount, sw.Elapsed);
}

private void UsingStreamReader(string path)
{
    var sw = Stopwatch.StartNew();
    long lineCount = 0;
    string line;
    using (var file = new StreamReader(path))
    {
        while ((line = file.ReadLine()) != null) { lineCount++; }
        Console.WriteLine("[StreamReader] - Read: {0:n0} in {1}", lineCount, sw.Elapsed);
    }
}

Which results in:

[FileStream] - Read: 150,000,000 in 00:00:37.3397443

[StreamReader+LINQ] - Read: 150,000,000 in 00:00:33.8842190

[StreamReader] - Read: 150,000,000 in 00:00:34.2102178

Update

Running with optimization ON results in:

[FileStream] - Read: 150,000,000 in 00:00:18.1636374

[StreamReader+LINQ] - Read: 150,000,000 in 00:00:33.3173354

[StreamReader] - Read: 150,000,000 in 00:00:32.3530890

MaYaN
  • This will load the entire file into RAM, creating a plethora of string objects. – dreamlax Apr 01 '16 at 23:41
  • @dreamlax It won't! You are confusing `ReadLines` with `ReadAllLines`; the former returns an `IEnumerable`. Refer to: http://stackoverflow.com/questions/119559/determine-the-number-of-lines-within-a-text-file – MaYaN Apr 01 '16 at 23:42
  • Ooh good point! But it is still creating a lot of string objects. – dreamlax Apr 01 '16 at 23:44
  • A little bit slower than the StreamReader way (3619 milliseconds), but thanks anyway :) – Brayan Henao Apr 01 '16 at 23:52
  • @Brayan, benchmarking I/O is not as simple as running the code twice and comparing the results, especially when you are dealing with a disk. At a minimum you need to clear the `Windows Cached files`, then run them multiple times and take the average. You can use RAMMap to clear the cache; more info: http://stackoverflow.com/questions/478340/clear-file-cache-to-repeat-performance-testing – MaYaN Apr 02 '16 at 00:17
  • @MaYaN: When I run the debug build, the speed is roughly the same/slower, but when I run the release build, the speed is considerably different. Did you try the `citiesTour_400.txt` file? I will generate a file the same way you have and see how I go. – dreamlax Apr 02 '16 at 01:30
  • @dreamlax where do I get the `citiesTour_400.txt` from? – MaYaN Apr 02 '16 at 01:31
  • It's the file that @ChrisFannin mentioned in a comment on my answer. – dreamlax Apr 02 '16 at 01:34
  • @MaYaN https://drive.google.com/file/d/0Bwy19LIX4H2RcUx2c1BSWUlMcTA/view – Brayan Henao Apr 02 '16 at 01:35
  • @MaYaN: Even with the generated file I am still getting substantially better results with FileStream (but only of course with a Release build). With your answer I get an average of 25.38 seconds, and with my answer I get an average of 18.21 seconds – dreamlax Apr 02 '16 at 01:39
  • @MaYaN: Although, I created my 150,000,000 line file a bit differently. Rather than allocating 30,000,000 strings at once I just used a `StreamWriter` with `WriteLine`, but I verified that I had a 6+GB file with 150,000,000 lines in it still – dreamlax Apr 02 '16 at 01:40
  • @dreamlax, I just updated the results, this time with optimization `ON`, and your method was almost 2x faster :-) My only objection is the lack of support for `carriage return (\r)` (a sketch addressing this follows these comments). – MaYaN Apr 02 '16 at 01:42
  • @MaYaN: Indeed, your answer is much safer. For files with mixed line endings mine may give a different result. I deal a lot with PostScript files and it's common to see mixed line endings there (embedded files may have one line ending while the overall PostScript file may have another). I also deal a lot with Macs (particularly older ones) and from time to time I do encounter a file with `\r` line endings, but it is rare (and getting rarer). – dreamlax Apr 02 '16 at 01:53
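For completeness, here is a minimal sketch (not from either answer; the class and method names are arbitrary) of how the byte-counting approach could be extended to treat `\r\n`, `\r`, and `\n` each as a single line break, addressing the carriage-return objection above. It assumes an ASCII-compatible encoding where CR and LF are stored as the bytes 0x0D and 0x0A:

using System.IO;

static class MixedEndingCounterSketch
{
    // Counts line breaks, treating "\r\n", "\r", and "\n" each as one break.
    public static long CountLineBreaks(string path)
    {
        long count = 0;
        bool previousWasCR = false;
        byte[] buffer = new byte[1024 * 1024];

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            int bytesRead;
            while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < bytesRead; i++)
                {
                    byte b = buffer[i];
                    if (b == (byte)'\r')
                    {
                        count++;               // count the CR right away
                        previousWasCR = true;
                    }
                    else
                    {
                        if (b == (byte)'\n' && !previousWasCR)
                            count++;           // lone LF; the LF of a CRLF pair was already counted
                        previousWasCR = false;
                    }
                }
            }
        }
        return count;
    }
}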