30

I tried unxutils' wc -l but it crashed on 1 GB files. Then I tried this C# code:

long count = 0;
using (StreamReader r = new StreamReader(f))
{
    string line;
    while ((line = r.ReadLine()) != null)
    {
        count++;
    }
}

return count;

It reads a 500 MB file in 4 seconds. Next I tried counting newline bytes in fixed-size chunks:

var size = 256;
var bytes = new byte[size];
var count = 0;
byte query = Convert.ToByte('\n');
using (var stream = File.OpenRead(file))
{
    int many;
    do
    {
        many = stream.Read(bytes, 0, size);
        // only scan the bytes actually read; a short read can occur
        // before the end of the stream, and the final chunk may leave
        // stale data in the buffer past index many - 1
        count += bytes.Take(many).Count(a => a == query);
    } while (many > 0);
}

That reads the same file in 10 seconds. Reading one byte at a time:

var count = 0;
int query = (int)Convert.ToByte('\n');
using (var stream = File.OpenRead(file))
{
    int current;
    do
    {
        current = stream.ReadByte();
        if (current == query)
        {
            count++;
        }
    } while (current != -1);
}

That takes 7 seconds.

Is anything faster I haven't tried yet?

Jader Dias

6 Answers

13

File.ReadLines was introduced in .NET 4.0:

var count = File.ReadLines(file).Count();

It runs in 4 seconds, the same time as the first code snippet.

Jader Dias
  • That's because it basically does the same thing as your first snippet ;) – SirViver May 23 '11 at 18:50
  • never use Count(), use Length (File.ReadAllLines(@"yourfile").Length;) // check this solution again , but using Length – cnd May 23 '11 at 18:53
  • 7
    @nCdy: This is a really bad suggestion (in this case)! Note the difference: he's using `File.ReadLines()` which actually returns a `IEnumerable` and just does a `yield return` of what basically his first snippet does. `File.ReadAllLines()` would read __all lines__ into memory, which would be horrible performance wise. That said, of course, if you *do* already have an array you should use `Length` instead of `Count()` ;) – SirViver May 23 '11 at 18:58
  • @SirViver agreed. He doesn't need to load all the lines if he isn't going to use them. – cnd May 23 '11 at 19:02
  • 1
    @nCdy as SirViver said `Exception of type 'System.OutOfMemoryException' was thrown.` – Jader Dias May 23 '11 at 19:04
  • Do Count and Length only return ints? – Tim Barrass Feb 14 '14 at 11:49
12

Your first approach does look like the optimal solution already. Keep in mind that you're mostly not CPU bound but limited by the HD's read speed, which at 500MB / 4sec = 125MB/s is already quite fast. The only way to get faster than that is via RAID or using SSDs, not so much via a better algorithm.

SirViver
  • I also figured out that I can estimate the number of lines by getting the file size and dividing it by the average size of the first lines. – Jader Dias May 23 '11 at 18:54
  • @JaderDias: True, but then you only have an estimation, not an actual count. And depending on how the file is structured your estimate could end up being *far* off. You didn't specify what the purpose of the line counting is or what the files typically look like, so more specialized advice cannot really be given. – SirViver May 23 '11 at 19:01
  • For my CSV file the estimate is accurate enough – Jader Dias May 23 '11 at 19:14
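The estimate described in these comments can be sketched in C; the function name and the 100-line sample size are illustrative choices here, not from the thread:

```c
#include <stdio.h>
#include <string.h>

/* Estimate a line count by sampling the first SAMPLE lines and
 * dividing the file size by their average length. Accurate only
 * when line lengths are fairly uniform, as with many CSV files. */
long estimate_lines(const char *path)
{
    enum { SAMPLE = 100, BUFSZ = 4096 };
    char buf[BUFSZ];
    long sampled = 0, bytes = 0;
    FILE *fp = fopen(path, "rb");

    if (!fp)
        return -1;
    while (sampled < SAMPLE && fgets(buf, BUFSZ, fp)) {
        bytes += (long) strlen(buf);
        sampled++;
    }
    if (sampled == 0 || bytes == 0) {
        fclose(fp);
        return 0;
    }
    fseek(fp, 0, SEEK_END);            /* total file size */
    long size = ftell(fp);
    fclose(fp);

    double avg = (double) bytes / sampled;
    return (long) (size / avg);        /* size / average line length */
}
```

Unlike the full scans above, this only touches the first few kilobytes of the file, which is why it is effectively instant even on a 1 GB file.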
2

Are you just looking for a tool to count the lines in a file efficiently? If so, try MS LogParser.

Something like the following will give you the number of lines:

LogParser "SELECT count(*) FROM file" -i:TEXTLINE
manojlds
2

If you really want fast, consider C code.

If this is a command-line utility, it will be faster because it won't have to initialize the CLR or .NET. And, it won't reallocate a new string for each line read from the file, which probably saves time on throughput.

I don't have any files that large, so I cannot compare. You can try, though:

/*
 * LineCount.c
 *
 * count lines...
 *
 * compile with: 
 *
 *  c:\vc10\bin\cl.exe /O2 -Ic:\vc10\Include -I\winsdk\Include 
 *          LineCount.c -link /debug /SUBSYSTEM:CONSOLE /LIBPATH:c:\vc10\Lib
 *          /LIBPATH:\winsdk\Lib /out:LineCount.exe
 */

#include <stdio.h>
#include <string.h>
#include <stdlib.h>


void Usage(char *appname)
{
    printf("\nLineCount.exe\n");
    printf("  count lines in a text file...\n\n");
    printf("usage:\n");
    printf("  %s <filename>\n\n", appname);
}



int linecnt(char *file)
{
    int sz = 2048;
    char *buf = (char *) malloc(sz);
    FILE *fp = NULL;
    int n= 0;
    errno_t rc = fopen_s(&fp, file, "r");

    if (rc) {
        fprintf(stderr, "%s: fopen(%s) failed: ecode(%d)\n",
                __FILE__, file, rc);
        return -1;
    }

    while (fgets(buf, sz, fp)){
        size_t r = strlen(buf);
        if (r > 0 && buf[r-1] == '\n')
            n++;
        // could re-alloc here to handle larger lines;
        // note: a final line without a trailing '\n' is not counted
    }
    fclose(fp);
    return n;
}

int main(int argc, char **argv)
{
    if (argc==2) {
        int n = linecnt (argv[1]);
        printf("Lines: %d\n", n);
    }
    else {
        Usage(argv[0]);
        exit(1);
    }
}
Cheeso
  • probably faster, but I bet the difference is less than 10% – Jader Dias May 23 '11 at 19:00
  • Try and see. It would be interesting to know. – Cheeso May 23 '11 at 19:02
  • 10 seconds =( running on VS2010 on debug as all the other tests – Jader Dias May 23 '11 at 19:21
  • Very surprising. I suspect something else is amiss. – Cheeso May 23 '11 at 19:23
  • 16
    @Jader: **Whoa** hold on a minute there, **you are running performance tests in debug mode?** Never ever ever do that. You can get completely misleading results. The debugger deliberately de-optimizes your program to improve the debugging experience. In this case it's probably not an issue since you are disk bound, not processor bound, but still, it is a *terrible* programming practice to measure perf in a debugger. – Eric Lippert May 23 '11 at 19:43
  • @Eric I was expecting someone saying something like that, but I ran all my tests in `Release` mode and nothing changed if you round to seconds (4 seconds in debug mode is still 4 seconds in release mode) – Jader Dias May 23 '11 at 20:32
  • 1
    @Jader: Like I said, that's because you got lucky and happened to choose a performance problem that is bound by the speed of the disk hardware. When you try to optimize something that is bound by the speed of the actual code that's a completely different story. – Eric Lippert May 23 '11 at 20:34
1

I think your first approach looks good. The only thing I would add is to experiment with the buffer size; it can change the performance noticeably.

For a discussion of buffer sizes, see: Optimum file buffer read size?

istudy0
  • I tried different values and anything above 256 will have the same performance, while lower values like 4 are slower. – Jader Dias May 23 '11 at 18:50
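One way to run that experiment is a counter that takes the buffer size as a parameter, so different sizes can be timed against the same file. A minimal C sketch; the function name is made up here:

```c
#include <stdio.h>
#include <stdlib.h>

/* Count '\n' bytes in a file using a caller-chosen buffer size. */
long count_lines_buffered(const char *path, size_t bufsize)
{
    FILE *fp = fopen(path, "rb");
    unsigned char *buf;
    long count = 0;
    size_t got, i;

    if (!fp)
        return -1;
    buf = malloc(bufsize);
    if (!buf) {
        fclose(fp);
        return -1;
    }
    /* read chunk by chunk; a short final read is handled because
     * only the first `got` bytes of the buffer are scanned */
    while ((got = fread(buf, 1, bufsize, fp)) > 0)
        for (i = 0; i < got; i++)
            if (buf[i] == '\n')
                count++;
    free(buf);
    fclose(fp);
    return count;
}
```

Timing this with bufsize = 4, 256, 4096, and so on reproduces the effect reported in the comment above: tiny buffers are slow, and past a few hundred bytes the run becomes disk-bound.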
1

Have you tried flex?

%{
long num_lines = 0;
%}
%option 8bit outfile="scanner.c"
%option nounput nomain noyywrap
%option warn

%%
.+ { }
\n { ++num_lines; }
%%
int main(int argc, char **argv)
{
    yylex();
    printf( "# of lines = %ld\n", num_lines );
    return 0;
}

Just compile with:

flex -Cf scanner.l 
gcc -O -o lineCount.exe scanner.c

It accepts input on stdin and outputs the number of lines.

Spencer Rathbun