30

I tried unxutils' wc -l but it crashed on 1 GB files. Then I tried this C# code:

long count = 0;
using (StreamReader r = new StreamReader(f))
{
    string line;
    while ((line = r.ReadLine()) != null)
    {
        count++;
    }
}

return count;

It reads a 500 MB file in 4 seconds. Next I tried counting newline bytes in fixed-size chunks:

var size = 256;
var bytes = new byte[size];
var count = 0;
byte query = Convert.ToByte('\n');
using (var stream = File.OpenRead(file))
{
    int many;
    do
    {
        many = stream.Read(bytes, 0, size);
        // only scan the bytes actually read; a short read can occur
        // before the end of the stream, and the final chunk may leave
        // stale data in the buffer past index many - 1
        count += bytes.Take(many).Count(a => a == query);
    } while (many > 0);
}

That reads the same file in 10 seconds. Reading one byte at a time:

var count = 0;
int query = (int)Convert.ToByte('\n');
using (var stream = File.OpenRead(file))
{
    int current;
    do
    {
        current = stream.ReadByte();
        if (current == query)
        {
            count++;
        }
    } while (current != -1);
}

That takes 7 seconds.

Is anything faster I haven't tried yet?

Jader Dias

6 Answers

13

File.ReadLines was introduced in .NET 4.0:

var count = File.ReadLines(file).Count();

It runs in 4 seconds, the same time as the first code snippet.

Jader Dias
  • That's because it basically does the same thing as your first snippet ;) – SirViver May 23 '11 at 18:50
  • never use Count(), use Length (File.ReadAllLines(@"yourfile").Length;) // check this solution again , but using Length – cnd May 23 '11 at 18:53
  • 7
    @nCdy: This is a really bad suggestion (in this case)! Note the difference: he's using `File.ReadLines()` which actually returns a `IEnumerable` and just does a `yield return` of what basically his first snippet does. `File.ReadAllLines()` would read __all lines__ into memory, which would be horrible performance wise. That said, of course, if you *do* already have an array you should use `Length` instead of `Count()` ;) – SirViver May 23 '11 at 18:58
  • @SirViver agreed. He doesn't need to load all the lines if he isn't going to use them. – cnd May 23 '11 at 19:02
  • 1
    @nCdy as SirViver said `Exception of type 'System.OutOfMemoryException' was thrown.` – Jader Dias May 23 '11 at 19:04
  • Do Count and Length only return ints? – Tim Barrass Feb 14 '14 at 11:49
12

Your first approach does look like the optimal solution already. Keep in mind that you're mostly not CPU bound but limited by the HD's read speed, which at 500MB / 4sec = 125MB/s is already quite fast. The only way to get faster than that is via RAID or using SSDs, not so much via a better algorithm.

SirViver
  • I also figured out that I can estimate the number of lines by getting the file size and dividing it by the average size of the first lines. – Jader Dias May 23 '11 at 18:54
  • @JaderDias: True, but then you only have an estimation, not an actual count. And depending on how the file is structured your estimate could end up being *far* off. You didn't specify what the purpose of the line counting is or what the files typically look like, so more specialized advice cannot really be given. – SirViver May 23 '11 at 19:01
  • For my CSV file the estimate is accurate enough – Jader Dias May 23 '11 at 19:14
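The estimate described in these comments can be sketched in C; the function name and the 100-line sample size are illustrative choices here, not from the thread:

```c
#include <stdio.h>
#include <string.h>

/* Estimate a line count by sampling the first SAMPLE lines and
 * dividing the file size by their average length. Accurate only
 * when line lengths are fairly uniform, as with many CSV files. */
long estimate_lines(const char *path)
{
    enum { SAMPLE = 100, BUFSZ = 4096 };
    char buf[BUFSZ];
    long sampled = 0, bytes = 0;
    FILE *fp = fopen(path, "rb");

    if (!fp)
        return -1;
    while (sampled < SAMPLE && fgets(buf, BUFSZ, fp)) {
        bytes += (long) strlen(buf);
        sampled++;
    }
    if (sampled == 0 || bytes == 0) {
        fclose(fp);
        return 0;
    }
    fseek(fp, 0, SEEK_END);            /* total file size */
    long size = ftell(fp);
    fclose(fp);

    double avg = (double) bytes / sampled;
    return (long) (size / avg);        /* size / average line length */
}
```

Unlike the full scans above, this only touches the first few kilobytes of the file, which is why it is effectively instant even on a 1 GB file.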
2

Are you just looking for a tool to count the lines in a file efficiently? If so, try MS LogParser.

Something like the following will give you the number of lines:

LogParser "SELECT count(*) FROM file" -i:TEXTLINE
manojlds
2

If you really want fast, consider C code.

If this is a command-line utility, it will be faster because it won't have to initialize the CLR or .NET. And, it won't reallocate a new string for each line read from the file, which probably saves time on throughput.

I don't have any files that large, so I cannot compare. You can try, though:

/*
 * LineCount.c
 *
 * count lines...
 *
 * compile with: 
 *
 *  c:\vc10\bin\cl.exe /O2 -Ic:\vc10\Include -I\winsdk\Include 
 *          LineCount.c -link /debug /SUBSYSTEM:CONSOLE /LIBPATH:c:\vc10\Lib
 *          /LIBPATH:\winsdk\Lib /out:LineCount.exe
 */

#include <stdio.h>
#include <string.h>
#include <stdlib.h>


void Usage(char *appname)
{
    printf("\nLineCount.exe\n");
    printf("  count lines in a text file...\n\n");
    printf("usage:\n");
    printf("  %s <filename>\n\n", appname);
}



int linecnt(char *file)
{
    int sz = 2048;
    char *buf = (char *) malloc(sz);
    FILE *fp = NULL;
    int n= 0;
    errno_t rc = fopen_s(&fp, file, "r");

    if (rc) {
        fprintf(stderr, "%s: fopen(%s) failed: ecode(%d)\n",
                __FILE__, file, rc);
        return -1;
    }

    while (fgets(buf, sz, fp)){
        size_t r = strlen(buf);
        if (r > 0 && buf[r-1] == '\n')
            n++;
        // could re-alloc here to handle larger lines;
        // note: a final line without a trailing '\n' is not counted
    }
    fclose(fp);
    return n;
}

int main(int argc, char **argv)
{
    if (argc==2) {
        int n = linecnt (argv[1]);
        printf("Lines: %d\n", n);
    }
    else {
        Usage(argv[0]);
        exit(1);
    }
}
Cheeso
  • probably faster, but I bet the difference is less than 10% – Jader Dias May 23 '11 at 19:00
  • Try and see. It would be interesting to know. – Cheeso May 23 '11 at 19:02
  • 10 seconds =( running on VS2010 on debug as all the other tests – Jader Dias May 23 '11 at 19:21
  • Very surprising. I suspect something else is amiss. – Cheeso May 23 '11 at 19:23
  • 16
    @Jader: **Whoa** hold on a minute there, **you are running performance tests in debug mode?** Never ever ever do that. You can get completely misleading results. The debugger deliberately de-optimizes your program to improve the debugging experience. In this case it's probably not an issue since you are disk bound, not processor bound, but still, it is a *terrible* programming practice to measure perf in a debugger. – Eric Lippert May 23 '11 at 19:43
  • @Eric I was expecting someone saying something like that, but I ran all my tests in `Release` mode and nothing changed if you round to seconds (4 seconds in debug mode is still 4 seconds in release mode) – Jader Dias May 23 '11 at 20:32
  • 1
    @Jader: Like I said, that's because you got lucky and happened to choose a performance problem that is bound by the speed of the disk hardware. When you try to optimize something that is bound by the speed of the actual code that's a completely different story. – Eric Lippert May 23 '11 at 20:34
1

I think your first approach looks good. The only thing I would add is to experiment with the buffer size; it can change the performance noticeably.

For a discussion of buffer sizes, see: Optimum file buffer read size?

istudy0
  • I tried different values and anything above 256 will have the same performance, while lower values like 4 are slower. – Jader Dias May 23 '11 at 18:50
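One way to run that experiment is a counter that takes the buffer size as a parameter, so different sizes can be timed against the same file. A minimal C sketch; the function name is made up here:

```c
#include <stdio.h>
#include <stdlib.h>

/* Count '\n' bytes in a file using a caller-chosen buffer size. */
long count_lines_buffered(const char *path, size_t bufsize)
{
    FILE *fp = fopen(path, "rb");
    unsigned char *buf;
    long count = 0;
    size_t got, i;

    if (!fp)
        return -1;
    buf = malloc(bufsize);
    if (!buf) {
        fclose(fp);
        return -1;
    }
    /* read chunk by chunk; a short final read is handled because
     * only the first `got` bytes of the buffer are scanned */
    while ((got = fread(buf, 1, bufsize, fp)) > 0)
        for (i = 0; i < got; i++)
            if (buf[i] == '\n')
                count++;
    free(buf);
    fclose(fp);
    return count;
}
```

Timing this with bufsize = 4, 256, 4096, and so on reproduces the effect reported in the comment above: tiny buffers are slow, and past a few hundred bytes the run becomes disk-bound.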
1

Have you tried flex?

%{
long num_lines = 0;
%}
%option 8bit outfile="scanner.c"
%option nounput nomain noyywrap
%option warn

%%
.+ { }
\n { ++num_lines; }
%%
int main(int argc, char **argv)
{
    yylex();
    printf( "# of lines = %ld\n", num_lines );
    return 0;
}

Just compile with:

flex -Cf scanner.l 
gcc -O -o lineCount.exe scanner.c

It accepts input on stdin and outputs the number of lines.

Spencer Rathbun