4

The easiest approach to counting the number of lines in a file is something like this:

int ch;
long lines = 0;

while (!feof(fp))
{
  ch = fgetc(fp);
  if (ch == '\n')
  {
    lines++;
  }
}

But now the requirement is that I have to count the number of lines in very large files, and reading them character by character like this has a performance impact.

Is there a better approach?

import this

4 Answers

6

For fastest I/O, you usually want to read/write in multiples of the block size of your filesystem/OS.

You can query the block size by calling statfs or fstatfs on your file or file descriptor (read the man pages).

The struct statfs has a field f_bsize and sometimes also f_iosize:

optimal transfer block size

The f_bsize field exists on all POSIX systems, AFAIK. On Mac OS X and iOS, there's also f_iosize, which is the one you'd prefer on these platforms (but f_bsize works on Mac OS X/iOS as well and should usually be the same as f_iosize, IIRC).

/* fstatfs is declared in <sys/vfs.h> on Linux, and in
 * <sys/param.h> plus <sys/mount.h> on BSD/Mac OS X. */
struct statfs fsInfo = {0};
int fd = fileno(fp); // Get file descriptor from FILE*.
long optimalSize;

if (fstatfs(fd, &fsInfo) == -1) {
    // Querying failed! Fall back to a sane value, for example 8kB or 4MB.
    optimalSize = 4 * 1024 * 1024;
} else {
    optimalSize = fsInfo.f_bsize;
}

Now allocate a buffer of that size and read blocks of that size (using read or fread). Then iterate over the in-memory block and count the number of newlines. Repeat until EOF.
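A sketch of that read loop, as a hypothetical helper (`count_lines_buffered` is my own name; the block size would normally come from fstatfs as described above):

```c
#include <stdio.h>
#include <stdlib.h>

/* Count '\n' bytes by reading the file in large blocks.
 * block_size should ideally be the filesystem's optimal
 * I/O size queried via fstatfs. Returns -1 on allocation
 * failure. */
static long count_lines_buffered(FILE *fp, size_t block_size)
{
    char *buf = malloc(block_size);
    if (buf == NULL)
        return -1;

    long lines = 0;
    size_t n;
    while ((n = fread(buf, 1, block_size, fp)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (buf[i] == '\n')
                lines++;
        }
    }

    free(buf);
    return lines;
}
```

With a 4 MB block size this turns a gigabyte-sized file into a few hundred fread calls instead of a billion fgetc calls.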

A different approach is the one @Ioan proposed: use mmap to map the file into memory and iterate that buffer. This probably gives you optimal performance as the kernel can read the data in the most efficient way, but this might fail for files that are "too large" while the approach I've described above always works with files of arbitrary size and gives you near-optimal performance.
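For completeness, the mmap variant could look roughly like this (a sketch; `count_lines_mmap` is a hypothetical helper, and error handling is minimal):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count '\n' bytes by mapping the whole file into memory.
 * Returns -1 on failure (e.g. the file is too large to map). */
static long count_lines_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {
        close(fd);
        return 0;
    }

    char *data = mmap(NULL, (size_t)st.st_size, PROT_READ,
                      MAP_PRIVATE, fd, 0);
    close(fd); /* The mapping stays valid after close. */
    if (data == MAP_FAILED)
        return -1;

    long lines = 0;
    const char *p = data;
    const char *end = data + st.st_size;
    while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        lines++;
        p++;
    }

    munmap(data, (size_t)st.st_size);
    return lines;
}
```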

DarkDust
  • The specifics of `statfs` is platform-specific. Maybe the OP targets MS Windows, rather than some POSIX-ish environment. – Tadeusz A. Kadłubowski Jun 25 '14 at 12:06
  • @TadeuszA.Kadłubowski: Question is flagged as `c` only but uses `fgetc` and `feof`, so I'm assuming he's on a POSIX system. But even on Windows the concept holds true: you don't want to do a large amount of small reads (very, very slow as the overhead is very large), instead you want to do as few reads of the most optimal size possible. I'm pretty sure there's a Windows way to query the preferred I/O size. – DarkDust Jun 25 '14 at 12:08
  • perhaps mention that taking a 2**20 Byte buffer is pretty much guaranteed to be a multiple of the blocksize. Perhaps take two of them, one for an asynchronous read and the other one for processing now. – Deduplicator Jun 25 '14 at 12:09
  • @Deduplicator: The point is not to _guess_, but to _ask the kernel_. For example, I've seen 4MB block sizes on iPads so your 1MB buffer would already be too small. – DarkDust Jun 25 '14 at 12:15
  • @TadeuszA.Kadłubowski i only want to use it on POSIX environments –  Jun 25 '14 at 12:15
3

"Is there a better approach?"

It is not a good idea to use !feof(fp) as a terminating condition. You are better off with

while ((ch = fgetc(fp)) != EOF)

and check for newlines inside the loop (as mentioned, take into account all the possible newline characters).

More here: http://faq.cprogramming.com/cgi-bin/smartfaq.cgi?answer=1046476070&id=1043284351
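Put together, the corrected loop might look like this (a sketch; `count_lines_fgetc` is my own name — note that `ch` must be an `int`, not a `char`, so it can hold `EOF`):

```c
#include <stdio.h>

/* Count newline characters with the idiomatic fgetc/EOF loop. */
static long count_lines_fgetc(FILE *fp)
{
    long lines = 0;
    int ch; /* int, not char: EOF does not fit in a char */

    while ((ch = fgetc(fp)) != EOF) {
        if (ch == '\n')
            lines++;
    }
    return lines;
}
```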

zbs
1

I would recommend trying memory-mapped I/O, which lets the OS optimize the disk I/O (likely your biggest bottleneck) while you simply count the lines. Also consider that a line may be terminated by any one of 4 possibilities: \r, \n, \r\n, or end-of-file.
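A counter handling all four of those cases could be sketched like this (`count_lines_any_ending` is a hypothetical name; the OP says their files only use \n, so this is only needed for the general case):

```c
#include <stdio.h>

/* Count lines, treating "\r", "\n", and "\r\n" each as one
 * line ending; a non-empty final line with no terminator
 * also counts as a line. */
static long count_lines_any_ending(FILE *fp)
{
    long lines = 0;
    int prev = EOF;
    int ch;
    int pending = 0; /* bytes seen since the last line ending */

    while ((ch = fgetc(fp)) != EOF) {
        if (ch == '\n') {
            if (prev != '\r') /* "\r\n" was counted at the '\r' */
                lines++;
            pending = 0;
        } else if (ch == '\r') {
            lines++;          /* count now; a following '\n' is skipped */
            pending = 0;
        } else {
            pending = 1;
        }
        prev = ch;
    }
    if (pending)
        lines++;              /* final line without a terminator */
    return lines;
}
```

The file should be opened in binary mode ("rb") for this, so the C library does not translate line endings before the counter sees them.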

Ioan
0

Unless the file contains a header with metadata such as a line count, finding this number has linear complexity. Also keep in mind that "\n" is not a universal newline character.

jacekmigacz
  • But these file formats are already known and created by one of our applications, so I know "\n" is the newline character there. –  Jun 25 '14 at 12:02
  • It's pretty obvious that the *complexity* is linear for any algorithm (for this problem). But specifics of performance goes well beyond the complexity of the algorithm used. You haven't really answered the question. – P.P Jun 25 '14 at 12:05
  • Depends on how the file was opened: text or binary mode. – Deduplicator Jun 25 '14 at 12:06