34

I have a really simple script right now that counts lines in a text file using enumerate():

i = 0
f = open("C:/Users/guest/Desktop/file.log", "r")
for i, line in enumerate(f):
    pass
print i + 1
f.close()

This takes around three and a half minutes to go through a 15GB log file with ~30 million lines. It would be great if I could get this under two minutes, because these are daily logs and we want to do a monthly analysis, so the code will have to process 30 logs of ~15GB each (possibly more than an hour and a half in total), and we'd like to minimise the time & memory load on the server.

I would also settle for a good approximation/estimation method, but it would need to be accurate to about 4 significant figures...

Thank you!

glglgl
  • 89,107
  • 13
  • 149
  • 217
Adrienne
  • 465
  • 1
  • 4
  • 6
  • 3
    In general it would probably be faster to treat the file as binary data, read through it in reasonably-sized chunks (say, 4KB at a time), and count the `\n` characters in each chunk as you go. – aroth Mar 09 '12 at 05:11
  • 4
    This is not better performing than your naive solution, but fyi the pythonic way to write what you have here would be simply `with open(fname) as f: print sum(1 for line in f)` – wim Mar 09 '12 at 05:37
  • 1
    aroth: Thanks for the tip, I should look into that. wim: great, thanks, that's much shorter... – Adrienne Mar 09 '12 at 05:43
  • Take a look at [rawbigcount](http://stackoverflow.com/a/27517681/3420199) in Michael Bacon's answer. It may be helpful to you! – Diogo Feb 03 '16 at 13:20
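
A minimal sketch of the chunked binary count aroth suggests above; unlike the first answer below, which reads in text mode, this counts raw b"\n" bytes with no newline translation. The function name, path handling and 4KB chunk size are illustrative placeholders, not something from the thread.

def count_newlines_binary(path, chunk_size=4096):
    # Read the file as raw bytes in fixed-size chunks and count b"\n" in each.
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count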

5 Answers

47

Ignacio's answer is correct, but might fail if you have a 32-bit process.

But maybe it could be useful to read the file block-wise and then count the \n characters in each block.

def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b: break
        yield b

with open("file", "r") as f:
    print sum(bl.count("\n") for bl in blocks(f))

will do your job.

Note that I don't open the file as binary, so the \r\n will be converted to \n, making the counting more reliable.

For Python 3, and to make it more robust for reading files with all kinds of characters:

def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b: break
        yield b

with open("file", "r",encoding="utf-8",errors='ignore') as f:
    print (sum(bl.count("\n") for bl in blocks(f)))
SU3
  • 5,064
  • 3
  • 35
  • 66
glglgl
  • 89,107
  • 13
  • 149
  • 217
  • 1
    Just as one data point, a read of a large file of about 51 MB went from about a minute using the naive approach to under one second using this approach. – M Katz Dec 16 '13 at 21:23
  • 6
    @MKatz What now, "a large file" or "a file of about 51 MB"? ;-) – glglgl Mar 11 '14 at 20:27
  • this solution might miss out the last line but that might not matter for a huge file. – minhle_r7 Jul 25 '17 at 13:06
  • @ngọcminh.oss Only if the last line is incomplete. A text file is defined to end with a line break, see http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206 and https://stackoverflow.com/a/729795/296974. – glglgl Jul 25 '17 at 14:10
  • 2
    not that people care about definition. when you work with real data, everything is messy. but it doesn't matter anyway. – minhle_r7 Jul 26 '17 at 17:19
  • Re missing lines (i.e. lines that don't end in a "line break"): Relatively unimportant, I suppose, if the file is large. But I have files that vary from huge to one-liners. Unfortunately, some of the one-liners lack a trailing newline, and the program that uses this function assumes a return value of 0 means an empty file...which may not be true. So I had to do some other checking. – Mike Maxwell Oct 12 '20 at 18:12
23

I know it's a bit unfair, but you could do this:

import subprocess

int(subprocess.check_output(["wc", "-l", "C:\\alarm.bat"]).split()[0])

If you're on Windows, check out Coreutils.
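
Since the question involves ~30 daily logs for a monthly analysis, here is a hedged sketch of summing the `wc -l` counts over a whole directory. The glob pattern is a placeholder, and this assumes `wc` is on the PATH (e.g. via Coreutils on Windows):

import glob
import subprocess

# Hypothetical monthly total: one `wc -l` call per daily log, summed up.
total = 0
for path in glob.glob("C:/logs/*.log"):  # placeholder location of the daily logs
    total += int(subprocess.check_output(["wc", "-l", path]).split()[0])
print(total)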

Bram Vanroy
  • 27,032
  • 24
  • 137
  • 239
Jakob Bowyer
  • 33,878
  • 8
  • 76
  • 91
17

A fast, 1-line solution is:

sum(1 for i in open(file_path, 'rb'))

It should work on files of arbitrary size.

Yonas Kassa
  • 3,362
  • 1
  • 18
  • 27
AJSmyth
  • 419
  • 6
  • 13
  • I confirm that this is the fastest one (except the `wc -l` hack). Using the text mode gives a little dropdown in performance, but it is insignificant in comparison with other solutions. – ei-grad Nov 18 '18 at 13:38
  • 1
    There is an unneeded extra generator parenthesis, btw. – ei-grad Nov 18 '18 at 13:40
  • 1
    Without the unneeded extra generator parenthesis, it appears to be slightly faster (per timeit) and consumes about 3MB less memory (per memit for a file of 100,000 lines). – mikey Jan 07 '19 at 04:16
  • doesn't seem to work if the file is a text file with newlines. My problem is large txt files that need character counting. – Math is Hard Feb 07 '21 at 04:40
  • 3
    The file is not closed. – Jeyekomon Mar 10 '22 at 12:44
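
As the last comment points out, the one-liner never closes the file explicitly. A near-identical variant (a sketch, using the same `file_path` placeholder) that uses a context manager:

# Same counting idea, but the context manager closes the file deterministically.
with open(file_path, 'rb') as f:
    line_count = sum(1 for _ in f)
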
5

mmap the file, and count up the newlines.

import mmap

def mapcount(filename):
    with open(filename, "r+") as f:
        # Map the whole file into memory (length 0 maps the entire file) and
        # count lines by calling readline() on the map until it is exhausted.
        buf = mmap.mmap(f.fileno(), 0)
        lines = 0
        readline = buf.readline
        while readline():
            lines += 1
        return lines
Jean-Francois T.
  • 11,549
  • 7
  • 68
  • 107
Ignacio Vazquez-Abrams
  • 776,304
  • 153
  • 1,341
  • 1,358
2

I'd extend glglgl's answer and run his code using the multiprocessing Python module for a faster count:

import multiprocessing

def blocks(f, cut, size=64 * 1024):  # read the assigned byte range in 64 KiB blocks
    start, chunk = cut
    read_size = size
    _break = False
    while not _break:
        if f.tell() + size > start + chunk:
            # Last block of this chunk: only read up to the chunk boundary.
            read_size = int(start + chunk - f.tell())
            _break = True
        b = f.read(read_size)
        if not b:
            break
        yield b


def get_chunk_line_count(data):
    fn, chunk_id, cut = data
    start, chunk = cut
    cnt = 0
    last_bl = None

    with open(fn, "r") as f:
        f.seek(start)
        for bl in blocks(f, cut):
            cnt += bl.count('\n')
            last_bl = bl

        # Mirror the original adjustment: subtract one if the last block read
        # does not end with a newline.
        if last_bl and not last_bl.endswith('\n'):
            cnt -= 1

        return cnt
....
pool = multiprocessing.Pool(processes=pool_size,
                            initializer=start_process)
pool_outputs = pool.map(get_chunk_line_count, inputs)
pool.close()  # no more tasks
pool.join()

This will improve the counting performance about 20-fold. I wrapped it into a script and put it on GitHub.
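
For anyone wondering how to drive the functions above: the elided part has to build `inputs`, a list of `(filename, chunk_id, (start, chunk_length))` tuples, and set up the pool. A minimal sketch under those assumptions follows; the even byte split, the `count_lines_parallel` wrapper and the default pool size are illustrative, not the author's actual script, and splitting at arbitrary byte offsets is exactly what the end-of-chunk adjustment above tries to compensate for.

import multiprocessing
import os

def count_lines_parallel(fn, pool_size=4):
    # Split the file into pool_size contiguous byte ranges.
    file_size = os.path.getsize(fn)
    chunk = file_size // pool_size
    cuts = [(i * chunk, chunk) for i in range(pool_size)]
    # The last range also gets the remainder left over by the integer division.
    cuts[-1] = (cuts[-1][0], file_size - cuts[-1][0])
    inputs = [(fn, i, cut) for i, cut in enumerate(cuts)]

    pool = multiprocessing.Pool(processes=pool_size)
    try:
        # get_chunk_line_count() is the worker defined in the answer above.
        return sum(pool.map(get_chunk_line_count, inputs))
    finally:
        pool.close()
        pool.join()

if __name__ == "__main__":  # required for multiprocessing on Windows
    print(count_lines_parallel("myfile.txt"))  # 'myfile.txt' is a placeholder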

olekb
  • 638
  • 1
  • 9
  • 28
  • @olekb Thank you for sharing the multiprocessing approach. Quick question as a newbie: how do we run this code to count the lines in a big file (say, 'myfile.txt')? I tried `pool = multiprocessing.Pool(4); pool_outputs = pool.map(get_chunk_line_count, 'myfile.txt')`, but that causes an error. Thank you in advance for your answer! – user1330974 Aug 28 '19 at 19:08