34

I have a really simple script right now that counts lines in a text file using enumerate():

i = 0
f = open("C:/Users/guest/Desktop/file.log", "r")
for i, line in enumerate(f):
    pass
print i + 1
f.close()

This takes around three and a half minutes to go through a 15GB log file with ~30 million lines. It would be great if I could get this under two minutes, because these are daily logs and we want to do a monthly analysis, so the code will have to process 30 logs of ~15GB each (possibly more than an hour and a half in total), and we'd like to minimise the time & memory load on the server.

I would also settle for a good approximation/estimation method, but it would need to be accurate to about 4 significant figures...

Thank you!

glglgl
  • 89,107
  • 13
  • 149
  • 217
Adrienne
  • 465
  • 1
  • 4
  • 6
  • 3
    In general it would probably be faster to treat the file as binary data, read through it in reasonably-sized chunks (say, 4KB at a time), and count the `\n` characters in each chunk as you go. – aroth Mar 09 '12 at 05:11
  • 4
    This is not better performing than your naive solution, but fyi the pythonic way to write what you have here would be simply `with open(fname) as f: print sum(1 for line in f)` – wim Mar 09 '12 at 05:37
  • 1
    aroth: Thanks for the tip, I should look into that. wim: great, thanks, that's much shorter... – Adrienne Mar 09 '12 at 05:43
  • Take a look at [rawbigcount](http://stackoverflow.com/a/27517681/3420199) in Michael Bacon's answer. It may be helpful to you! – Diogo Feb 03 '16 at 13:20
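
A minimal sketch of the chunked binary count aroth suggests above; unlike the first answer below, which reads in text mode, this counts raw b"\n" bytes with no newline translation. The function name, path handling and 4KB chunk size are illustrative placeholders, not something from the thread.

def count_newlines_binary(path, chunk_size=4096):
    # Read the file as raw bytes in fixed-size chunks and count b"\n" in each.
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count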

5 Answers

47

Ignacio's answer is correct, but might fail if you have a 32-bit process.

But maybe it could be useful to read the file block-wise and then count the \n characters in each block.

def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b: break
        yield b

with open("file", "r") as f:
    print sum(bl.count("\n") for bl in blocks(f))

will do your job.

Note that I don't open the file as binary, so the \r\n will be converted to \n, making the counting more reliable.

For Python 3, and to make it more robust for reading files with all kinds of characters:

def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b: break
        yield b

with open("file", "r",encoding="utf-8",errors='ignore') as f:
    print (sum(bl.count("\n") for bl in blocks(f)))
SU3
  • 5,064
  • 3
  • 35
  • 66
glglgl
  • 89,107
  • 13
  • 149
  • 217
  • 1
    Just as one data point, a read of a large file of about 51 MB went from about a minute using the naive approach to under one second using this approach. – M Katz Dec 16 '13 at 21:23
  • 6
    @MKatz What now, "a large file" or "a file of about 51 MB"? ;-) – glglgl Mar 11 '14 at 20:27
  • this solution might miss out the last line but that might not matter for a huge file. – minhle_r7 Jul 25 '17 at 13:06
  • @ngọcminh.oss Only if the last line is incomplete. A text file is defined to end with a line break, see http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206 and https://stackoverflow.com/a/729795/296974. – glglgl Jul 25 '17 at 14:10
  • 2
    not that people care about definition. when you work with real data, everything is messy. but it doesn't matter anyway. – minhle_r7 Jul 26 '17 at 17:19
  • Re missing lines (i.e. lines that don't end in a "line break"): Relatively unimportant, I suppose, if the file is large. But I have files that vary from huge to one-liners. Unfortunately, some of the one-liners lack a trailing newline, and the program that uses this function assumes a return value of 0 means an empty file...which may not be true. So I had to do some other checking. – Mike Maxwell Oct 12 '20 at 18:12
23

I know it's a bit unfair, but you could do this:

import subprocess

int(subprocess.check_output(["wc", "-l", "C:\\alarm.bat"]).split()[0])

If you're on Windows, check out Coreutils.
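
Since the question involves ~30 daily logs for a monthly analysis, here is a hedged sketch of summing the `wc -l` counts over a whole directory. The glob pattern is a placeholder, and this assumes `wc` is on the PATH (e.g. via Coreutils on Windows):

import glob
import subprocess

# Hypothetical monthly total: one `wc -l` call per daily log, summed up.
total = 0
for path in glob.glob("C:/logs/*.log"):  # placeholder location of the daily logs
    total += int(subprocess.check_output(["wc", "-l", path]).split()[0])
print(total)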

Bram Vanroy
  • 27,032
  • 24
  • 137
  • 239
Jakob Bowyer
  • 33,878
  • 8
  • 76
  • 91
17

A fast, 1-line solution is:

sum(1 for i in open(file_path, 'rb'))

It should work on files of arbitrary size.

Yonas Kassa
  • 3,362
  • 1
  • 18
  • 27
AJSmyth
  • 419
  • 6
  • 13
  • I confirm that this is the fastest one (except the `wc -l` hack). Using the text mode gives a little dropdown in performance, but it is insignificant in comparison with other solutions. – ei-grad Nov 18 '18 at 13:38
  • 1
    There is an unneeded extra generator parenthesis, btw. – ei-grad Nov 18 '18 at 13:40
  • 1
    Without the unneeded extra generator parenthesis, it appears to be slightly faster (per timeit) and consumes about 3MB less memory (per memit for a file of 100,000 lines). – mikey Jan 07 '19 at 04:16
  • doesn't seem to work if the file is a text file with newlines. My problem is large txt files that need character counting. – Math is Hard Feb 07 '21 at 04:40
  • 3
    The file is not closed. – Jeyekomon Mar 10 '22 at 12:44
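
As the last comment points out, the one-liner never closes the file explicitly. A near-identical variant (a sketch, using the same `file_path` placeholder) that uses a context manager:

# Same counting idea, but the context manager closes the file deterministically.
with open(file_path, 'rb') as f:
    line_count = sum(1 for _ in f)
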
5

mmap the file, and count up the newlines.

import mmap

def mapcount(filename):
    with open(filename, "r+") as f:
        # Map the whole file into memory (length 0 maps the entire file) and
        # count lines by calling readline() on the map until it is exhausted.
        buf = mmap.mmap(f.fileno(), 0)
        lines = 0
        readline = buf.readline
        while readline():
            lines += 1
        return lines
Jean-Francois T.
  • 11,549
  • 7
  • 68
  • 107
Ignacio Vazquez-Abrams
  • 776,304
  • 153
  • 1,341
  • 1,358
2

I'd extend glglgl's answer and run his code using the multiprocessing Python module for a faster count:

import multiprocessing

def blocks(f, cut, size=64 * 1024):  # read the assigned byte range in 64 KiB blocks
    start, chunk = cut
    read_size = size
    _break = False
    while not _break:
        if f.tell() + size > start + chunk:
            # Last block of this chunk: only read up to the chunk boundary.
            read_size = int(start + chunk - f.tell())
            _break = True
        b = f.read(read_size)
        if not b:
            break
        yield b


def get_chunk_line_count(data):
    fn, chunk_id, cut = data
    start, chunk = cut
    cnt = 0
    last_bl = None

    with open(fn, "r") as f:
        f.seek(start)
        for bl in blocks(f, cut):
            cnt += bl.count('\n')
            last_bl = bl

        # Mirror the original adjustment: subtract one if the last block read
        # does not end with a newline.
        if last_bl and not last_bl.endswith('\n'):
            cnt -= 1

        return cnt
....
pool = multiprocessing.Pool(processes=pool_size,
                            initializer=start_process)
pool_outputs = pool.map(get_chunk_line_count, inputs)
pool.close()  # no more tasks
pool.join()

This will improve the counting performance about 20-fold. I wrapped it into a script and put it on GitHub.
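
For anyone wondering how to drive the functions above: the elided part has to build `inputs`, a list of `(filename, chunk_id, (start, chunk_length))` tuples, and set up the pool. A minimal sketch under those assumptions follows; the even byte split, the `count_lines_parallel` wrapper and the default pool size are illustrative, not the author's actual script, and splitting at arbitrary byte offsets is exactly what the end-of-chunk adjustment above tries to compensate for.

import multiprocessing
import os

def count_lines_parallel(fn, pool_size=4):
    # Split the file into pool_size contiguous byte ranges.
    file_size = os.path.getsize(fn)
    chunk = file_size // pool_size
    cuts = [(i * chunk, chunk) for i in range(pool_size)]
    # The last range also gets the remainder left over by the integer division.
    cuts[-1] = (cuts[-1][0], file_size - cuts[-1][0])
    inputs = [(fn, i, cut) for i, cut in enumerate(cuts)]

    pool = multiprocessing.Pool(processes=pool_size)
    try:
        # get_chunk_line_count() is the worker defined in the answer above.
        return sum(pool.map(get_chunk_line_count, inputs))
    finally:
        pool.close()
        pool.join()

if __name__ == "__main__":  # required for multiprocessing on Windows
    print(count_lines_parallel("myfile.txt"))  # 'myfile.txt' is a placeholder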

olekb
  • 638
  • 1
  • 9
  • 28
  • @olekb Thank you for sharing the multiprocessing approach. Quick question as a newbie: how do we run this code to count the lines in a big file (say, 'myfile.txt')? I tried `pool = multiprocessing.Pool(4); pool_outputs = pool.map(get_chunk_line_count, 'myfile.txt')`, but that causes an error. Thank you in advance for your answer! – user1330974 Aug 28 '19 at 19:08