1278

How do I get a line count of a large file in the most memory- and time-efficient manner?

def file_len(filename):
    with open(filename) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1
Mateen Ulhaq
  • 24,552
  • 19
  • 101
  • 135
SilentGhost
  • 307,395
  • 66
  • 306
  • 293
  • 16
    Do you need exact line count or will an approximation suffice? – pico May 11 '09 at 20:14
  • 61
    I would add i=-1 before for loop, since this code doesn't work for empty files. – Maciek Sawicki Dec 27 '11 at 16:13
  • 14
    @Legend: I bet pico is thinking, get the file size (with seek(0,2) or equiv), divide by approximate line length. You could read a few lines at the beginning to guess the average line length. – Anne Feb 07 '12 at 17:02
  • 41
    `enumerate(f, 1)` and ditch the `i + 1`? – Ian Mackinnon Feb 21 '13 at 12:25
  • 6
    @IanMackinnon Works for empty files, but you have to initialize *i* to *0* before the for-loop. – scai Aug 13 '13 at 16:29
  • related: [Why is reading lines from stdin much slower in C++ than Python?](http://stackoverflow.com/q/9371238/4279). See comparison with [`wc-l.py` there](http://stackoverflow.com/questions/9371238/why-is-reading-lines-from-stdin-much-slower-in-c-than-python#comment11966378_9371238) – jfs Oct 09 '15 at 21:42
  • I originally came to this post trying to find a fast way of pre-allocating a table stored as text. However, in my case, I found that it is faster to append values to a list (allowing the list to grow dynamically) rather than read the file twice. Depending on your I/O speed, this may be something to think about. – Gordon Bean Oct 20 '15 at 00:36
  • 1
    There is a better way, doesn't change much. Add the "r" flag to the open function so it doesn't have to automatically figure out what flag to use. I timed it, that method is ~0.01 seconds slower without the "r" flag. – andrew Dec 03 '15 at 14:04
  • This code returns `1` for empty files *as well as* files that have 1 line without a newline. – Marco May 07 '19 at 14:26
  • 1
    My one line solution was `total_row_count = len(open(single_file).read().splitlines())`. Testing the speed of both against a 1GB csv file yours takes `1.7`seconds and mine takes `7.3` – mRyan Apr 27 '21 at 13:58
  • @mRyan reading and splitting lines is not ideal because you have to not only have the whole file in memory but also process it to extract the lines and store in a list before getting the length. For a one liner, the answer from Kyle is the best `num_lines = sum(1 for _ in open('myfile.txt'))` ... but it is still relatively slow compared to some other solutions (with buffers or `mmap` for example). See https://stackoverflow.com/a/76197308/1603480 – Jean-Francois T. May 08 '23 at 05:56

44 Answers44

793

One line, faster than the for loop of the OP (altought not the fastest) and very concise:

num_lines = sum(1 for _ in open('myfile.txt'))

You can also boost speed (and robustness) by using rbU mode and include it in a with block to close the file:

with open("myfile.txt", "rbU") as f:
    num_lines = sum(1 for _ in f)
Jean-Francois T.
  • 11,549
  • 7
  • 68
  • 107
Kyle
  • 8,453
  • 1
  • 18
  • 10
  • 8
    its similar to sum(sequence of 1) every line is counting as 1. >>> [ 1 for line in range(10) ] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] >>> sum( 1 for line in range(10) ) 10 >>> – James Sapam Dec 13 '13 at 05:22
  • 6
    num_lines = sum(1 for line in open('myfile.txt') if line.rstrip()) for filter empty lines – Honghe.Wu Mar 03 '14 at 09:26
  • 98
    as we open a file, will this be closed automatically once we iterate over all the elements? Is it required to 'close()'? I think we cannot use 'with open()' in this short statement, right? – Mannaggia Mar 18 '14 at 15:31
  • 17
    A slight lint improvement: `num_lines = sum(1 for _ in open('myfile.txt'))` – thlik Jun 13 '21 at 17:54
  • 4
    It's not any faster than the other solutions, see https://stackoverflow.com/a/68385697/353337. – Nico Schlömer Jul 14 '21 at 22:22
  • @Mannaggia we could also enclose it in a `with open(...) as f` block and loop over `f`. – Jean-Francois T. May 08 '23 at 01:22
  • 1
    'U' mode enabled universal newlines but was removed in Python 3.11 as it became the default behaviour in Python 3.0. Source: https://docs.python.org/3.10/library/functions.html#open – Seriously Jun 02 '23 at 10:03
427

You can't get any better than that.

After all, any solution will have to read the entire file, figure out how many \n you have, and return that result.

Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound, best you can do is make sure you don't use unnecessary memory, but it looks like you have that covered.

[Edit May 2023]

As commented in many other answers, in Python 3 there are better alternatives. The for loop is not the most efficient. For example, using mmap or buffers is more efficient.

Jean-Francois T.
  • 11,549
  • 7
  • 68
  • 107
Yuval Adam
  • 161,610
  • 92
  • 305
  • 395
  • 8
    Exactly, even WC is reading through the file, but in C and it's probably pretty optimized. – Ólafur Waage May 10 '09 at 10:38
  • 7
    As far as I understand the Python file IO is done through C as well. http://docs.python.org/library/stdtypes.html#file-objects – Tomalak May 10 '09 at 10:41
  • 1
    `posix_fadvise()` might be used http://stackoverflow.com/questions/860893/is-python-automagically-parallelizing-io-and-cpu-or-memory-bound-sections/861004#861004 Though I've not noticed any improvement https://gist.github.com/0ac760859e614cd03652 – jfs Jan 31 '11 at 09:08
  • 11
    @Tomalak That's a red herring. While python and wc might be issuing the same syscalls, python has opcode dispatch overhead that wc doesn't have. – bobpoekert Jan 11 '13 at 22:53
  • 4
    You can approximate a line count by sampling. It can be thousands of times faster. See: http://www.documentroot.com/2011/02/approximate-line-count-for-very-large.html – Erik Aronesty Jun 14 '16 at 20:30
  • 6
    Other answers seem to indicate this categorical answer is wrong, and should therefore be deleted rather than kept as accepted. – Skippy le Grand Gourou Jan 25 '17 at 13:59
  • 1
    wouldn't generators or list comprehension methods with the sum() method be faster? – jimh Jul 16 '17 at 10:02
  • 2
    This answer is plain wrong. Please see glglgl's answer here: https://stackoverflow.com/a/9631635/217802 – minhle_r7 Jul 25 '17 at 13:04
  • 1
    Simply untrue. Finding lines means finding newlines. You can parallelise reading chunks of the file, and searching for newlines, for example by having multiple processes search regions of a memory mapped file. – Marcin Jan 09 '19 at 22:19
  • @DaveLiu want to explain why? – Hyperbole Nov 19 '19 at 17:55
  • 1
    @Hyperbole, the multiple different highly-upvoted answers. If you have uniform-sized data, it could be a simple matter of mathematical calculation. If you have distributed computing capabilities, you could do as martlark has done. Saying "You can't get any better than that" fails to take into consideration multiple conditions and is a sweeping generalization. Stackoverflow is about finding solutions to specific problems, not just "Well, that seems like about it." Yes, any solution is I/O bound, but as others have demonstrated, you can get closer to that bound than OP's code. – Dave Liu Nov 19 '19 at 19:48
  • @DaveLiu Okay. I was asking because your comment predates the two runner-up answers, so I thought maybe you had seen something they didn't. – Hyperbole Nov 19 '19 at 20:18
  • Funny that `wc` is mentioned as a great deal for this when it is known to perform very poorly for counting lines of a big file compared to `awk` or `grep`... – myradio Feb 19 '21 at 19:15
  • Another problem with "iterating by lines" approach is reading *whole* lines into memory. A huge line might eat up a lot of RAM for literally no reason. – Nikolaj Š. Jun 09 '22 at 07:50
229

I believe that a memory mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped filed (mmap) (mapcount); and the buffer read solution offered by Mykola Kharechko (bufcount).

I ran each function five times, and calculated the average run-time for a 1.2 million-line text file.

Windows XP, Python 2.5, 2GB RAM, 2 GHz AMD processor

Here are my results:

mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714

Edit: numbers for Python 2.6:

mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297

So the buffer read strategy seems to be the fastest for Windows/Python 2.6

Here is the code:

from __future__ import with_statement
import time
import mmap
import random
from collections import defaultdict

def mapcount(filename):
    with open(filename, "r+") as f:
        buf = mmap.mmap(f.fileno(), 0)
        lines = 0
        readline = buf.readline
        while readline():
            lines += 1
        return lines

def simplecount(filename):
    lines = 0
    for line in open(filename):
        lines += 1
    return lines

def bufcount(filename):
    f = open(filename)                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    return lines

def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1


counts = defaultdict(list)

for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)

for key, vals in counts.items():
    print key.__name__, ":", sum(vals) / float(len(vals))
Jean-Francois T.
  • 11,549
  • 7
  • 68
  • 107
Ryan Ginstrom
  • 13,915
  • 5
  • 45
  • 60
  • 34
    It seems that `wccount()` is the fastest https://gist.github.com/0ac760859e614cd03652 – jfs Jan 31 '11 at 08:18
  • The buffered read is the fastest solution, not `mmap` or `wccount`. See https://stackoverflow.com/a/68385697/353337. – Nico Schlömer Jul 14 '21 at 22:23
  • @NicoSchlömer it depends the characteristics of your file. See https://stackoverflow.com/a/76197308/1603480 for a comparison of both on different files. – Jean-Francois T. May 08 '23 at 03:28
215

I had to post this on a similar question until my reputation score jumped a bit (thanks to whoever bumped me!).

All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3, you'll default into Unicode.)

Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered:

def rawcount(filename):
    f = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.raw.read

    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'\n')
        buf = read_f(buf_size)

    return lines

Using a separate generator function, this runs a smidge faster:

def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024*1024)

def rawgencount(filename):
    f = open(filename, 'rb')
    f_gen = _make_gen(f.raw.read)
    return sum( buf.count(b'\n') for buf in f_gen )

This can be done completely with generators expressions in-line using itertools, but it gets pretty weird looking:

from itertools import (takewhile,repeat)

def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
    return sum( buf.count(b'\n') for buf in bufgen )

Here are my timings:

function      average, s  min, s   ratio
rawincount        0.0043  0.0041   1.00
rawgencount       0.0044  0.0042   1.01
rawcount          0.0048  0.0045   1.09
bufcount          0.008   0.0068   1.64
wccount           0.01    0.0097   2.35
itercount         0.014   0.014    3.41
opcount           0.02    0.02     4.83
kylecount         0.021   0.021    5.05
simplecount       0.022   0.022    5.25
mapcount          0.037   0.031    7.46
Quentin Pradet
  • 4,691
  • 2
  • 29
  • 41
Michael Bacon
  • 2,600
  • 1
  • 12
  • 8
  • 44
    I am working with 100Gb+ files, and your rawgencounts is the only feasible solution I have seen so far. Thanks! – soungalo Nov 10 '15 at 11:47
  • 2
    is `wccount` in this table for the subprocess shell `wc` tool? – Anentropic Nov 11 '15 at 18:05
  • 1
    found this in another comment, I guess it is then https://gist.github.com/zed/0ac760859e614cd03652 – Anentropic Nov 11 '15 at 18:33
  • Would changing the return statement in the first example to `return sum(map(methodcaller("count", b'\n'), f_gen))`, importing `methodcaller` from `operator` help speed this up any (`imap` from `itertools` as well if python2)? I would also constify the 1024*1024 math in `_make_gen` to save a few extra cycles. Would like to see the comparison with the pure-generator example as well. – Kumba Aug 05 '18 at 22:57
  • 7
    Thanks @michael-bacon, it's a really nice solution. You can make the `rawincount` solution less weird looking by using `bufgen = iter(partial(f.raw.read, 1024*1024), b'')` instead of combining `takewhile` and `repeat`. – Peter H. Aug 06 '19 at 06:32
  • 2
    Oh, partial function, yeah, that's a nice little tweak. Also, I assumed that the 1024*1024 would get merged by the interpreter and treated as a constant but that was on hunch not documentation. – Michael Bacon Aug 08 '19 at 16:20
  • 3
    @MichaelBacon, would it be faster to open the file with `buffering=0` and then calling read instead of just opening the file as "rb" and calling raw.read, or will that be optimized to the same thing? – Avraham Nov 19 '19 at 18:53
  • 1
    @Avraham this is a late reply but I am not enough of a python core junkie to know that for sure. All I can say is, run the speed tests and see! – Michael Bacon Apr 03 '20 at 01:28
  • If the file is opened via gzip.open you get `AttributeError: 'GzipFile' object has no attribute 'raw'`. I think `read_f` line should be replaced with `read_f = f.raw.read if hasattr(f, 'raw') and hasattr(f.raw, 'read') else f.read` – Tns Sep 16 '20 at 12:56
  • Also for plain (not compressed) files you may want to use `mmap` – Tns Sep 16 '20 at 12:58
  • This is really fast. Still one should probably close the file handler before leaving the function, right? – Kyle Barron Dec 01 '20 at 04:03
  • 1
    @Avraham checked this - looks like the same time on my data – Mikhail_Sam Jun 09 '21 at 11:48
  • 1
    Good solution. I've also checked @Avraham's suggestion and got - on average - a slight improvement (it's also a bit closer to the standard api use - whatever that means :)). – Timus Jun 20 '22 at 08:39
105

You could execute a subprocess and run wc -l filename

import subprocess

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, 
                                              stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])
nosklo
  • 217,122
  • 57
  • 293
  • 297
Ólafur Waage
  • 68,817
  • 22
  • 142
  • 198
  • 8
    what would be the windows version of this? – SilentGhost May 10 '09 at 10:30
  • 2
    You can refer to this SO question regarding that. http://stackoverflow.com/questions/247234/do-you-know-a-similar-program-for-wc-unix-word-count-command-on-windows – Ólafur Waage May 10 '09 at 10:32
  • 7
    Indeed, in my case (Mac OS X) this takes 0.13s versus 0.5s for counting the number of lines "for x in file(...)" produces, versus 1.0s counting repeated calls to str.find or mmap.find. (The file I used to test this has 1.3 million lines.) – bendin May 10 '09 at 12:06
  • 1
    No need to involve the shell on that. edited answer and added example code; – nosklo May 11 '09 at 12:23
  • 1
    On command line (without the overhead of creating another shell) this is the same fast as the more clear and portable python-only solution. See also: http://stackoverflow.com/questions/849058/is-it-possible-to-speed-up-python-io – Davide May 11 '09 at 17:34
  • 3
    Is not cross platform. – e-info128 Apr 12 '17 at 15:03
  • And by "cross-platform" you mean it doesn't work on Windows. – CalculatorFeline May 24 '17 at 21:10
66

After a perfplot analysis, one has to recommend the buffered read solution

def buf_count_newlines_gen(fname):
    def _make_gen(reader):
        while True:
            b = reader(2 ** 16)
            if not b: break
            yield b

    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count

It's fast and memory-efficient. Most other solutions are about 20 times slower.

enter image description here


Code to reproduce the plot:

import mmap
import subprocess
from functools import partial

import perfplot


def setup(n):
    fname = "t.txt"
    with open(fname, "w") as f:
        for i in range(n):
            f.write(str(i) + "\n")
    return fname


def for_enumerate(fname):
    i = 0
    with open(fname) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1


def sum1(fname):
    return sum(1 for _ in open(fname))


def mmap_count(fname):
    with open(fname, "r+") as f:
        buf = mmap.mmap(f.fileno(), 0)

    lines = 0
    while buf.readline():
        lines += 1
    return lines


def for_open(fname):
    lines = 0
    for _ in open(fname):
        lines += 1
    return lines


def buf_count_newlines(fname):
    lines = 0
    buf_size = 2 ** 16
    with open(fname) as f:
        buf = f.read(buf_size)
        while buf:
            lines += buf.count("\n")
            buf = f.read(buf_size)
    return lines


def buf_count_newlines_gen(fname):
    def _make_gen(reader):
        b = reader(2 ** 16)
        while b:
            yield b
            b = reader(2 ** 16)

    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count


def wc_l(fname):
    return int(subprocess.check_output(["wc", "-l", fname]).split()[0])


def sum_partial(fname):
    with open(fname) as f:
        count = sum(x.count("\n") for x in iter(partial(f.read, 2 ** 16), ""))
    return count


def read_count(fname):
    return open(fname).read().count("\n")


b = perfplot.bench(
    setup=setup,
    kernels=[
        for_enumerate,
        sum1,
        mmap_count,
        for_open,
        wc_l,
        buf_count_newlines,
        buf_count_newlines_gen,
        sum_partial,
        read_count,
    ],
    n_range=[2 ** k for k in range(27)],
    xlabel="num lines",
)
b.save("out.png")
b.show()
Nico Schlömer
  • 53,797
  • 27
  • 201
  • 249
  • 1
    I hae very long lines in my file; i'm thinking the buffer should be allocated only once using `readinto` – fuzzyTew Aug 27 '22 at 11:29
  • Great graph: thanks for the code. But actually, this overlooks the case where a line is more than just than 10 characters. For long lines, `mmap` tends to be more efficient than `buf_count_newlines_gen`: see answer https://stackoverflow.com/a/76197308/1603480 – Jean-Francois T. May 08 '23 at 03:23
49

Here is a python program to use the multiprocessing library to distribute the line counting across machines/cores. My test improves counting a 20million line file from 26 seconds to 7 seconds using an 8 core windows 64 server. Note: not using memory mapping makes things much slower.

import multiprocessing, sys, time, os, mmap
import logging, logging.handlers

def init_logger(pid):
    console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
    logger = logging.getLogger()  # New logger at root level
    logger.setLevel( logging.INFO )
    logger.handlers.append( logging.StreamHandler() )
    logger.handlers[0].setFormatter( logging.Formatter( console_format, '%d/%m/%y %H:%M:%S' ) )

def getFileLineCount( queues, pid, processes, file1 ):
    init_logger(pid)
    logging.info( 'start' )

    physical_file = open(file1, "r")
    #  mmap.mmap(fileno, length[, tagname[, access[, offset]]]

    m1 = mmap.mmap( physical_file.fileno(), 0, access=mmap.ACCESS_READ )

    #work out file size to divide up line counting

    fSize = os.stat(file1).st_size
    chunk = (fSize / processes) + 1

    lines = 0

    #get where I start and stop
    _seedStart = chunk * (pid)
    _seekEnd = chunk * (pid+1)
    seekStart = int(_seedStart)
    seekEnd = int(_seekEnd)

    if seekEnd < int(_seekEnd + 1):
        seekEnd += 1

    if _seedStart < int(seekStart + 1):
        seekStart += 1

    if seekEnd > fSize:
        seekEnd = fSize

    #find where to start
    if pid > 0:
        m1.seek( seekStart )
        #read next line
        l1 = m1.readline()  # need to use readline with memory mapped files
        seekStart = m1.tell()

    #tell previous rank my seek start to make their seek end

    if pid > 0:
        queues[pid-1].put( seekStart )
    if pid < processes-1:
        seekEnd = queues[pid].get()

    m1.seek( seekStart )
    l1 = m1.readline()

    while len(l1) > 0:
        lines += 1
        l1 = m1.readline()
        if m1.tell() > seekEnd or len(l1) == 0:
            break

    logging.info( 'done' )
    # add up the results
    if pid == 0:
        for p in range(1,processes):
            lines += queues[0].get()
        queues[0].put(lines) # the total lines counted
    else:
        queues[0].put(lines)

    m1.close()
    physical_file.close()

if __name__ == '__main__':
    init_logger( 'main' )
    if len(sys.argv) > 1:
        file_name = sys.argv[1]
    else:
        logging.fatal( 'parameters required: file-name [processes]' )
        exit()

    t = time.time()
    processes = multiprocessing.cpu_count()
    if len(sys.argv) > 2:
        processes = int(sys.argv[2])
    queues=[] # a queue for each process
    for pid in range(processes):
        queues.append( multiprocessing.Queue() )
    jobs=[]
    prev_pipe = 0
    for pid in range(processes):
        p = multiprocessing.Process( target = getFileLineCount, args=(queues, pid, processes, file_name,) )
        p.start()
        jobs.append(p)

    jobs[0].join() #wait for counting to finish
    lines = queues[0].get()

    logging.info( 'finished {} Lines:{}'.format( time.time() - t, lines ) )
namit
  • 6,780
  • 4
  • 35
  • 41
Martlark
  • 14,208
  • 13
  • 83
  • 99
  • How does this work with files much bigger than main memory? for instance a 20GB file on a system with 4GB RAM and 2 cores – Brian Minton Sep 23 '14 at 21:18
  • Hard to test now, but I presume it would page the file in and out. – Martlark Sep 24 '14 at 11:32
  • 6
    This is pretty neat code. I was surprised to find that it is faster to use multiple processors. I figured that the IO would be the bottleneck. In older Python versions, line 21 needs int() like chunk = int((fSize / processes)) + 1 – Karl Henselin Dec 30 '14 at 19:45
  • do it load all the file into the memory? what about a bigger fire where the size is bigger then the ram on the computer? – pelos Dec 21 '18 at 21:30
  • The files are mapped into virtual memory, so the size of the file and the amount of actual memory is usually not a restriction. – Martlark Dec 23 '18 at 22:51
  • 1
    Would you mind if I formatted the answer with black? https://black.vercel.app/ – Martin Thoma Feb 05 '22 at 08:27
  • It does need it – Martlark Feb 06 '22 at 11:41
46

A one-line bash solution similar to this answer, using the modern subprocess.check_output function:

def line_count(filename):
    return int(subprocess.check_output(['wc', '-l', filename]).split()[0])
1''
  • 26,823
  • 32
  • 143
  • 200
  • 4
    This answer should be voted up to a higher spot in this thread for Linux/Unix users. Despite the majority preferences in a cross-platform solution, this is a superb way on Linux/Unix. For a 184-million-line csv file I have to sample data from, it provides the best runtime. Other pure python solutions take on average 100+ seconds whereas subprocess call of `wc -l` takes ~ 5 seconds. – Shan Dou Jun 27 '18 at 16:06
  • `shell=True` is bad for security, it is better to avoid it. – Alexey Vazhnov May 09 '20 at 22:16
18

I would use Python's file object method readlines, as follows:

with open(input_file) as foo:
    lines = len(foo.readlines())

This opens the file, creates a list of lines in the file, counts the length of the list, saves that to a variable and closes the file again.

Daniel Lee
  • 2,030
  • 1
  • 23
  • 29
  • 9
    While this is one of the first ways that comes to mind, it probably isn't very memory efficient, especially if counting lines in files up to 10 GB (Like I do), which is a noteworthy disadvantage. – Steen Schütt Apr 17 '14 at 15:36
  • @TimeSheep Is this an issue for files with _many_ (say, billions) of small lines, or files which have extremely long lines (say, Gigabytes per line)? – robert Jun 03 '18 at 17:40
  • The reason I ask is, it would seem that the compiler should be able to optimize this away by not creating an intermediate list. – robert Jun 03 '18 at 17:41
  • @dmityugov Per Python docs, `xreadlines` has been deprecated since 2.3, as it just returns an iterator. `for line in file` is the stated replacement. See: https://docs.python.org/2/library/stdtypes.html#file.xreadlines – Kumba Aug 05 '18 at 22:53
13

This is the fastest thing I have found using pure python. You can use whatever amount of memory you want by setting buffer, though 2**16 appears to be a sweet spot on my computer.

from functools import partial

buffer=2**16
with open(myfile) as f:
        print sum(x.count('\n') for x in iter(partial(f.read,buffer), ''))

I found the answer here Why is reading lines from stdin much slower in C++ than Python? and tweaked it just a tiny bit. Its a very good read to understand how to count lines quickly, though wc -l is still about 75% faster than anything else.

Community
  • 1
  • 1
jeffpkamp
  • 2,732
  • 2
  • 27
  • 51
12

Here is what I use, seems pretty clean:

import subprocess

def count_file_lines(file_path):
    """
    Counts the number of lines in a file using wc utility.
    :param file_path: path to file
    :return: int, no of lines
    """
    num = subprocess.check_output(['wc', '-l', file_path])
    num = num.split(' ')
    return int(num[0])

UPDATE: This is marginally faster than using pure python but at the cost of memory usage. Subprocess will fork a new process with the same memory footprint as the parent process while it executes your command.

radtek
  • 34,210
  • 11
  • 144
  • 111
  • 3
    Just as a side note, this won't work on Windows of course. – Bram Vanroy Feb 25 '19 at 12:51
  • core utils apparently provides "wc" for windows https://stackoverflow.com/questions/247234/do-you-know-a-similar-program-for-wc-unix-word-count-command-on-windows. You can also use a linux VM in your windows box if your code will end up running in linux in prod. – radtek Feb 25 '19 at 16:45
  • Or WSL, highly advised over any VM if stuff like this is the only thing you do. `:-)` – Bram Vanroy Feb 25 '19 at 16:59
  • Yeah that works. I'm not a windows guy but from goolging I learned WSL = Windows Subsystem for Linux =) – radtek Feb 25 '19 at 21:39
  • 3
    python3.7: subprocess return bytes, so code looks like this: int(subprocess.check_output(['wc', '-l', file_path]).decode("utf-8").lstrip().split(" ")[0]) – Alexey Alexeenka Dec 17 '19 at 08:21
12
def file_len(full_path):
  """ Count number of lines in a file."""
  f = open(full_path)
  nr_of_lines = sum(1 for line in f)
  f.close()
  return nr_of_lines
pkit
  • 7,993
  • 6
  • 36
  • 36
  • The command "sum(1 for line in f)" seems to delete the content of the file. The command "f.readline()" returns null if I put it after that line. – Ente Fetz May 12 '21 at 15:35
  • @EnteFetz that's because the file handle is exhausted, so there are no more lines to read. If you do `f.seek(0); f.readline()` this problem won't persist – C.Nivs Aug 05 '22 at 02:57
10

One line solution:

import os
os.system("wc -l  filename")  

My snippet:

>>> os.system('wc -l *.txt')

0 bar.txt
1000 command.txt
3 test_file.txt
1003 total
kalehmann
  • 4,821
  • 6
  • 26
  • 36
TheExorcist
  • 1,966
  • 1
  • 19
  • 25
  • 4
    Good idea, unfortunately this does not work on Windows though. – Kim Jan 20 '17 at 20:06
  • 4
    if you want to be surfer of python , say good bye to windows.Believe me you will thank me one day . – TheExorcist Jan 22 '17 at 10:38
  • 9
    I just considered it noteworthy that this will only work on windows. I prefer working on a linux/unix stack myself, but when writing software IMHO one should consider the side effects a program could have when run under different OSes. As the OP did not mention his platform and in case anyone pops on this solution via google and copies it (unaware of the limitations a Windows system might have), I wanted to add the note. – Kim Jan 22 '17 at 12:42
  • You can't save output of `os.system()` to variable and post-process it anyhow. – An Se Jan 16 '20 at 09:16
  • @AnSe you are correct but question is not asked whether it saves or not.I guess you are understanding the context. – TheExorcist Jan 16 '20 at 10:49
  • @TheExorcist, nope, in question author actually uses a func that returns a value. – An Se Jan 16 '20 at 10:52
  • @AnSe yes sir you are right, but question asked was What is the most efficient way both memory- and time-wise? later OP ask to suggest some other ways, in answer OP used a return value function as an instance.But may be your are right, I will throw this question to meta-tags, for further analysis. – TheExorcist Feb 11 '20 at 11:21
  • If you want ONLY the count change to it os.system("wc -l < filename") –  May 10 '21 at 19:50
9

Kyle's answer

num_lines = sum(1 for line in open('my_file.txt'))

is probably best, an alternative for this is

num_lines =  len(open('my_file.txt').read().splitlines())

Here is the comparision of performance of both

In [20]: timeit sum(1 for line in open('Charts.ipynb'))
100000 loops, best of 3: 9.79 µs per loop

In [21]: timeit len(open('Charts.ipynb').read().splitlines())
100000 loops, best of 3: 12 µs per loop
Community
  • 1
  • 1
Chillar Anand
  • 27,936
  • 9
  • 119
  • 136
7

I got a small (4-8%) improvement with this version which re-uses a constant buffer so it should avoid any memory or GC overhead:

lines = 0
buffer = bytearray(2048)
with open(filename) as f:
  while f.readinto(buffer) > 0:
      lines += buffer.count('\n')

You can play around with the buffer size and maybe see a little improvement.

Scott Persinger
  • 3,554
  • 2
  • 20
  • 12
  • Nice. To account for files that don't end in \n, add 1 outside of loop if buffer and buffer[-1]!='\n' – ryuusenshi Nov 14 '13 at 18:37
  • A bug: buffer in the last round might not be clean. – Jay Nov 29 '14 at 05:07
  • what if in between buffers one portion ends with \ and the other portion starts with n? that will miss one new line in there, I would sudgest to variables to store the end and the start of each chunk, but that might add more time to the script =( – pelos Dec 19 '18 at 15:47
5

Just to complete the above methods I tried a variant with the fileinput module:

import fileinput as fi   
def filecount(fname):
        for line in fi.input(fname):
            pass
        return fi.lineno()

And passed a 60mil lines file to all the above stated methods:

mapcount : 6.1331050396
simplecount : 4.588793993
opcount : 4.42918205261
filecount : 43.2780818939
bufcount : 0.170812129974

It's a little surprise to me that fileinput is that bad and scales far worse than all the other methods...

BandGap
  • 1,745
  • 4
  • 19
  • 26
5

This code is shorter and clearer. It's probably the best way:

num_lines = open('yourfile.ext').read().count('\n')
Texom512
  • 4,785
  • 3
  • 16
  • 16
5

As for me this variant will be the fastest:

#!/usr/bin/env python

def main():
    f = open('filename')                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    print lines

if __name__ == '__main__':
    main()

reasons: buffering faster than reading line by line and string.count is also very fast

SilentGhost
  • 307,395
  • 66
  • 306
  • 293
Mykola Kharechko
  • 3,104
  • 5
  • 31
  • 40
4

I have modified the buffer case like this:

def CountLines(filename):
    f = open(filename)
    try:
        lines = 1
        buf_size = 1024 * 1024
        read_f = f.read # loop optimization
        buf = read_f(buf_size)

        # Empty file
        if not buf:
            return 0

        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)

        return lines
    finally:
        f.close()

Now also empty files and the last line (without \n) are counted.

Dummy
  • 41
  • 1
  • Maybe also explain (or add in comment in the code) what you changed and what for ;). Might give people some more inside in your code much easier (rather than "parsing" the code in the brain). – Styxxy Nov 06 '12 at 00:50
  • The loop optimization I think allows Python to do a local variable lookup at read_f, https://www.python.org/doc/essays/list2str/ – Nate Anderson Apr 03 '15 at 15:39
3
print open('file.txt', 'r').read().count("\n") + 1
Andrés Torres
  • 747
  • 5
  • 16
3

Simple method:

1)

>>> f = len(open("myfile.txt").readlines())
>>> f

430
>>> f = open("myfile.txt").read().count('\n')
>>> f
430
>>>
num_lines = len(list(open('myfile.txt')))
Innat
  • 16,113
  • 6
  • 53
  • 101
Mohideen bin Mohammed
  • 18,813
  • 10
  • 112
  • 118
3

A lot of answers already, but unfortunately most of them are just tiny economies on a barely optimizable problem...

I worked on several projects where line count was the core function of the software, and working as fast as possible with a huge number of files was of paramount importance.

The main bottleneck with line count is I/O access, as you need to read each line in order to detect the line return character, there is simply no way around. The second potential bottleneck is memory management: the more you load at once, the faster you can process, but this bottleneck is negligible compared to the first.

Hence, there are 3 major ways to reduce the processing time of a line count function, apart from tiny optimizations such as disabling gc collection and other micro-managing tricks:

  1. Hardware solution: the major and most obvious way is non-programmatic: buy a very fast SSD/flash hard drive. By far, this is how you can get the biggest speed boosts.

  2. Data preparation solution: if you generate or can modify how the files you process are generated, or if it's acceptable that you can pre-process them, first convert the line return to unix style (\n) as this will save 1 character compared to Windows or MacOS styles (not a big save but it's an easy gain), and secondly and most importantly, you can potentially write lines of fixed length. If you need variable length, you can always pad smaller lines. This way, you can calculate instantly the number of lines from the total filesize, which is much faster to access. Often, the best solution to a problem is to pre-process it so that it better fits your end purpose.

  3. Parallelization + hardware solution: if you can buy multiple hard disks (and if possible SSD flash disks), then you can even go beyond the speed of one disk by leveraging parallelization, by storing your files in a balanced way (easiest is to balance by total size) among disks, and then read in parallel from all those disks. Then, you can expect to get a multiplier boost in proportion with the number of disks you have. If buying multiple disks is not an option for you, then parallelization likely won't help (except if your disk has multiple reading headers like some professional-grade disks, but even then the disk's internal cache memory and PCB circuitry will likely be a bottleneck and prevent you from fully using all heads in parallel, plus you have to devise a specific code for this hard drive you'll use because you need to know the exact cluster mapping so that you store your files on clusters under different heads, and so that you can read them with different heads after). Indeed, it's commonly known that sequential reading is almost always faster than random reading, and parallelization on a single disk will have a performance more similar to random reading than sequential reading (you can test your hard drive speed in both aspects using CrystalDiskMark for example).

If none of those are an option, then you can only rely on micro-managing tricks to improve by a few percents the speed of your line counting function, but don't expect anything really significant. Rather, you can expect the time you'll spend tweaking will be disproportionated compared to the returns in speed improvement you'll see.

gaborous
  • 15,832
  • 10
  • 83
  • 102
2

If one wants to get the line count cheaply in Python in Linux, I recommend this method:

import os
print os.popen("wc -l file_path").readline().split()[0]

file_path can be both abstract file path or relative path. Hope this may help.

Lerner Zhang
  • 6,184
  • 2
  • 49
  • 66
2
def count_text_file_lines(path):
    with open(path, 'rt') as file:
        line_count = sum(1 for _line in file)
    return line_count
jciloa
  • 1,039
  • 1
  • 11
  • 22
  • Could you please explain what is wrong with it if you think it is wrong? It worked for me. Thanks! – jciloa Dec 20 '17 at 17:04
  • I would be interested in why this answer was downvoted, too. It iterates over the file by lines and sums them up. I like it, it is short and to the point, what's wrong with it? – cessor Mar 16 '18 at 11:23
2

Using Numba

We can use Numba to JIT (Just in time) compile our function to machine code. def numbacountparallel(fname) runs 2.8x faster than def file_len(fname) from the question.

Notes:

The OS had already cached the file to memory before the benchmarks were run as I don't see much disk activity on my PC. The time would be much slower when reading the file for the first time making the time advantage of using Numba insignificant.

The JIT compilation takes extra time the first time the function is called.

This would be useful if we were doing more than just counting lines.

Cython is another option.

http://numba.pydata.org/

Conclusion

As counting lines will be IO bound, use the def file_len(fname) from the question unless you want to do more than just count lines.

import timeit

from numba import jit, prange
import numpy as np

from itertools import (takewhile,repeat)

FILE = '../data/us_confirmed.csv' # 40.6MB, 371755 line file
CR = ord('\n')


# Copied from the question above. Used as a benchmark
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1


# Copied from another answer. Used as a benchmark
def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.read(1024*1024*10) for _ in repeat(None)))
    return sum( buf.count(b'\n') for buf in bufgen )


# Single thread
@jit(nopython=True)
def numbacountsingle_chunk(bs):

    c = 0
    for i in range(len(bs)):
        if bs[i] == CR:
            c += 1

    return c


def numbacountsingle(filename):
    f = open(filename, "rb")
    total = 0
    while True:
        chunk = f.read(1024*1024*10)
        lines = numbacountsingle_chunk(chunk)
        total += lines
        if not chunk:
            break

    return total


# Multi thread
@jit(nopython=True, parallel=True)
def numbacountparallel_chunk(bs):

    c = 0
    for i in prange(len(bs)):
        if bs[i] == CR:
            c += 1

    return c


def numbacountparallel(filename):
    f = open(filename, "rb")
    total = 0
    while True:
        chunk = f.read(1024*1024*10)
        lines = numbacountparallel_chunk(np.frombuffer(chunk, dtype=np.uint8))
        total += lines
        if not chunk:
            break

    return total

print('numbacountparallel')
print(numbacountparallel(FILE)) # This allows Numba to compile and cache the function without adding to the time.
print(timeit.Timer(lambda: numbacountparallel(FILE)).timeit(number=100))

print('\nnumbacountsingle')
print(numbacountsingle(FILE))
print(timeit.Timer(lambda: numbacountsingle(FILE)).timeit(number=100))

print('\nfile_len')
print(file_len(FILE))
print(timeit.Timer(lambda: rawincount(FILE)).timeit(number=100))

print('\nrawincount')
print(rawincount(FILE))
print(timeit.Timer(lambda: rawincount(FILE)).timeit(number=100))

Time in seconds for 100 calls to each function

numbacountparallel
371755
2.8007332000000003

numbacountsingle
371755
3.1508585999999994

file_len
371755
6.7945494

rawincount
371755
6.815438
Jean-Francois T.
  • 11,549
  • 7
  • 68
  • 107
Paul Menzies
  • 161
  • 1
  • 4
2

This is a meta-comment on some of the other answers.

  • The line-reading and buffered \n-counting techniques won't return the same answer for every file, because some text files have no newline at the end of the last line. You can work around this by checking the last byte of the last nonempty buffer and adding 1 if it's not b'\n'.

  • In Python 3, opening the file in text mode and in binary mode can yield different results, because text mode by default recognizes CR, LF, and CRLF as line endings (converting them all to '\n'), while in binary mode only LF and CRLF will be counted if you count b'\n'. This applies whether you read by lines or into a fixed-size buffer. The classic Mac OS used CR as a line ending; I don't know how common those files are these days.

  • The buffer-reading approach uses a bounded amount of RAM independent of file size, while the line-reading approach could read the entire file into RAM at once in the worst case (especially if the file uses CR line endings). In the worst case it may use substantially more RAM than the file size, because of overhead from dynamic resizing of the line buffer and (if you opened in text mode) Unicode decoding and storage.

  • You can improve the memory usage, and probably the speed, of the buffered approach by pre-allocating a bytearray and using readinto instead of read. One of the existing answers (with few votes) does this, but it's buggy (it double-counts some bytes).

  • The top buffer-reading answer uses a large buffer (1 MiB). Using a smaller buffer can actually be faster because of OS readahead. If you read 32K or 64K at a time, the OS will probably start reading the next 32K/64K into the cache before you ask for it, and each trip to the kernel will return almost immediately. If you read 1 MiB at a time, the OS is unlikely to speculatively read a whole megabyte. It may preread a smaller amount but you will still spend a significant amount of time sitting in the kernel waiting for the disk to return the rest of the data.

benrg
  • 1,395
  • 11
  • 13
1

How about this one-liner:

file_length = len(open('myfile.txt','r').read().split('\n'))

Takes 0.003 sec using this method to time it on a 3900 line file

def c():
  import time
  s = time.time()
  file_length = len(open('myfile.txt','r').read().split('\n'))
  print time.time() - s
onetwopunch
  • 3,279
  • 2
  • 29
  • 44
1

count = max(enumerate(open(filename)))[0]

pyanon
  • 1,065
  • 6
  • 3
1

An alternative for big files is using xreadlines():

count = 0
for line in open(thefilepath).xreadlines(  ): count += 1

For Python 3 please see: What substitutes xreadlines() in Python 3?

blackbrandt
  • 2,010
  • 1
  • 15
  • 32
alexisdevarennes
  • 5,437
  • 4
  • 24
  • 38
1

How about this?

import fileinput
import sys

counter=0
for line in fileinput.input([sys.argv[1]]):
    counter+=1

fileinput.close()
print counter
leba-lev
  • 2,788
  • 10
  • 33
  • 43
1

There are already so many answers with great timing comparison, but I believe they are just looking at number of lines to measure performance (e.g. great graph from Nico Schlömer https://stackoverflow.com/a/68385697/1603480).

To be accurate while measuring performance, we should look at:

  • the number of lines
  • the average size of the lines
  • ... the resulting total size of the file (which might impact memory)

First of all, the function of the OP (with a for) and the function sum(1 for line in f) are not performing that well...

Good contenders are with mmap or buffer.

To summarize: based on my analysis (Python 3.9 on Windows with SSD):

  1. For big files with relatively short lines (within 100 characters): use function with a buffer buf_count_newlines_gen
def buf_count_newlines_gen(fname: str) -> int:
    """Count the number of lines in a file"""
    def _make_gen(reader):
        b = reader(1024 * 1024)
        while b:
            yield b
            b = reader(1024 * 1024)

    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count

  1. For files with potentially longer lines (up to 2000 characters), disregarding the number of lines: use function with mmap: count_nb_lines_mmap
def count_nb_lines_mmap(file: Path) -> int:
    """Count the number of lines in a file"""
    with open(file, mode="rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        nb_lines = 0
        while mm.readline():
            nb_lines += 1
        mm.close()
        return nb_lines
  1. For a short code with very good performance (especially for files of size up to medium size):
def itercount(filename: str) -> int:
    """Count the number of lines in a file"""
    with open(filename, 'rbU') as f:
        return sum(1 for _ in f)

Here is a summary of the different metrics (average time with timeit on 7 runs with 10 loops each):

Function Small file, short lines Small file, long lines Big file, short lines Big file, long lines Bigger file, short lines
... size ... 0.04 MB 1.16 MB 318 MB 17 MB 328 MB
... nb lines ... 915 lines < 100 chars 915 lines < 2000 chars 389000 lines < 100 chars 389,000 lines < 2000 chars 9.8 millions lines < 100 chars
count_nb_lines_blocks 0.183 ms 1.718 ms 36.799 ms 415.393 ms 517.920 ms
count_nb_lines_mmap 0.185 ms 0.582 ms 44.801 ms 185.461 ms 691.637 ms
buf_count_newlines_gen 0.665 ms 1.032 ms 15.620 ms 213.458 ms 318.939 ms
itercount 0.135 ms 0.817 ms 31.292 ms 223.120 ms 628.760 ms

Note: I have also compared count_nb_lines_mmap and buf_count_newlines_gen on a file of 8 GB, with 9.7 million lines of more than 800 characters. We got an average of 5.39s for buf_count_newlines_gen vs 4.2s for count_nb_lines_mmap, so this latter function seems indeed better for files with longer lines.

Here is the code I have used:

import mmap
from pathlib import Path

def count_nb_lines_blocks(file: Path) -> int:
    """Count the number of lines in a file"""

    def blocks(files, size=65536):
        while True:
            b = files.read(size)
            if not b:
                break
            yield b

    with open(file, encoding="utf-8", errors="ignore") as f:
        return sum(bl.count("\n") for bl in blocks(f))


def count_nb_lines_mmap(file: Path) -> int:
    """Count the number of lines in a file"""
    with open(file, mode="rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        nb_lines = 0
        while mm.readline():
            nb_lines += 1
        mm.close()
        return nb_lines


def count_nb_lines_sum(file: Path) -> int:
    """Count the number of lines in a file"""
    with open(file, "r", encoding="utf-8", errors="ignore") as f:
        return sum(1 for line in f)


def count_nb_lines_for(file: Path) -> int:
    """Count the number of lines in a file"""
    i = 0
    with open(file) as f:
        for i, _ in enumerate(f, start=1):
            pass
    return i


def buf_count_newlines_gen(fname: str) -> int:
    """Count the number of lines in a file"""
    def _make_gen(reader):
        b = reader(1024 * 1024)
        while b:
            yield b
            b = reader(1024 * 1024)

    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count


def itercount(filename: str) -> int:
    """Count the number of lines in a file"""
    with open(filename, 'rbU') as f:
        return sum(1 for _ in f)


files = [small_file, big_file, small_file_shorter, big_file_shorter, small_file_shorter_sim_size, big_file_shorter_sim_size]
for file in files:
    print(f"File: {file.name} (size: {file.stat().st_size / 1024 ** 2:.2f} MB)")
    for func in [
        count_nb_lines_blocks,
        count_nb_lines_mmap,
        count_nb_lines_sum,
        count_nb_lines_for,
        buf_count_newlines_gen,
        itercount,
    ]:
        result = func(file)
        time = Timer(lambda: func(file)).repeat(7, 10)
        print(f" * {func.__name__}: {result} lines in {mean(time) / 10 * 1000:.3f} ms")
    print()
File: small_file.ndjson (size: 1.16 MB)
 * count_nb_lines_blocks: 915 lines in 1.718 ms
 * count_nb_lines_mmap: 915 lines in 0.582 ms
 * count_nb_lines_sum: 915 lines in 1.993 ms
 * count_nb_lines_for: 915 lines in 3.876 ms
 * buf_count_newlines_gen: 915 lines in 1.032 ms
 * itercount: 915 lines in 0.817 ms

File: big_file.ndjson (size: 317.99 MB)
 * count_nb_lines_blocks: 389000 lines in 415.393 ms
 * count_nb_lines_mmap: 389000 lines in 185.461 ms
 * count_nb_lines_sum: 389000 lines in 485.370 ms
 * count_nb_lines_for: 389000 lines in 967.075 ms
 * buf_count_newlines_gen: 389000 lines in 213.458 ms
 * itercount: 389000 lines in 223.120 ms

File: small_file__shorter.ndjson (size: 0.04 MB)
 * count_nb_lines_blocks: 915 lines in 0.183 ms
 * count_nb_lines_mmap: 915 lines in 0.185 ms
 * count_nb_lines_sum: 915 lines in 0.251 ms
 * count_nb_lines_for: 915 lines in 0.244 ms
 * buf_count_newlines_gen: 915 lines in 0.665 ms
 * itercount: 915 lines in 0.135 ms

File: big_file__shorter.ndjson (size: 17.42 MB)
 * count_nb_lines_blocks: 389000 lines in 36.799 ms
 * count_nb_lines_mmap: 389000 lines in 44.801 ms
 * count_nb_lines_sum: 389000 lines in 59.068 ms
 * count_nb_lines_for: 389000 lines in 81.387 ms
 * buf_count_newlines_gen: 389000 lines in 15.620 ms
 * itercount: 389000 lines in 31.292 ms

File: small_file__shorter_sim_size.ndjson (size: 1.21 MB)
 * count_nb_lines_blocks: 36457 lines in 1.920 ms
 * count_nb_lines_mmap: 36457 lines in 2.615 ms
 * count_nb_lines_sum: 36457 lines in 3.993 ms
 * count_nb_lines_for: 36457 lines in 6.011 ms
 * buf_count_newlines_gen: 36457 lines in 1.363 ms
 * itercount: 36457 lines in 2.147 ms

File: big_file__shorter_sim_size.ndjson (size: 328.19 MB)
 * count_nb_lines_blocks: 9834248 lines in 517.920 ms
 * count_nb_lines_mmap: 9834248 lines in 691.637 ms
 * count_nb_lines_sum: 9834248 lines in 1109.669 ms
 * count_nb_lines_for: 9834248 lines in 1683.859 ms
 * buf_count_newlines_gen: 9834248 lines in 318.939 ms
 * itercount: 9834248 lines in 628.760 ms
Jean-Francois T.
  • 11,549
  • 7
  • 68
  • 107
1

the result of opening a file is an iterator, which can be converted to a sequence, which has a length:

with open(filename) as f:
   return len(list(f))

this is more concise than your explicit loop, and avoids the enumerate.

Andrew Jaffe
  • 26,554
  • 4
  • 50
  • 59
  • 12
    which means that 100 Mb file will need to be read into the memory. – SilentGhost May 10 '09 at 11:41
  • yep, good point, although I wonder about the speed (as opposed to memory) difference. It's probably possible to create an iterator that does this, but I think it would be equivalent to your solution. – Andrew Jaffe May 10 '09 at 11:53
  • 6
    -1, it's not just the memory, but having to construct the list in memory. – orip Sep 21 '09 at 21:14
1

What about this

def file_len(fname):
  counts = itertools.count()
  with open(fname) as f: 
    for _ in f: counts.next()
  return counts.next()
odwl
  • 2,095
  • 2
  • 17
  • 15
0
def line_count(path):
    count = 0
    with open(path) as lines:
        for count, l in enumerate(lines, start=1):
            pass
    return count
Michael Whatcott
  • 5,603
  • 6
  • 36
  • 50
0

You can use the os.path module in the following way:

import os
import subprocess
Number_lines = int( (subprocess.Popen( 'wc -l {0}'.format( Filename ), shell=True, stdout=subprocess.PIPE).stdout).readlines()[0].split()[0] )

, where Filename is the absolute path of the file.

Victor
  • 1,014
  • 1
  • 9
  • 11
0

Create an executable script file named count.py:

#!/usr/bin/python

import sys
count = 0
for line in sys.stdin:
    count+=1

And then pipe the file's content into the python script: cat huge.txt | ./count.py. Pipe works also on Powershell, so you will end up counting number of lines.

For me, on Linux it was 30% faster than the naive solution:

count=1
with open('huge.txt') as f:
    count+=1
0x90
  • 39,472
  • 36
  • 165
  • 245
0

Simplest and shortest way I would use is:

f = open("my_file.txt", "r")
len(f.readlines())
DesiKeki
  • 656
  • 8
  • 9
  • 1
    This has been spotted as [not memory efficient](https://stackoverflow.com/a/52365651/2227743) in another answer. – Eric Aya Aug 11 '21 at 09:16
0

Why not read the first 100 and the last 100 lines and estimate the average line length, then divide the total file size through that numbers? If you don't need a exact value this could work.

Georg Schölly
  • 124,188
  • 49
  • 220
  • 267
  • I need a exact value, but the problem is that in general case line length could be fairly different. I'm afraid though that your approach won't be the most efficient one. – SilentGhost May 10 '09 at 18:50
-1

Similarly:

lines = 0
with open(path) as f:
    for line in f:
        lines += 1
Colonel Panic
  • 132,665
  • 89
  • 401
  • 465
-1

Another possibility:

import subprocess

def num_lines_in_file(fpath):
    return int(subprocess.check_output('wc -l %s' % fpath, shell=True).strip().split()[0])
whitebeard
  • 1,071
  • 14
  • 19
J.J.
  • 350
  • 2
  • 6
-1

If the file can fit into memory, then

with open(fname) as f:
    count = len(f.read().split(b'\n')) - 1
Karthik
  • 49
  • 2
-1

If all the lines in your file are the same length (and contain only ASCII characters)*, you can do the following very cheaply:

fileSize     = os.path.getsize( pathToFile )  # file size in bytes
bytesPerLine = someInteger                    # don't forget to account for the newline character
numLines     = fileSize // bytesPerLine

*I suspect more effort would be required to determine the number of bytes in a line if unicode characters like é are used.

Jet Blue
  • 5,109
  • 7
  • 36
  • 48
-3

what about this?

import sys
sys.stdin=open('fname','r')
data=sys.stdin.readlines()
print "counted",len(data),"lines"
S.C
  • 1
-3

Why wouldn't the following work?

import sys

# input comes from STDIN
file = sys.stdin
data = file.readlines()

# get total number of lines in file
lines = len(data)

print lines

In this case, the len function uses the input lines as a means of determining the length.

  • 5
    The question is not how to get the line count, I've demonstrated in the question itself what I was doing: the question was how to do that efficiently. In your solution the whole file is read into the memory, which is at least inefficient for large files and at most impossible for huge ones. – SilentGhost Dec 05 '10 at 18:28
  • 2
    Actually it's likely very efficient except when it's impossible. :-) – kindall Jul 19 '11 at 18:23