27

For example, if my text file is:

blue
green
yellow
black

There are four lines here, and I want to get the result as four. How can I do that?

  • `with open('data.txt') as fp: for line in fp: if line.strip(): count += 1` – Sep 25 '13 at 09:46
    Yes, it will work, but the solution is not pythonic, better use `sum()`. – alecxe Sep 25 '13 at 09:57
    http://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python is more than enough explanation ;-) – Robert Caspary Sep 25 '13 at 10:28
    Possible duplicate of [How to get line count cheaply in Python?](http://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python) – Martin Thoma Jan 23 '17 at 09:59

11 Answers

53

You can use sum() with a generator expression:

with open('data.txt') as f:
    print(sum(1 for _ in f))

Note that you cannot use len(f), since f is an iterator. _ is a conventional name for a throwaway variable; see What is the purpose of the single underscore "_" variable in Python?.

You could use len(f.readlines()), but this creates an additional list in memory, and it won't even work on huge files that don't fit in memory.
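As a small, self-contained sketch of that difference (it writes a throwaway `data.txt` in the current directory, which is an assumption, not part of the answer), you can compare the two approaches with `tracemalloc`: the generator version counts lines without keeping them, while `readlines()` builds the whole list first.

```python
import tracemalloc

# Create a tiny sample file (assumption: the current directory is writable).
with open('data.txt', 'w') as f:
    f.write('blue\ngreen\nyellow\nblack\n')

tracemalloc.start()
with open('data.txt') as f:
    count_gen = sum(1 for _ in f)      # reads one line at a time
gen_peak = tracemalloc.get_traced_memory()[1]
tracemalloc.reset_peak()               # requires Python 3.9+

with open('data.txt') as f:
    count_list = len(f.readlines())    # materialises every line in a list
list_peak = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

print(count_gen, count_list)  # both count 4 lines
```

On a four-line file the two peaks are close; the gap grows with the size of the file.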

alecxe
    So pythonic, so very pythonic :O – SARose Apr 08 '17 at 00:47
  • Would it be more expeditious if you wrote it as `with open('data.txt') as f: print sum([1 for _ in f])`? – jimh Jul 16 '17 at 10:00
    @jimh - it is better to use just `sum(1 for _ in f)` because it implicitly uses a generator expression within the parentheses and does not create a list of 1s. However, your version `sum([1 for _ in f])` would create a list of 1s before summing them, which allocates memory unnecessarily. – blokeley Nov 25 '17 at 10:19
  • @blokeley is it faster at the expense of memory is my question – jimh Nov 27 '17 at 18:30
  • @jimh There is no such tradeoff here. The generator expression will be doing less since it doesn't have to spend time allocating memory. A comprehension can be an optimization in case you can reuse the allocated list or dict. – ferrix Feb 21 '19 at 07:23
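The difference the commenters are discussing can be measured with `timeit`; this is only a sketch on synthetic data, and the exact numbers will vary by machine.

```python
import timeit

setup = "lines = ['x'] * 100_000"

# Generator expression: sum() consumes values one at a time.
gen_time = timeit.timeit("sum(1 for _ in lines)", setup=setup, number=50)

# List comprehension: a 100,000-element list of 1s is allocated each run
# before sum() ever sees it.
list_time = timeit.timeit("sum([1 for _ in lines])", setup=setup, number=50)

print(f"generator: {gen_time:.3f}s  list: {list_time:.3f}s")
```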
23

This link (How to get line count cheaply in Python?) has lots of potential solutions, but they all ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering.

Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered:

def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024 * 1024)

def rawpycount(filename):
    f = open(filename, 'rb')
    f_gen = _make_gen(f.raw.read)
    return sum(buf.count(b'\n') for buf in f_gen)

Here are my timings:

rawpycount        0.0048  0.0046   1.00
bufcount          0.0074  0.0066   1.43
wccount             0.01    0.01   2.17
itercount          0.014   0.014   3.04
opcount            0.021    0.02   4.43
kylecount          0.023   0.021   4.58
simplecount        0.022   0.022   4.81
mapcount           0.038   0.032   6.82

I would post it there, but I'm a relatively new user to stack exchange and don't have the requisite manna.

EDIT:

This can be done entirely with inline generator expressions using itertools, but it gets pretty weird looking:

from itertools import (takewhile,repeat)

def rawbigcount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
    return sum( buf.count(b'\n') for buf in bufgen if buf )
Michael Bacon
    Thanks! This itertool implementation is blazing fast and lets me give a percentage of completion as a very large file is read. – Karl Henselin Dec 30 '14 at 19:33
  • I'm getting an error: AttributeError: 'file' object has no attribute 'raw'. Any ideas why? – MD004 Dec 03 '15 at 21:06
  • The code here is python 3 specific, and the raw/unicode split happened there. My python 2 memory is not good at this point, but if you're using python 2, I think if you change the mode on the open() call to 'r' and just change "f.raw.read()" to "f.read()" you'll effectively get the same thing in python 2. – Michael Bacon Dec 07 '15 at 15:30
  • Would changing the return statement in the first example to `return sum(map(methodcaller("count", b'\n'), f_gen))`, importing `methodcaller` from `operator` help speed this up any (`'imap` from `itertools` as well if python2)? I would also constify the `1024*1024` math to save a few extra cycles. Would like to see the comparison with the second example as well. – Kumba Aug 05 '18 at 22:46
  • Excellent answer and worked great for me. However I changed the first line in rawbigcount to be `with open(filename, 'rb') as f:` – Sverrir Sigmundarson Jan 26 '23 at 21:13
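Pulling together the comments above (using a context manager so the file is closed, and calling `f.read` directly instead of the Python-3-specific `f.raw.read`, which goes through Python's buffering but counts the same newlines), a hedged sketch might look like this; the function name and `bufsize` parameter are illustrative, not from the original answer:

```python
from itertools import takewhile, repeat

def buffered_line_count(filename, bufsize=1024 * 1024):
    """Count '\n' bytes by reading the file in large binary chunks."""
    with open(filename, 'rb') as f:
        # Yield chunks until read() returns b'' at end of file.
        bufgen = takewhile(lambda x: x, (f.read(bufsize) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen)
```

Note this counts newline bytes, so a file whose last line lacks a trailing newline will report one fewer line than a line-iteration count.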
8

You can use sum() with a generator expression here. The generator expression yields 1 for every line in the file, and sum() adds them all together to get the total count.

with open('text.txt') as myfile:
    count = sum(1 for line in myfile)

It seems from what you have tried that you don't want to count empty lines. In that case you can do:

with open('text.txt') as myfile:
    count = sum(1 for line in myfile if line.rstrip('\n'))
TerryA
5
count = 0
with open('filename.txt', 'rb') as f:
    for line in f:
        count += 1

print(count)
Koustav Ghosal
2

One liner:

total_line_count = sum(1 for line in open("filename.txt"))

print(total_line_count)
Surya
0

This one also gives the number of lines in a file.

a = open('filename.txt', 'r')
l = a.read()
count = l.splitlines()
print(len(count))
Naveen
0

Use:

num_lines = sum(1 for line in open('data.txt'))
print(num_lines)

That will work.

0

For the people saying to use `with open("filename.txt","r") as f`: you can instead do `anyname = open("filename.txt","r")`.

def main():

    file = open("infile.txt", 'r')
    count = 0
    for line in file:
        count += 1

    print(count)

main()
Michell
0

Here is how you can do it with a list comprehension. This wastes a little time and memory, since line.strip() is called twice and every non-empty line is kept in a list.

with open('textfile.txt') as file:
    lines = [line.strip()
             for line in file
             if line.strip() != '']
print("number of lines =  {}".format(len(lines)))
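A variant (assuming Python 3.8+ for assignment expressions) that strips each line only once; the sample file contents here are illustrative, not part of the answer:

```python
# Create a sample textfile.txt for the demo (assumption, not in the answer).
with open('textfile.txt', 'w') as f:
    f.write('one\n\ntwo\n   \nthree\n')

with open('textfile.txt') as file:
    lines = [stripped
             for line in file
             if (stripped := line.strip()) != '']
print("number of lines =  {}".format(len(lines)))  # number of lines =  3
```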
Amaan
0

I am not new to Stack Overflow; I just never had an account and usually came here for answers. I can't comment or vote up an answer yet, but I wanted to say that the code from Michael Bacon above works really well. I am new to Python but not to programming. I have been reading Python Crash Course, and there are a few things I wanted to do to break up the cover-to-cover reading approach. One utility with uses from an ETL or even a data-quality perspective is capturing the row count of a file independently of any ETL. The file has X number of rows; you import into SQL or Hadoop and you end up with X number of rows. You can validate the row count of a raw data file at the lowest level.

I have been playing with his code and doing some testing, and it is very efficient so far. I created several different CSV files of various sizes and row counts. You can see my code below; my comments provide the times and details. The code Michael Bacon provided above runs about six times faster than the normal Python method of just looping over the lines.

Hope this helps someone.


import time
from itertools import (takewhile, repeat)

def readfilesimple(myfile):

    # watch me whip
    linecounter = 0
    with open(myfile, 'r') as file_object:
        # watch me nae nae
        for lines in file_object:
            linecounter += 1

    return linecounter

def readfileadvanced(myfile):

    # watch me whip
    f = open(myfile, 'rb')
    # watch me nae nae
    bufgen = takewhile(lambda x: x, (f.raw.read(1024 * 1024) for _ in repeat(None)))
    return sum(buf.count(b'\n') for buf in bufgen if buf)


# ************************************
# Main
# ************************************

#start the clock

start_time = time.time()

# 6.7 seconds to read a 475MB file that has 24 million rows and 3 columns
#mycount = readfilesimple("c:/junk/book1.csv")

# 0.67 seconds to read a 475MB file that has 24 million rows and 3 columns
#mycount = readfileadvanced("c:/junk/book1.csv")

# 25.9 seconds to read a 3.9Gb file that has 3.25 million rows and 104 columns
#mycount = readfilesimple("c:/junk/WideCsvExample/ReallyWideReallyBig1.csv")

# 5.7 seconds to read a 3.9Gb file that has 3.25 million rows and 104 columns
#mycount = readfileadvanced("c:/junk/WideCsvExample/ReallyWideReallyBig1.csv")


# 292.92 seconds to read a 43Gb file that has 35.7 million rows and 104 columns
mycount = readfilesimple("c:/junk/WideCsvExample/ReallyWideReallyBig.csv")

# 57 seconds to read a 43Gb file that has 35.7 million rows and 104 columns
#mycount = readfileadvanced("c:/junk/WideCsvExample/ReallyWideReallyBig.csv")


#stop the clock
elapsed_time = time.time() - start_time


print("\nCode Execution: " + str(elapsed_time) + " seconds\n")
print("File contains: " + str(mycount) + " lines of text.")
S. J.
0

If you import pandas, you can use the shape attribute to determine this. I am not sure how it performs. The code is as follows:

import pandas as pd

data = pd.read_csv("yourfile")  # reads in your file
num_records = data.shape        # shape is a (rows, columns) tuple
n_records = num_records[0]      # assigns the number of rows to n_records
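One caveat worth sketching: `read_csv` treats the first row as a header by default, so `shape[0]` counts data rows rather than physical lines (the inline CSV here is illustrative, not from the answer).

```python
import io
import pandas as pd

csv_text = "color\nblue\ngreen\nyellow\nblack\n"  # 5 physical lines
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape[0])  # 4 data rows; pass header=None to count every row
```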
ascripter