163

I'm using Python (with the Django framework) to read a CSV file. As you can see, I pull just two lines out of this CSV. What I have been trying to do is also store the total number of rows of the CSV in a variable.

How can I get the total number of rows?

file = object.myfilePath
fileObject = csv.reader(file)
data = []
for i in range(2):
    data.append(next(fileObject))

I have tried:

len(fileObject)
fileObject.length
GrantU
  • 1
    What is `file_read`? Is it a file handle (as in `file_read = open("myfile.txt")`? – David Robinson Apr 19 '13 at 15:50
  • 1
    file_read = csv.reader(file) updated question should make sense now. – GrantU Apr 19 '13 at 15:50
  • Have a look at this question for thoughts on that topic: http://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python – shredding Apr 19 '13 at 15:53
  • 1
    This one is simple: http://stackoverflow.com/questions/27504056/row-count-in-a-csv-file – AjayKumarBasuthkar Mar 16 '16 at 14:28
  • 1
    The accepted answer by @martjin-pieters is correct, but this question is worded poorly. In your pseudocode, you almost certainly want to count the number of *rows* i.e. *records* – as opposed to "Count how many *lines* are in a CSV". Because some CSV datasets may include fields which may be multiline. – dancow Aug 19 '20 at 04:07
  • Also the algorithm you use to count the number of records is going to depend on whether or not you are also parsing every record and doing something with each one. If so, just simply count while you're iterating instead of performing an entire "table scan" separately. – trpt4him Aug 21 '20 at 12:17

21 Answers

242

You need to count the number of rows:

row_count = sum(1 for row in fileObject)  # fileObject is your csv.reader

Using sum() with a generator expression makes for an efficient counter, avoiding storing the whole file in memory.

If you already read 2 rows to start with, then you need to add those 2 rows to your total; rows that have already been read are not being counted.
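A minimal end-to-end sketch of that pattern, using an in-memory file in place of the question's `object.myfilePath`:

```python
import csv
import io

# Stand-in for the open file from the question (object.myfilePath).
file = io.StringIO("a,1\nb,2\nc,3\nd,4\n")
fileObject = csv.reader(file)

# Pull the first 2 rows, as the question does.
data = [next(fileObject) for _ in range(2)]

# Count the remaining rows, then add back the 2 already read.
row_count = 2 + sum(1 for row in fileObject)
print(row_count)  # 4
```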

Martijn Pieters
  • 3
    Thanks. That works, but do I have to read the lines first? That seems a bit of a hit? – GrantU Apr 19 '13 at 15:54
  • 9
    You *have* to read the lines; the lines are not guaranteed to be a fixed size, so the only way to count them is to read them all. – Martijn Pieters Apr 19 '13 at 15:55
  • it's weird, because I have a file with more than 4.5 million rows, and this method only counts 53 rows... – Escachator Apr 10 '15 at 21:41
  • 1
    @Escachator: what platform are you on? Are there EOF ([CTRL-Z, `\x1A`](http://en.wikipedia.org/wiki/Control-Z)) characters in the file? How did you *open* the file? – Martijn Pieters Apr 10 '15 at 21:45
  • I am doing the following: file_read = csv.reader('filename') row_count = sum(1 for row in file_read) Don't think there are EOF in the file, just pure "," figures, and \n – Escachator Apr 10 '15 at 21:50
  • 4
    @Escachator: Your filename has 53 characters then. The reader takes an iterable or an open file object but not a filename. – Martijn Pieters Apr 10 '15 at 21:51
  • I see... let me fix that – Escachator Apr 10 '15 at 21:52
  • 1
    now it works, and it is super fast! I did: file_read = open('filename') row_count = sum(1 for row in file_read) big thanks! – Escachator Apr 10 '15 at 21:53
  • 15
    Note that if you want to then iterate through the reader again (to process the rows, say) then you'll need to reset the iterator, and recreate the reader object: `file.seek(0)` then `fileObject = csv.reader(file)` – KevinTydlacka Jul 12 '18 at 22:05
91

2018-10-29 EDIT

Thank you for the comments.

I tested several kinds of code for getting the number of lines in a CSV file, comparing their speed. The best method is below.

with open(filename) as f:
    sum(1 for line in f)

Here is the code I tested.

import timeit
import csv
import pandas as pd

filename = './sample_submission.csv'

def talktime(filename, funcname, func):
    print(f"# {funcname}")
    t = timeit.timeit(f'{funcname}("{filename}")', setup=f'from __main__ import {funcname}', number=100) / 100
    print('Elapsed time : ', t)
    print('n = ', func(filename))
    print('\n')

def sum1forline(filename):
    with open(filename) as f:
        return sum(1 for line in f)
talktime(filename, 'sum1forline', sum1forline)

def lenopenreadlines(filename):
    with open(filename) as f:
        return len(f.readlines())
talktime(filename, 'lenopenreadlines', lenopenreadlines)

def lenpd(filename):
    return len(pd.read_csv(filename)) + 1
talktime(filename, 'lenpd', lenpd)

def csvreaderfor(filename):
    cnt = 0
    with open(filename) as f:
        cr = csv.reader(f)
        for row in cr:
            cnt += 1
    return cnt
talktime(filename, 'csvreaderfor', csvreaderfor)

def openenum(filename):
    cnt = 0
    with open(filename) as f:
        for i, line in enumerate(f,1):
            cnt += 1
    return cnt
talktime(filename, 'openenum', openenum)

The result was below.

# sum1forline
Elapsed time :  0.6327946722068599
n =  2528244


# lenopenreadlines
Elapsed time :  0.655304473598555
n =  2528244


# lenpd
Elapsed time :  0.7561274056295324
n =  2528244


# csvreaderfor
Elapsed time :  1.5571560935772661
n =  2528244


# openenum
Elapsed time :  0.773000013928679
n =  2528244

In conclusion, sum(1 for line in f) is fastest. But there might not be a significant difference from len(f.readlines()).

sample_submission.csv is 30.2MB and has 31 million characters.

dixhom
  • Should you also close the file? to save space? – lesolorzanov Nov 21 '17 at 09:07
  • 1
    Why do you prefer sum() over len() in your conclusion? Len() is faster in your results! – gosuto Jan 14 '18 at 08:21
  • 1
    Nice answer. One addition. Although slower, one should prefer the `for row in csv_reader:` solution when the CSV is supposed to contain valid quoted newlines according to [rfc4180](https://tools.ietf.org/html/rfc4180). @dixhom how large was the file you've tested? – Simon Lang Jan 23 '18 at 14:07
  • Nice one. `sum1forline` could be even faster if the file is opened as `'rb'`. – S3DEV Jan 21 '21 at 14:26
20

To do this, you need a bit of code like my example here:

file = open("Task1.csv")
numline = len(file.readlines())
print (numline)

I hope this helps everyone.

sam collins
  • 2
    I like this short answer, but it is slower than Martijn Pieters's. For 10M lines, `%time sum(1 for row in open("df_data_raw.csv"))` cost 4.91s while `%time len(open("df_data_raw.csv").readlines())` cost 14.6s. – Pengju Zhao May 13 '18 at 09:13
  • 1
    The original title to the question ("Count how many lines are in a CSV Python") was worded confusingly/misleadingly, since the questioner seems to want the number of rows/records. Your answer would give a wrong number of rows in any dataset in which there are fields with newline characters – dancow Aug 19 '20 at 04:10
14

Several of the above suggestions count the number of LINES in the csv file. But some CSV files will contain quoted strings which themselves contain newline characters. MS CSV files usually delimit records with \r\n, but use \n alone within quoted strings.

For a file like this, counting lines of text (as delimited by newline) in the file will give too large a result. So for an accurate count you need to use csv.reader to read the records.
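A short illustration of the difference, with a made-up record whose quoted field spans two physical lines:

```python
import csv
import io

raw = 'id,comment\n1,"line one\nline two"\n2,simple\n'

# Counting physical lines over-counts: the quoted newline adds one.
line_count = sum(1 for line in io.StringIO(raw))
print(line_count)    # 4

# csv.reader parses the quoted newline, so the record count is accurate.
record_count = sum(1 for row in csv.reader(io.StringIO(raw)))
print(record_count)  # 3 (header plus 2 data records)
```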

Old Bald Guy
10

First, you have to open the file with `open()`:

input_file = open("nameOfFile.csv", "r+")

Then use `csv.reader()` to read the CSV:

reader_file = csv.reader(input_file)

Finally, you can get the number of rows with `len()`:

value = len(list(reader_file))

The total code is this:

input_file = open("nameOfFile.csv","r+")
reader_file = csv.reader(input_file)
value = len(list(reader_file))

Remember that if you want to reuse the CSV file, you have to call input_file.seek(0), because when you build a list from reader_file, it reads the whole file and the file pointer changes its position.
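Roughly, reusing the file would look like this (with an in-memory file standing in for nameOfFile.csv; note the method is `seek`, not `fseek`):

```python
import csv
import io

input_file = io.StringIO("a,b\n1,2\n3,4\n")  # stand-in for open("nameOfFile.csv")
reader_file = csv.reader(input_file)
value = len(list(reader_file))
print(value)  # 3

# list() consumed the underlying file; rewind before reading again.
input_file.seek(0)
reader_file = csv.reader(input_file)
first_row = next(reader_file)
print(first_row)  # ['a', 'b']
```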

protti
10

After iterating over the whole file with the csv.reader() method, you have the total number of lines read in the instance variable line_num:

import csv
with open('csv_path_file') as f:
    csv_reader = csv.reader(f)
    for row in csv_reader:
        pass
    print(csv_reader.line_num)

Quoting the official documentation:

csvreader.line_num

The number of lines read from the source iterator.

Small caveat:

  • the total number of lines includes the header, if the CSV has one.
ivanleoncz
serpiko
6

row_count = sum(1 for line in open(filename)) worked for me.

Note: sum(1 for line in csv.reader(filename)) returns the number of characters in the filename string, not a row count. csv.reader iterates over whatever it is given, and a plain string yields one character at a time, each parsed as its own row. Pass an open file object instead.
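A quick demonstration of what actually happens when a string is passed instead of a file object (the string here is just the literal filename, not a real file):

```python
import csv

# Iterating the string "data.csv" yields its 8 characters, and each
# character is parsed by csv.reader as a separate one-field row.
count = sum(1 for row in csv.reader("data.csv"))
print(count)  # 8 -- one "row" per character of "data.csv"
```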

Mithilesh Gupta
4

This works for CSV and any other text file on Unix-based OSes:

import os

numOfLines = int(os.popen('wc -l < file.csv').read()[:-1])

In case the CSV file contains a header row, you can subtract one from numOfLines above:

numOfLines = numOfLines - 1
Amir
4

I think we can improve the best answer a little bit, I'm using:

len = sum(1 for _ in reader)

Moreover, we shouldn't forget that Pythonic code doesn't always have the best performance in a project. For example: if we can do more operations on the same data set at the same time, it's better to do it all in one loop instead of making two or more Pythonic loops.
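For instance, if you need both a count and some per-row work, one pass is enough (a sketch with made-up data and processing):

```python
import csv
import io

f = io.StringIO("a,1\nb,2\nc,3\n")  # stand-in for an open CSV file
reader = csv.reader(f)

row_count = 0
total = 0
for row in reader:
    row_count += 1        # counting...
    total += int(row[1])  # ...and processing in the same pass
print(row_count, total)   # 3 6
```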

  • 1
    Certainly **a** fastest solution. I'd recommend renaming the `len` variable as it's overwriting the built-in function. – S3DEV Jan 21 '21 at 14:17
3
numline = len(file_read.readlines())
Alex Troush
  • 2
    `file_read` is apparently a `csv.reader()` object, so it does not *have* a `readlines()` method. `.readlines()` has to create a potentially large list, which you then discard again. – Martijn Pieters Apr 19 '13 at 15:54
  • 1
    When I wrote this answer, the question didn't mention that file_read was a csv reader object. – Alex Troush Apr 19 '13 at 16:09
3

You can also use a classic for loop:

import pandas as pd
df = pd.read_csv('your_file.csv')

count = 0
for i in df['a_column']:
    count = count + 1

print(count)
2
import csv

count = 0
with open('filename.csv', newline='') as count_file:
    csv_reader = csv.reader(count_file)
    for row in csv_reader:
        count += 1

print(count)
akshaynagpal
2

Use list() to get a more workable object.

You can then count, skip, and mutate to your heart's desire. Note that list() consumes the reader, so build the list once and reuse it:

rows = list(fileObject)  # all rows as a list

len(rows)  # number of rows

rows[10:]  # skip the first 10 rows
Sean
2
import pandas as pd
data = pd.read_csv('data.csv') 
totalInstances=len(data)
Sadman Sakib
1

You might want to try something as simple as the following on the command line:

sed -n '$=' filename

or

wc -l filename
kevin
0

If you have to parse the CSV (e.g., because of line breaks in the fields or commented-out lines) but the CSV is too large to fit into memory all at once, you can parse it piece by piece:

import csv
import sys
import pandas as pd

csv.field_size_limit(sys.maxsize)  # raise the csv module's field size limit

cnt = 0
for chunk in pd.read_csv(filepath, chunksize=10**6):
    cnt += len(chunk)
print(cnt)
user824276
0

I think mine will be the simplest approach here:

import csv
with open(filename, 'r') as file:  # 'with' also closes the file for us
    csvfile = csv.reader(file)
    print("row", len(list(csvfile)))
  • This doesn't work if you do `len(list(csvfile))` followed by `"for index, row in enumerate(csvfile):"`, the `enumerate()` doesn't return any entries. – Samuel Mar 01 '23 at 20:26
0

With the pyarrow lib, this is almost 6 times faster than the method dixhom suggested.

Used: csv with 3,921,865 rows and 927MB file size

Standard

sum(1 for _ in open(file_path))
# result: 3.57 s ± 90.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With pyarrow

import pyarrow.csv as csv

sum([len(chunk) for chunk in csv.open_csv(file_path)])
# result: 854 ms ± 4.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
the_RR
0

filename = 'data.csv'

row_count = sum(1 for line in open(filename))

# count no of lines 
print("Number of records : - ",row_count)

The result was : Number of records : - 163210690

Nagmat
-1

If you are working on a Unix system, the fastest method is the following shell command:

cat FILE_NAME.CSV | wc -l

From Jupyter Notebook or iPython, you can use it with a !:

! cat FILE_NAME.CSV | wc -l
Abramodj
-2

try

data = pd.read_csv("data.csv")
data.shape

and in the output you can see something like (aa, bb), where aa is the number of rows and bb the number of columns.

Ruben Romo
  • Just stumbling across stuff, seems this shape comment isn't so bad and actually comparatively very fast: https://stackoverflow.com/questions/15943769/how-do-i-get-the-row-count-of-a-pandas-dataframe – dedricF Mar 06 '20 at 04:19
  • 2
    Oh but you'll want to do a ```data.shape[0]``` – dedricF Mar 06 '20 at 04:20
  • But is it comparatively fast compared to @martijnpieters's answer, which uses a standard file handle/iterator, and doesn't require installing and importing the pandas library? – dancow Aug 19 '20 at 01:29