
I have csv files like below:

file1

A B
1 2
3 4

file2

A B
1 2

file3

A B
1 2
3 4
5 6

I would like to count the rows in all the csv files.

I tried

f = pd.read_csv(file1)
f.shape

But when I have a lot of csv files, it takes too much time.

I would like to get a result like the one below:

      rows
file1  2
file2  1
file3  3

How can I get this result?

Heisenberg

5 Answers


You can create a dict of the lengths of all the files and then pass it to Series; for a DataFrame, add to_frame:

import glob
import pandas as pd

files = glob.glob('files/*.csv')

d = {f: sum(1 for line in open(f)) for f in files}  # counts every line, including the header row

print(pd.Series(d))

print(pd.Series(d).rename('rows').rename_axis('filename').reset_index())
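
The to_frame variant mentioned above looks like this (a minimal sketch using the same d):

df = pd.Series(d).rename_axis('filename').to_frame('rows')
print(df)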

open alone does not guarantee the file is closed properly, so here is another solution:

def file_len(fname):
    with open(fname) as f:
        i = -1  # stays -1 for an empty file, where the loop body never runs
        for i, l in enumerate(f):
            pass
    return i + 1

d = {f: file_len(f) for f in files}
jezrael

On *nix systems, and if you can do it outside of Python:

wc -l *.csv

Should do the trick.
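
If you want to stay inside Python, the same command can be wrapped with subprocess, along the lines of the comment below (a sketch, assuming the filenames contain no spaces):

import glob
import subprocess
import pandas as pd

files = glob.glob('files/*.csv')
# wc -l prints "<count> <filename>"; the first field is the line count
d = {f: int(subprocess.getoutput('wc -l ' + f).split()[0]) for f in files}
print(pd.Series(d))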

efajardo
  • `subprocess.getoutput("wc -l " + fileName).split()[0]` is about three times faster than `sed -n '$='`, BUT ... it doesn't count the last line in the file if that line doesn't end with LF (line feed) ... – Claudio Apr 10 '17 at 23:55
  • Do you know how to extract the last char from a file also that FAST, so that it will be possible to get the correct line count from wc -l by +1 if the last char is not LF and +0 if it is? – Claudio Apr 11 '17 at 00:04
  • The POSIX definition of a line is "A sequence of zero or more non-<newline> characters plus a terminating <newline> character." I don't have an immediate idea on how to efficiently treat files that don't terminate in a newline... – efajardo Apr 11 '17 at 12:36
  • The `md5deep` tool I am using creates non-POSIX files whose significant last line is not terminated by a newline, so skipping the last line from consideration isn't an option - in such cases it is necessary to also check for EOF. Anyway, do all the methods that read lines from a file also return the last, non-LF-terminated one, not sticking to the POSIX definition? Hmmm ... This would not be the first time it is necessary to set aside the intuitive understanding of things in order to 'agree' with what is ... – Claudio Apr 11 '17 at 12:49
  • If a csv file has a field that allows multiple lines in it, this would give a false result. – hatirlatici Jun 11 '21 at 14:57
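
Regarding the question in the comments about checking the last character quickly: one way is to seek to the end of the file and read only the final byte (a sketch; ends_with_newline is a hypothetical helper, not from any of the answers):

import os

def ends_with_newline(fname):
    with open(fname, 'rb') as f:
        # an empty file has no final byte to check
        if f.seek(0, os.SEEK_END) == 0:
            return True
        f.seek(-1, os.SEEK_END)
        return f.read(1) == b'\n'

The wc -l result can then be corrected by +1 whenever ends_with_newline returns False.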

For the sake of completeness, and as a kind of summary of what was said about speed and proper opening/closing of files, here is a solution that works fast and doesn't need much fancy code. It is limited to *nix systems(?), but I think a similar technique can be used on other systems too.

The code below runs a tiny bit faster than rawincount() and also counts a last line that doesn't end with '\n' (a problem rawincount() has):

import glob
import subprocess
import pandas

files = glob.glob('files/*.csv')
# sed -n '$=' prints the number of the last line of the file
d = {f: subprocess.getoutput("sed -n '$=' " + f) for f in files}
print(pandas.Series(d))
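
Note that getoutput returns each count as a string; if integer values are needed, a small conversion helps (a sketch, assuming none of the files is empty):

print(pandas.Series(d).astype(int))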

P.S. Here are some timings I ran on a set of large text files (39 files with a total size of 3.7 GByte, Linux Mint 18.1, Python 3.6). The timing of the proposed wc -l *.csv method is fascinating:

    Results of TIMING functions for getting number of lines in a file:
    -----------------------------------------------------------------
            getNoOfLinesInFileUsing_bash_wc :  1.04  !!! doesn't count last non empty line
          getNoOfLinesInFileUsing_bash_grep :  1.59
  getNoOfLinesInFileUsing_mmapWhileReadline :  2.75
           getNoOfLinesInFileUsing_bash_sed :  3.42
 getNoOfLinesInFileUsing_bytearrayCountLF_B :  3.90  !!! doesn't count last non empty line
          getNoOfLinesInFileUsing_enumerate :  4.37
      getNoOfLinesInFileUsing_forLineInFile :  4.49
  getNoOfLinesInFileUsing_sum1ForLineInFile :  4.82      
 getNoOfLinesInFileUsing_bytearrayCountLF_A :  5.30  !!! doesn't count last non empty line
     getNoOfLinesInFileUsing_lenListFileObj :  6.02
           getNoOfLinesInFileUsing_bash_awk :  8.61
Claudio

Try this; it adds an entry for each file name with its number of rows, and the columns have appropriate labels:

import os
import pandas as pd

df = pd.DataFrame(columns=('file_name', 'rows'))
# read only .csv files; len(df.index) counts data rows, excluding the header
for index, i in enumerate(f for f in os.listdir('.') if f.endswith('.csv')):
    df.loc[index] = [i, len(pd.read_csv(i).index)]
Rahul K P

The solutions provided so far are not the quickest when working with very large CSVs. Also, using open() in a comprehension does not guarantee the file is closed properly, as it is when using with (see this question). So, combining that with the insights from this question for speed:

from itertools import takewhile, repeat

def rawincount(filename):
    with open(filename, 'rb') as f:
        # read raw 1 MB chunks and count newline bytes; a final line
        # without a trailing '\n' is not counted
        bufgen = takewhile(lambda x: x, (f.raw.read(1024 * 1024) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen)

And applying the solution provided by @jezrael:

import glob
import pandas as pd

files = glob.glob('files/*.csv')
d = {f: rawincount(f) for f in files}
df = pd.Series(d).to_frame('rows')
Roelant
  • There are two issues with it: #1: it does not return the number of lines in the file (counting '\n's doesn't do that). #2: Speed is not really an issue. In my tests: 4.7 sec. for the method used in the question, 3.8 sec. for your function and 4.3 sec. when using `for i, l in enumerate(f): pass`. Anyway, I am glad you provided this here. By the way: the `mapcount` method (in the link you provided) gives 2.7 sec. on my box (Python 3.6, Linux Mint 18.1) – Claudio Apr 10 '17 at 14:52
  • :) Thanks for checking. In Linux I wonder if calling subprocess with `wc -l` wouldn't be even quicker. – Roelant Apr 11 '17 at 07:10