1

I've been dealing with a lot of 4-5 GB CSV files over the last few days at work, and so that I know how far they've progressed through reading/writing, I wrote a couple of wrapper functions on top of pandas' methods. It all seems to work very well, with a bit of overhead, but the convenience outweighs most issues.

At the same time, when reading a CSV, the progress bar can only display the correct percentage if I know the number of rows in advance, since that determines how many chunks there will be. The simplest solution I came up with is to load just the 0th column of the CSV before loading the rest and get its size. But this does take a bit of time when your files are millions of rows long.

Also, reading a single column takes an unreasonably high proportion of the total time: reading a single column of a CSV with 125 columns and a few million rows took ~24 seconds, while reading the whole file takes 63 seconds.

This is the function I've been using to read CSVs:

import sys

import pandas as pd
from tqdm import tqdm


def read_csv_with_progressbar(filename: str,
                              chunksize: int = 50000) -> pd.DataFrame:
    # Read only the first column to learn the row count up front.
    length = pd.read_csv(filename, usecols=[0])
    length = length.values.shape[0]

    # Ceiling division so the bar reaches 100% even with a partial last chunk.
    total = -(-length // chunksize)
    chunk_list = []

    chunks = pd.read_csv(filename, chunksize=chunksize)
    with tqdm(total=total, file=sys.stdout) as pbar:
        for chunk in chunks:
            chunk_list.append(chunk)
            pbar.set_description('Reading source csv file')
            pbar.update(1)

        df = pd.concat(chunk_list, axis=0)

    return df

Is there any way to get the number of rows in a CSV faster than using my flawed method?

NotAName
  • have you checked this post? https://stackoverflow.com/questions/845058/how-to-get-line-count-of-a-large-file-cheaply-in-python – Mehdi Golzadeh Nov 08 '20 at 23:46
  • Just tested the proposed solutions. Loading a single column with pandas is faster than counting rows with `file.open()`, for example, 24 seconds with pandas vs 36 with the alternative method. – NotAName Nov 08 '20 at 23:59
  • The honest answer is that you should switch to a better, more structured file format than CSV. No matter what, any algorithm must analyze the full 5 GB to accurately answer your query. – orlp Nov 09 '20 at 00:12
  • @orlp, unfortunately, that's not possible. On the other hand, a lot of the files I work with have roughly the same format, with 125 columns and most of the data being floats. I guess the quick and dirty solution would be to estimate the number of rows based on the file size (a rough sketch of that idea follows below). – NotAName Nov 09 '20 at 00:51
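
As a rough sketch of that file-size estimate (the helper name and sample size are my own, not from the thread): measure the average byte length of the first few thousand rows and divide the remaining file size by it. This assumes the file has at least that many rows and that row lengths are fairly uniform.

import os

def estimate_row_count(filename: str, sample_rows: int = 10000) -> int:
    # Hypothetical helper: extrapolate the row count from the average byte
    # length of the first `sample_rows` data rows. Only an estimate.
    file_size = os.path.getsize(filename)
    with open(filename, 'rb') as f:
        header_len = len(f.readline())
        sample_len = sum(len(f.readline()) for _ in range(sample_rows))
    avg_row_len = sample_len / sample_rows
    return round((file_size - header_len) / avg_row_len)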

2 Answers

3

Assuming there are no quoted strings (with newlines in them) or other shenanigans in your CSV file, an accurate (but hacky) solution is to not even parse the file but simply count the number of newlines in it:

import numpy as np

chunk = 1024*1024   # Process 1 MB at a time.
f = np.memmap("test.csv")   # Map the file's raw bytes (dtype defaults to uint8).
num_newlines = sum(np.sum(f[i:i+chunk] == ord('\n'))
                   for i in range(0, len(f), chunk))
del f
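
As a usage note (my addition, assuming one header row and a trailing newline), the count plugs straight into the progress-bar total from the question:

num_rows = num_newlines - 1        # subtract the header line
total = -(-num_rows // 50000)      # ceiling division for the tqdm total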
orlp
  • Thanks! That's a really neat solution! Twice as fast as with pandas and three times faster than with `file.open()` on my files. – NotAName Nov 09 '20 at 01:37
  • @pavel If you want even more speed at the cost of accuracy, you can sample, say, 50 MB worth of data semi-randomly across the file and linearly extrapolate that to the whole file (see the sketch below). – orlp Nov 09 '20 at 01:43
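
A minimal sketch of that sampling idea (the function name, block size, and sample count are illustrative, not from the comment): count newlines in a few blocks spread evenly across the file and scale the count by the ratio of file size to bytes sampled.

import os
import numpy as np

def estimate_newlines(path: str, samples: int = 10, block: int = 5 * 1024 * 1024) -> int:
    # Count newlines in `samples` blocks spread across the file and
    # extrapolate linearly to the full file size. An estimate, not exact.
    size = os.path.getsize(path)
    counted = sampled = 0
    with open(path, 'rb') as f:
        for i in range(samples):
            f.seek(i * max(size - block, 0) // max(samples - 1, 1))
            buf = np.frombuffer(f.read(block), dtype=np.uint8)
            counted += int((buf == ord('\n')).sum())
            sampled += len(buf)
    return round(counted * size / sampled) if sampled else 0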
3

I was dealing with the same problem, but the solutions proposed didn't work for me. Dealing with CSV files over 20 GB in size, the processing time was still too large for me. Consulting with a co-worker, I found an almost instant solution using subprocess. It goes like:

import subprocess

num_lines = int(subprocess.check_output("wc -l test.csv", shell=True).split()[0]) - 1

`subprocess.check_output` returns the output of `wc -l`, which is the number of lines (including the header) followed by the path to the file; `split()[0]` extracts the line count as a string, `int` converts it to an integer, and finally we subtract 1 to account for the header.
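
For reference, the same count can be obtained without `shell=True` by passing the arguments as a list (a minor variant of the answer above, not from the original post):

import subprocess

out = subprocess.check_output(["wc", "-l", "test.csv"])
num_lines = int(out.split()[0]) - 1   # subtract the header line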

petezurich