I've been dealing with a lot of 4-5 GB CSV files at work over the last few days, and to keep track of how far reading/writing has progressed I wrote a couple of wrapper functions on top of pandas' methods. It all seems to work well: there is a bit of overhead, but the convenience outweighs most issues.
However, when reading a CSV, for the progress bar to display the correct percentage I need to know the number of rows in advance, since that determines how many chunks there will be. The simplest solution I came up with is to load just the 0th column of the CSV before loading the rest and take its length. But this takes a noticeable amount of time when the file has millions of rows.
Also, reading a single column takes an unreasonably high proportion of the total time: reading one column of a CSV with 125 columns and a few million rows took ~24 seconds, while reading the whole file takes 63 seconds.
This is the function I've been using to read CSVs:
import sys

import pandas as pd
from tqdm import tqdm


def read_csv_with_progressbar(filename: str,
                              chunksize: int = 50000) -> pd.DataFrame:
    # Read only the first column to find out how many rows (and therefore chunks) to expect
    length = len(pd.read_csv(filename, usecols=[0]))
    total = -(-length // chunksize)  # ceiling division so the final partial chunk is counted

    chunk_list = []
    chunks = pd.read_csv(filename, chunksize=chunksize)
    with tqdm(total=total, file=sys.stdout) as pbar:
        pbar.set_description('Reading source csv file')
        for chunk in chunks:
            chunk_list.append(chunk)
            pbar.update(1)

    return pd.concat(chunk_list, axis=0)
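I call it like this (the path and chunk size are just placeholders):

    # Hypothetical example call; the filename here is only illustrative
    df = read_csv_with_progressbar('some_big_export.csv', chunksize=100000)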
Is there any way to get the number of rows in a CSV faster than with my flawed method?
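For context, the fastest alternative I've thought of so far is counting raw newlines in binary blocks, skipping pandas entirely, roughly along the lines of the sketch below. I haven't tested it properly, and it counts the header line and would miscount if a quoted field ever contained an embedded newline, so I'm not sure it's safe for arbitrary CSVs.

    # Rough sketch of a raw newline count (just an idea, not my current approach):
    # read the file in binary blocks and count b'\n' occurrences.
    # Caveats: counts the header line, assumes a trailing newline at EOF,
    # and breaks if quoted fields contain newlines.
    def count_rows_raw(filename: str, blocksize: int = 1 << 20) -> int:
        rows = 0
        with open(filename, 'rb') as f:
            while True:
                block = f.read(blocksize)
                if not block:
                    break
                rows += block.count(b'\n')
        return rows - 1  # subtract the header row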