
I am using Python 3 and pandas (pd.read_csv) to read the files. There are no headers and the separator is ' |, | '. Also, the files do not have a .csv extension, and the operating system is CentOS.

There are 30,000 files in a folder with a total size of 10 GB. Each file has about 50-100 rows and 1,500 columns. In a for loop I read each file (using read_csv), do some operations on it, and store the resulting DataFrame in a list, so at the end of the process I have a list of DataFrames. I was wondering how to speed up the process. Only 10 of the columns are relevant, so I pass the usecols argument to filter them. The cell values are strings, so I convert them to float with df.astype(float).

Note that I have to do my operations on each of the files separately and only then append them all together.
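
For reference, the loop currently looks roughly like the sketch below; the folder path, the escaped separator regex, and the column indices are placeholders, not the real values:

import glob
import pandas as pd

files = glob.glob("/path/to/folder/*")   # placeholder path for the ~30,000 files
use_cols = [0, 1, 2]                     # placeholder indices for the ~10 relevant columns

frames = []
for f in files:
    # header=None because there is no header row; a multi-character separator
    # is treated as a regex and needs the python engine
    df = pd.read_csv(f, sep=r" \|, \| ", header=None,
                     usecols=use_cols, engine="python")
    df = df.astype(float)                # the cells are read as strings
    # ... per-file operations go here ...
    frames.append(df)                    # list of DataFrames, appended together later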

I tried to use modin, but it made the process several times slower. Using modin also caused the index in each DataFrame to be repeated multiple times, which didn't happen with normal pandas.

Adienl
  • How slow is it? How often do you need to repeat that procedure? Have you written, or do you have access to, the code that creates the files in the first place? What kind of operations do you perform? – Jan Christoph Terasa Sep 22 '20 at 16:58
  • @JanChristophTerasa It takes about 15 minutes with normal pandas and over an hour with modin. The operations include casting all values to float and making a new column using pct change. That's it. I have not written the files nor do I have access to the code that creates it. – Adienl Sep 22 '20 at 17:19
  • pct change from what to what? It could be faster to just write the loop by hand using normal Python operations (reading CSV is not faster in pandas than in pure Python; only operating on the arrays is). It's hard to tell without files and code. In general, trivial stuff on plaintext files can be done much faster using `awk`. – Jan Christoph Terasa Sep 22 '20 at 17:23
  • These are just the initial steps of cleaning the data before I make models. So I do need the data in pandas dataframe. I only want to know if I can read the data faster and store it in pandas Dataframe object. – Adienl Sep 22 '20 at 17:25
  • Have you tried using dask? – João Areias Sep 22 '20 at 17:40
  • @JoãoAreias No do you think it's worth it in this case? – Adienl Sep 22 '20 at 17:59
  • Could be, it will parallelize your IO and all computations, checkout dask delayed, it's pretty handy – João Areias Sep 22 '20 at 18:00
  • @JoãoAreias The computations are on like 50-100 rows in each instance of the loop. It is the reading of the file that takes the longest. Will dask help with that? – Adienl Sep 22 '20 at 18:02
  • It should. At least the computations will not be blocked by the IO, even if the IO is taking the longest. I'll post an answer with what I would do, worse comes to worst it should take about the same time – João Areias Sep 22 '20 at 18:10

1 Answer


One way of doing this is with Dask delayed. The problem with plain Python and pandas is that everything runs sequentially, which can really slow down your application, especially with a mix of IO-intensive and CPU-intensive tasks. With Dask you can parallelize the reading and the processing of your data; one way I would go about it is the following.

from dask.delayed import delayed
import dask
import pandas as pd

file_names = []  # List (or generator) of file paths, create your own here


@delayed
def read_data(file_name):
    # Pass sep/header/usecols here to match your files
    return pd.read_csv(file_name)


@delayed
def process(df):
    # Do the stuff here
    return df


# Build the lazy task graph, then run all reads and processing in parallel
data = [process(read_data(file_name)) for file_name in file_names]
data = dask.compute(*data)  # tuple of processed DataFrames
print(data)
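
For the workload described in the question and comments (casting string cells to float and adding a pct-change column), process might be filled in roughly as sketched below; the column label used for pct_change and the final pd.concat are assumptions, and read_data would need the sep/header/usecols arguments that match the files. If the per-file computation turns out to be CPU-bound, dask.compute also accepts scheduler="processes" to sidestep the GIL; the default threaded scheduler is usually fine when reading the files dominates.

@delayed
def process(df):
    df = df.astype(float)               # the cells arrive as strings
    df["pct"] = df[0].pct_change()      # placeholder: pct change of the first column
    return df


results = dask.compute(*[process(read_data(f)) for f in file_names])
combined = pd.concat(results, ignore_index=True)  # append everything at the end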
João Areias