How to speed up importing multiple CSV files, cleaning the data, and concatenating them in Python?

Question

I have multiple CSV files, and in each one I have to remove two rows because they contain only NaNs. I want to load the first file and clean it, then load the second one, clean it, and concatenate it with the first, and so on.

This is the code:

df_result = None
for file in tqdm(files):
    df = pd.read_csv(file)
    df = clean_csv(df)
    df = df.to_numpy()
    try:
        df_result = pd.concat([df_result,df],axis = 'index',ignore_index=True)
    except:
        df_result = df

where `clean_csv` is:

def clean_csv(df):
    df_1 = df.drop(labels = [0,1])
    df_1 = df_1.drop('Start Time', axis = 1)
        
    return df_1

Did you consider using `dask` or multithreading? You could also save the cleaned results to a separate folder. — rpanai, Oct 27 '22 at 00:33
Does this answer your question? [Is there a way to speed up handling large CSVs and dataframes in python?](https://stackoverflow.com/questions/69153017/is-there-a-way-to-speed-up-handling-large-csvs-and-dataframes-in-python) — Michael Delgado, Oct 27 '22 at 00:45
You can drop the unwanted columns on read using the `usecols` argument. Also, definitely add the dataframes to a list and then concat once, as suggested in Always Sunny's [answer](https://stackoverflow.com/a/74215553/3888719). That way you only need to allocate the larger array once instead of over and over again as you add more dataframes. See also the tips in the answer I linked to: hard-coding data types and enforcing the 'c' engine can really boost performance and catch undesirable type casting on bad inputs. — Michael Delgado, Oct 27 '22 at 00:47
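A minimal sketch of the "drop on read" idea from the comment above. It assumes the two all-NaN rows are always the first two data rows (file lines 1 and 2, right after the header) and that the unwanted column is named `Start Time`, as in the question. The helper name `read_clean` and the sample data are made up for illustration.

```python
import io

import pandas as pd


def read_clean(source):
    # Hypothetical helper: do the question's cleaning while parsing.
    return pd.read_csv(
        source,
        skiprows=[1, 2],  # file lines 1 and 2, i.e. the first two data rows
        usecols=lambda name: name != "Start Time",  # keep every other column
        engine="c",
    )


# Made-up sample with the same shape as the files described in the question.
sample = io.StringIO("Start Time,a,b\n,,\n,,\nt0,1,2\nt1,3,4\n")
df = read_clean(sample)
print(df)
```

Because the NaN rows are never parsed, the remaining columns can keep an integer dtype instead of being upcast to float. If the NaN rows are not always in the same position, this shortcut does not apply and `clean_csv` is still needed.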

score 1 · Answer 1 · answered Oct 27 '22 at 00:32

Another option is to append each DataFrame to a list and concatenate once after the for loop, like this. You are currently concatenating on every iteration, which I suspect is what slows your script down.

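import pandas as pd
from tqdm import tqdm

# `files` and `clean_csv` are the ones defined in the question.
df_result = []
for file in tqdm(files):
    df = pd.read_csv(file, index_col=None, header=0)
    # Keep each result as a DataFrame: pd.concat cannot combine NumPy arrays.
    df_result.append(clean_csv(df))

# Concatenate once, after the loop.
df_final = pd.concat(df_result, axis=0, ignore_index=True)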

A l w a y s S u n n y

I prefer using a list in these cases too. But given that there is still a loop, and assuming all the data fits in memory, we can think about multithreading/multiprocessing. – rpanai Oct 27 '22 at 00:35

@rpanai Agreed on multithreading. We can also shrink the CSV while reading. – A l w a y s S u n n y Oct 27 '22 at 00:38
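A rough sketch of the multithreading idea discussed in these comments, using the standard library's `ThreadPoolExecutor`. The names `load_one` and `load_all` and the `max_workers` value are assumptions for illustration; `clean_csv` repeats the cleaning from the question. Whether threads actually help depends on the files and the disk, so it is worth timing against the plain loop.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def clean_csv(df):
    # Same cleaning as in the question: drop the first two rows and one column.
    return df.drop(labels=[0, 1]).drop("Start Time", axis=1)


def load_one(path):
    # Hypothetical helper: read and clean a single file.
    return clean_csv(pd.read_csv(path))


def load_all(paths, max_workers=4):
    # Read the files concurrently; map() yields results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frames = list(pool.map(load_one, paths))
    # Concatenate once at the end instead of on every iteration.
    return pd.concat(frames, axis="index", ignore_index=True)
```

Usage would be `df_final = load_all(files)`, with `files` being the list of paths from the question.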

score 0 · Answer 2 · answered Oct 27 '22 at 01:12

Concatenation becomes slower the longer the accumulated result becomes, because each concatenation has to handle everything gathered so far. So a script that adds on small bits of data one iteration at a time quickly starts to slow down: the first pass processes 1 line of data, the second 2, the third 3, and so on, with each successive pass handling more total data.

A solution I've used in the past is to build smaller chunks of data first and concatenate the chunks when done, which minimizes the number of passes over long data. The ideal chunk size is about the square root of the total size of your data set. For example, if you have 10,000 lines of data to process, you can build packets of 100 lines each and then concatenate those 100 packets onto the final result.

If you don't break up the data, you process a total of roughly 50,000,000 lines (1 + 2 + ... + 10,000) because of the repeated passes over the same data. By breaking the data into packets like this, you process only about 1,000,000 lines.
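The figures above can be checked with a short calculation. This is only a sketch of the cost model described in this answer (each concatenation handles every line accumulated so far); the function names are made up, and the total is assumed to be an exact multiple of the chunk size.

```python
def lines_processed_naive(n):
    # Appending one line at a time: pass i handles i lines.
    return sum(range(1, n + 1))


def lines_processed_chunked(n, chunk):
    # Assumes n is an exact multiple of chunk.
    n_chunks = n // chunk
    # Building each chunk one line at a time.
    within = n_chunks * sum(range(1, chunk + 1))
    # Appending each finished chunk onto the growing final result.
    across = sum(chunk * i for i in range(1, n_chunks + 1))
    return within + across


print(lines_processed_naive(10_000))         # 50005000
print(lines_processed_chunked(10_000, 100))  # 1010000
```

Under this model the chunked cost for chunk size k is about n·k/2 + n²/(2k), which is smallest at k = √n, matching the square-root rule stated above.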
