I have 600 CSV files, and each file contains around 1500 rows of data. I have to run a function on every row of data. I have defined the function:
def query_prepare(data):
    """function goes here"""
    """here the input data is a list holding a single row of the dataframe"""
The above function performs operations like strip() and replace() based on conditions. It takes every single row of data as a list:
data = ['apple$*7','orange ','bananna','-']
This is what my initial dataframe looks like:
a b c d
0 apple$*7 orange bananna -
1 apple()*7 flower] *bananna -
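To make the question concrete, here is a hypothetical sketch of the kind of cleanup query_prepare might do; the real function body is not shown above, so the specific strip()/replace() rules here are only illustrative assumptions:

```python
def query_prepare(data):
    """Clean one row of the dataframe, passed in as a list of strings.

    Hypothetical stand-in: the real function's conditions are not shown.
    """
    cleaned = []
    for value in data:
        value = value.strip()                             # drop stray whitespace
        value = value.replace('$', '').replace('*', '')   # drop unwanted symbols
        cleaned.append(value)
    return cleaned

print(query_prepare(['apple$*7', 'orange ', 'bananna', '-']))
# → ['apple7', 'orange', 'bananna', '-']
```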
I checked the function on one row of data: processing takes around 0.04 s, so if I run it on one CSV file containing 1500 rows it takes almost 1500 * 0.04 s. I have tried some of the following methods:
# normal built-in apply function
import time

t = time.time()
a = df.apply(lambda x: query_prepare(x.to_list()), axis=1)
print('time taken', time.time() - t)
# time taken 52.519816637039185
# with swifter
import swifter

t = time.time()
a = df.swifter.allow_dask_on_strings().apply(lambda x: query_prepare(x.to_list()), axis=1)
print('time taken', time.time() - t)
# time taken 160.31028127670288
# with pandarallel
from pandarallel import pandarallel

pandarallel.initialize()
t = time.time()
a = df.parallel_apply(lambda x: query_prepare(x.to_list()), axis=1)
print('time taken', time.time() - t)
# time taken 55.000578
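For context on why all three timings are similar: a row-wise apply calls back into Python once per row, which dominates the cost regardless of the backend. If the cleanup can be expressed with pandas' vectorized .str methods, each column is processed in one pass instead. A minimal sketch, assuming the cleanup is just stripping whitespace and removing symbols (the real conditions may not vectorize this simply):

```python
import pandas as pd

df = pd.DataFrame({'a': ['apple$*7', 'apple()*7'],
                   'b': ['orange ', 'flower]'],
                   'c': ['bananna', '*bananna'],
                   'd': ['-', '-']})

# Same string operations applied column-wise instead of row-wise:
# each .str call runs over the whole column at once.
for col in df.columns:
    df[col] = df[col].str.strip().str.replace(r'[$*()\[\]]', '', regex=True)

print(df)
```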
I did everything I could inside my query_prepare function to reduce the time, so there is no way to change or modify it further. Any other suggestions?
P.S. By the way, I'm running it on Google Colab.
EDIT: If we have 1500 rows of data, could we split it into 15 chunks and then apply the function to each chunk? Can we decrease the time by a factor of 15 this way? (I'm sorry, I'm not sure whether it's possible or not; please guide me.)