I have a large data source that I am trying to enrich by creating some calculated columns. The source is close to 4 million rows, and I am pulling the data in chunks of 100,000.
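For context, the chunked pulling can be done with pandas' built-in `chunksize` support; a minimal sketch (the in-memory CSV stands in for the real 4-million-row source, and the real code would use `chunksize=100_000`):

```python
import io
import pandas as pd

# Stand-in for the real data source: a small in-memory CSV.
csv_data = io.StringIO("id,text\n1,a\n2,b\n3,c\n4,d\n5,e\n")

total_rows = 0
# read_csv with chunksize yields DataFrames of at most that many rows.
for chunk in pd.read_csv(csv_data, chunksize=2):
    total_rows += len(chunk)

print(total_rows)  # 5
```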
for y, field in enumerate(fields):
    operation_start = time.time()
    print(f"Operation {y + 1}")
    # operations[y] is applied row by row to build the new column
    chunk[field] = chunk.apply(operations[y], axis=1)
    print(f"Operation completed in {round(time.time() - operation_start, 2)} seconds")
The loop runs nine functions I have defined, each one returning a value for a new column.
Is there a more efficient way to compute all the new fields at once, instead of one by one? One important remark: a couple of the functions need values produced by other functions.
I have tried pandarallel but it makes the process even slower.
EDIT:
I managed to write a single function and use pandarallel, following the info here: Apply multiple functions to multiple groupby columns. However, it does not append the new columns.
import pandas as pd

def Operations(row):
    d = {}
    d["Operation A"] = Operation_A(row)
    d["Operation B"] = Operation_B(row, d["Operation A"])
    d["Operation C"] = Operation_C(row)
    d["Operation D"] = Operation_D(row)
    d["Operation E"] = Operation_E(row)
    d["Operation F"] = Operation_F(row)
    d["Operation G"] = Operation_G(row, d["Operation F"], d["Operation D"], d["Operation B"])
    d["Operation H"] = Operation_H(row)
    d["Operation I"] = Operation_I(row, d["Operation H"])
    return pd.Series(d)

chunk.parallel_apply(Operations, axis=1)
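A likely reason the columns do not appear is that `parallel_apply` (like plain `apply`) needs `axis=1` to work per row, and it returns a new DataFrame rather than modifying `chunk` in place, so the result has to be joined back explicitly. A minimal sketch using plain `apply` (`parallel_apply` returns its result the same way; the stand-in operations are made up for illustration):

```python
import pandas as pd

chunk = pd.DataFrame({"x": ["foo", "bar"]})

def operations(row):
    # Simplified stand-ins for the real Operation_A / Operation_B logic.
    a = row["x"].upper()
    b = a + "!"
    return pd.Series({"Operation A": a, "Operation B": b})

# axis=1 applies per row; the call returns a NEW DataFrame and does not
# mutate chunk, so the columns must be assigned or concatenated back.
new_cols = chunk.apply(operations, axis=1)
chunk = pd.concat([chunk, new_cols], axis=1)

print(list(chunk.columns))  # ['x', 'Operation A', 'Operation B']
```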
All the operations make string comparisons and return strings. I cannot provide an example because the functions behind these operations add up to more than 400 lines of code.
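Since the operations are pure string comparisons, it may be possible to drop per-row `apply` entirely in favor of vectorized string methods combined with `numpy.select`, which evaluates conditions on whole columns at once. A hedged sketch with made-up conditions (the real rules would come from the 400 lines of operation logic):

```python
import numpy as np
import pandas as pd

chunk = pd.DataFrame({"status": ["new", "closed", "open", "new"]})

# Hypothetical rules standing in for the real string comparisons.
conditions = [
    chunk["status"].str.startswith("new"),
    chunk["status"] == "closed",
]
choices = ["fresh", "done"]

# np.select picks the first matching choice per row, column-wise, which is
# typically far faster than calling a Python function for every row.
chunk["label"] = np.select(conditions, choices, default="other")

print(chunk["label"].tolist())  # ['fresh', 'done', 'other', 'fresh']
```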