While iterating over rows isn't good practice and there are usually alternatives using groupby/transform aggregations etc., if in the worst case you really need to do so, follow this answer. Also, you might not need to reimplement everything yourself: you can use libraries like Dask, which is built on top of pandas.
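For example, a minimal Dask sketch (assuming your per-row logic can be expressed as a function applied to one partition; "your_csv.csv" and transform_partition are placeholders, not anything specific to your code):

import dask.dataframe as dd

def transform_partition(pdf):
    # pdf is a regular pandas DataFrame (one partition); apply your row logic here
    return pdf

ddf = dd.read_csv("your_csv.csv")  # lazily splits the csv into partitions
result = ddf.map_partitions(transform_partition).compute()  # processes partitions in parallel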
But just to give an idea, you can use multiprocessing (Pool.map) in combination with chunking: read the csv in chunks (or build the chunks yourself as shown at the end of this answer) and map them to the pool; while processing each chunk, add new rows (or append them to a list and build a new chunk from it) and return it from the function. At the end, combine the dataframes once all workers are done.
import pandas as pd
import numpy as np
import multiprocessing

def process_chunk(df_chunk):
    for index, row in df_chunk.reset_index(drop=True).iterrows():
        # your logic for updating this chunk or building a new chunk goes here
        print(row)
        print("index is " + str(index))
    # if you updated df_chunk in place, return it; if you appended rows to a
    # list_of_rows instead, build and return a new frame with
    # pd.DataFrame(list_of_rows)
    return df_chunk

if __name__ == '__main__':
    # use all available cores, or pass the number you want as an argument;
    # for example if you have 12 cores, leave 1 or 2 for other things
    pool = multiprocessing.Pool(processes=10)
    results = pool.map(process_chunk, [c for c in pd.read_csv("your_csv.csv", chunksize=7150)])
    pool.close()
    pool.join()
    # build the final df by concatenating the processed chunks
    concatdf = pd.concat(results, axis=0, ignore_index=True)
Note: Instead of reading the csv in chunks, you can build the chunks yourself with the same logic. To calculate the chunk size you might want something like round((length of df) / (number of available cores - 2)), e.g. 100000 / 14 ≈ 7143, rounded up to 7150 rows per chunk.
results = pool.map(process_chunk,
                   [df[c:c+chunk_size] for c in range(0, len(df), chunk_size)])
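A small sketch of that chunk-size calculation (assuming df is already loaded; leaving 2 cores free is just the assumption from above, adjust to taste):

import math
import multiprocessing

# split the frame so each worker gets roughly one chunk, keeping 2 cores free
chunk_size = math.ceil(len(df) / (multiprocessing.cpu_count() - 2))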