I have a few hundred thousand CSV files, and I would like to apply the same function to all of them. Something like the following dummy function:
import os
import pandas as pd

def process_single_file(fname):
    df = pd.read_csv(fname)
    # Pandas and non-pandas processing
    df.to_csv(f"./output/{os.path.basename(fname)}")
As looping over all the files sequentially would take far too long, my question is: what is the most efficient way to schedule and parallelize this work, given that none of the tasks depend on each other? I started off trying to use Python's multiprocessing:
import glob
import multiprocessing

files = sorted(glob.glob("./input/*.csv"))

processes = []
for fname in files:
    # One process per file, all started at once
    p = multiprocessing.Process(target=process_single_file, args=(fname,))
    processes.append(p)
    p.start()

for process in processes:
    process.join()
My computer, however, does not seem to like this approach: it quickly overloads all CPUs, leading to slow-downs and crashes. Is there a more efficient way to limit the CPU load and schedule these tasks, for example using Dask, a Bash script, or a different approach in Python, such as the bounded worker pool sketched below?
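The one concrete alternative I have come up with myself is a fixed-size worker pool, so that only a limited number of files are processed at once. Below is a rough, untested sketch; the worker count and chunksize are just guesses on my part:

import glob
import multiprocessing

if __name__ == "__main__":
    files = sorted(glob.glob("./input/*.csv"))

    # Cap the number of worker processes instead of starting one per file;
    # chunksize hands filenames to workers in batches to reduce scheduling overhead.
    with multiprocessing.Pool(processes=max(1, multiprocessing.cpu_count() - 1)) as pool:
        pool.map(process_single_file, files, chunksize=100)

Would something like this be a sensible direction at this scale? Thanks in advance.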