I apologise if this has already been asked, but I've read a heap of documentation and am still not sure how to do what I would like to do.
I would like to run a Python script over multiple cores simultaneously.
I have 1800 .h5 files in a directory, with names 'snaphots_s1.h5', 'snapshots_s2.h5' etc, each about 30MB in size. This Python script:
- Reads in the h5py files one at a time from the directory.
- Extracts and manipulates the data in the h5py file.
- Creates plots of the extracted data.
Once this is done, the script then reads in the next h5py file from the directory and follows the same procedure. Hence, none of the processors need to communicate to any other whilst doing this work.
The script is as follows:
import h5py
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import cmocean
import os
from mpi4py import MPI
de.logging_setup.rootlogger.setLevel('ERROR')
# Plot writes
count = 1
for filename in os.listdir('directory'): ### [PERF] Applied to ~ 1800 .h5 files
with h5py.File('directory/{}'.format(filename),'r') as file:
### Manipulate 'filename' data. ### [PERF] Each fileI ~ 0.03 TB in size
...
### Plot 'filename' data. ### [PERF] Some fileO is output here
...
count = count + 1
Ideally, I would like to use mpi4py to do this (for various reasons), though I am open to other options such as multiprocessing.Pool (which I couldn't actually get to work. I tried following the approach outlined here).
So, my question is: What commands do I need to put in the script to parallelise it using mpi4py? Or, if this option isn't possible, how else could I parallelise the script?