5

I apologise if this has already been asked, but I've read a heap of documentation and am still not sure how to do what I would like to do.

I would like to run a Python script over multiple cores simultaneously.

I have 1800 .h5 files in a directory, with names 'snapshots_s1.h5', 'snapshots_s2.h5', etc., each about 30MB in size. This Python script:

  1. Reads in the h5py files one at a time from the directory.
  2. Extracts and manipulates the data in the h5py file.
  3. Creates plots of the extracted data.

Once this is done, the script then reads in the next h5py file from the directory and follows the same procedure. Hence, none of the processes needs to communicate with any other whilst doing this work.

The script is as follows:

import h5py
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import cmocean
import os  

from mpi4py import MPI

# NOTE: 'de' is not imported above; this line looks like a leftover from
# another script (e.g. a Dedalus setup) and would raise a NameError as written.
# de.logging_setup.rootlogger.setLevel('ERROR')

# Plot writes

count = 1
for filename in os.listdir('directory'):  ### [PERF] Applied to ~ 1800 .h5 files
    with h5py.File('directory/{}'.format(filename), 'r') as file:

        ### Manipulate 'filename' data.  ### [PERF] Each input file ~ 0.03 TB in size
        ...

        ### Plot 'filename' data.        ### [PERF] Some file output is written here
        ...

    count = count + 1

Ideally, I would like to use mpi4py to do this (for various reasons), though I am open to other options such as multiprocessing.Pool (which I couldn't actually get to work; I tried following the approach outlined here).

So, my question is: What commands do I need to put in the script to parallelise it using mpi4py? Or, if this option isn't possible, how else could I parallelise the script?
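
To make it concrete, this is roughly the shape I imagine the mpi4py version taking, with each rank simply handling every n-th file since the files are independent (only a sketch; the actual processing is a placeholder):

from mpi4py import MPI
import os
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # index of this process, 0 .. size-1
size = comm.Get_size()   # total number of MPI processes

files = sorted(os.listdir('directory'))

# Each rank processes every 'size'-th file, so the ranks never need to
# communicate with each other.
for filename in files[rank::size]:
    with h5py.File(os.path.join('directory', filename), 'r') as f:
        pass  # extract, manipulate and plot the data for this file

This would then be launched with something like mpirun -n 4 python script.py.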

Matthew Cassell
  • 219
  • 1
  • 6
  • 15
  • Is there something specific in mpi4py that would rule out multiprocessing.Pool? I am not familiar with h5py or mpi4py, but very familiar with multiprocessing, and to me this seems like a task you would want to split to a pool of workers with just the filename as a parameter. – Hannu Oct 13 '17 at 11:58
  • @Hannu I'm not sure if it will work with the module I'm using. However, if you could explain the multiprocessing module I'll try it out. – Matthew Cassell Oct 13 '17 at 12:26
  • **Want HPC** fabric for this? **[1]:** How many CPU-days does the workpackage processing last end-to-end if being run in a pure [SERIAL]-scheduling? **[2]:** How many files to process x how many [TB] per file does this `< Manipulate 'filename' data > + < Plot 'filename' data >` consist of? **[3]:** How many man-days of human effort do you plan to spend in total on prototyping and fine-tuning the HPC-part before achieving the approval for the HPC-fabric to run your workpackage? – user3666197 Oct 13 '17 at 13:58
  • @user3666197 I don't know what you mean by HPC fabric. The processing takes about 6 hours when applied to 1800 .h5 files one after another. Each file is about 0.03TB in size. I don't plan to spend very long on getting this working at all. I will probably just learn the multiprocessing module and use that if it works. – Matthew Cassell Oct 13 '17 at 14:04
  • Are you sure? Given the numbers stated above, 0.03E+12 [B] per file within 6 x 60 x 60 ~ 21,600 [sec] / 1800 [1] files makes about ~12 [sec] per file processed. Given a need to just load a file within those 12 [sec], one would need a zero-latency reading channel with more than 2.33 [GByte/s] without any computation on the data and no output at the end. There is something else happening. The HPC-fabric is a vertical-hierarchy of an HPC-infrastructure { HPC-nodes + control-node(s) + HPC-filesystem + HPC-data-distribution-connectivity + HPC-control-plane-connectivity + HPC-workpackage flow } – user3666197 Oct 13 '17 at 15:19
  • @user3666197 I just realised, the 6 hours corresponded to when I had 3600 files, half of which I deleted now because I had too much data. So it actually takes about half that time per file. My apologies. – Matthew Cassell Oct 13 '17 at 22:19

3 Answers

2

You should go with multiprocessing, and Javier's example should work, but I would like to break it down so you can understand the steps too.

In general, when working with pools you create a pool of processes that idle until you pass them some work. The ideal way to do this is to create a function that each process will execute separately.

def worker(fn):
    with h5py.File(fn, 'r') as f:
        # process data..
        return result

It's that simple. Each process will run this function and return the result to the parent process.

Now that you have the worker function that does the work, let's create the input data for it. It takes a filename, so we need a list of all the files:

full_fns = [os.path.join('directory', filename) for filename in 
            os.listdir('directory')]

Next, initialize the process pool.

import multiprocessing as mp

pool = mp.Pool(4)  # pass the number of processes you want
results = pool.map(worker, full_fns)

# pool.map() takes a worker function and an iterable of inputs, and it
# blocks until every worker has finished, so you never operate on
# partial data.

pool.close()
pool.join()

Now you can access your data through results.

for r in results:
    print(r)
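
Putting the pieces together, a minimal end-to-end sketch could look like this (the body of worker is just a placeholder for your own processing; the if __name__ == '__main__': guard is needed on platforms where multiprocessing spawns fresh interpreters, such as Windows or macOS on newer Python versions):

import os
import multiprocessing as mp

import h5py

def worker(fn):
    with h5py.File(fn, 'r') as f:
        result = ...  # placeholder: extract, manipulate and plot the data
    return result

if __name__ == '__main__':
    full_fns = [os.path.join('directory', filename)
                for filename in os.listdir('directory')]

    pool = mp.Pool(4)
    results = pool.map(worker, full_fns)
    pool.close()
    pool.join()

    for r in results:
        print(r)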

Let me know in the comments how this works out for you.

Chen A.
  • 10,140
  • 3
  • 42
  • 61
  • I appreciate the time taken to explain how to use multiprocessing, I'm sure it will be useful for others. That is why I gave you the bounty. However, the solution didn't work for me. When I run the script, the spaceship icon comes up in the bottom corner of my Mac, then disappears almost instantaneously, and nothing is computed. – Matthew Cassell Oct 22 '17 at 13:09
  • It sounds like the workers don't get their tasks. Debug your worker function using prints. Check that it works with a single process first (using the worker function), and only then expand to multiple processes. – Chen A. Oct 22 '17 at 13:20
  • I'd strongly recommend not to join the pool before iterating over the results. This will cause all results to be buffered in memory for no benefit. If you loop over results, it will wait for workers to provide results. – Javier Oct 23 '17 at 16:34
  • @Javier iterating the results before the pool has finished processing means you are going to iterate over partial data. There might be a subprocess still running. In order to make sure all subprocesses are done, you should join() the pool. – Chen A. Oct 23 '17 at 17:33
  • No, @Vinny. You got that wrong. Perform a trivial test with a sleep function and you'll see. – Javier Oct 24 '17 at 19:13
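
For what it's worth, a trivial test of the kind Javier describes (slow_worker here is just a made-up stand-in for the real per-file work) shows that iterating over pool.imap() blocks until each result is ready, so the loop never sees partial data:

import multiprocessing as mp
import time

def slow_worker(x):
    time.sleep(1)   # stand-in for a slow per-file task
    return x * x

if __name__ == '__main__':
    pool = mp.Pool(4)
    # Iterating over imap() blocks until each result is ready, so the loop
    # never sees partial data and nothing has to be buffered up front.
    for r in pool.imap(slow_worker, range(8)):
        print(r)
    pool.close()
    pool.join()
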
1

Multiprocessing should not be more complicated than this:

import multiprocessing
import os

import h5py


def process_one_file(fn):
    with h5py.File(fn, 'r') as f:
        ...  # process the data in f and set is_successful accordingly
    return is_successful


fns = [os.path.join('directory', fn) for fn in os.listdir('directory')]
pool = multiprocessing.Pool()
for fn, is_successful in zip(fns, pool.imap(process_one_file, fns)):
    print(fn, "succeeded?", is_successful)
Javier
  • 2,752
  • 15
  • 30
1

You should be able to implement multiprocessing easily using the multiprocessing library.

import glob
from multiprocessing.dummy import Pool  # thread-based pool with the same interface

def processData(files):
    print(files)
    ...
    return result

allFiles = glob.glob("<file path/file mask>")
pool = Pool(6) # for 6 threads for example
results = pool.map(processData, allFiles)
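
Note that multiprocessing.dummy gives a pool of threads with the same interface as the process-based pool. Since the data manipulation and plotting here are largely CPU-bound, separate processes may give a better speed-up; switching is just a matter of the import (a sketch of the change):

from multiprocessing import Pool  # process-based pool, same map() interface

pool = Pool(6)  # 6 worker processes instead of 6 threads
results = pool.map(processData, allFiles)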
afsd
  • 152
  • 1
  • 9