
I am trying to do some analysis of the MODIS satellite data. My code primarily reads a lot of files (806), each of dimension 1200 by 1200 (806×1200×1200 values in total). It does this using a for loop and performs mathematical operations.

The following is the general way in which I read the files.

import numpy as np
import xarray as xray  # older releases were installed as "xray"

mindex = np.zeros((1200, 1200))
for i in range(1200):
    var1 = xray.open_dataset('filename.nc')['variable'][:, i, :].data
    for j in range(1200):
        var2 = var1[:, j]
        ## Mathematical Calculations to find var3[i,j] ##
        mindex[i, j] = var3[i, j]

Since it's a lot of data to handle, the process is very slow, and I was considering parallelizing it. I tried doing something with joblib, but I have not been able to make it work.

I am unsure how to tackle this problem.
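
For reference, the joblib pattern I was attempting looks roughly like this (a sketch only; process_file and the glob pattern are placeholders, not my actual code):

import glob
from joblib import Parallel, delayed

def process_file(path):
    # placeholder for the per-file computation that builds a mindex matrix
    ...

files = glob.glob('*.nc')  # hypothetical file pattern
results = Parallel(n_jobs=-1)(delayed(process_file)(f) for f in files)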

Nirav L Lekinwala
  • This is an easy application of multithreading. A tutorial for the standard python library for this can be viewed here: https://pymotw.com/3/threading/ – ritlew Jul 13 '18 at 13:04
  • @ritlew Multithreading? Multiprocessing would be more performant here, since the mathematical calculations could not be multithreaded anyway, no? – Mathieu Jul 13 '18 at 13:12
  • Well if there are 806 files, each thread could take a file from a pool and perform relevant calculations on it. That's assuming I understand the problem correctly – ritlew Jul 13 '18 at 13:14
  • @ritlew Funny. I would have said each process. – Mathieu Jul 13 '18 at 13:15
  • It's not clear which part of your code does IO and which part is CPU intensive. Your example is also confusing because `open_dataset` is called once for each `i`, so it looks like each open call is getting one row of 1200 values for each of the 806 files. Is that what's happening? – Vincent Jul 13 '18 at 13:17
  • @ritlew The reason I am saying this is that, because of the GIL, each process can only run one thread at a time. If I recall correctly, multithreading is efficient only when threads are not going to perform pythonic operations (OpenCV, for instance, works well with multithreading). – Mathieu Jul 13 '18 at 13:25
  • So this is satellite data that comes as multiple .hdf files. I use cdo to combine them into a single .nc file, so the 806 (time instants) is the number of files combined in the .nc file. Now that I have only one file, the data is arranged in the form of a 3D array, which can be thought of like a book, with each page containing the data of one time instant. – Nirav L Lekinwala Jul 15 '18 at 08:55
  • [How to parallelize iteration over a range?](https://stackoverflow.com/q/52636002/9059420) – Darkonaut Dec 15 '18 at 17:09

1 Answer


My guess is that you want to work on several files at the same time. To do so, the best way (in my opinion) is to use multiprocessing. To use it, you need to define an elementary step, and your code already provides one.

import numpy as np
import multiprocessing as mp
import os
import xarray as xray  # older releases were installed as "xray"

def f(file):
    """Elementary step: compute the mindex matrix for a single file."""
    mindex = np.zeros((1200, 1200))
    data = xray.open_dataset(file)['variable']  # open the dataset once per file
    for i in range(1200):
        var1 = data[:, i, :].data
        for j in range(1200):
            var2 = var1[:, j]
            ## Mathematical Calculations to find var3[i,j] ##
            mindex[i, j] = var3[i, j]
    return (file, mindex)


if __name__ == '__main__':
    N = mp.cpu_count()

    folder = 'path/to/data'  # directory containing the .nc files
    files = os.scandir(folder)

    with mp.Pool(processes=N) as p:
        # pass full paths (file.path) so workers can open files outside the cwd
        results = p.map(f, [file.path for file in files])

This returns a list, `results`, in which each element is a tuple of the file name and its mindex matrix. With this, you can work on multiple files at the same time. It is particularly efficient when the computation on each file is long.
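
For instance, a small usage sketch (not part of the answer itself; it relies on `p.map` preserving the input order, which it does):

mindex_by_file = dict(results)                  # map file path -> mindex matrix
all_mindex = np.stack([m for _, m in results])  # shape (806, 1200, 1200)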

Mathieu
  • Note that [multithreading should work fine too](http://scipy-cookbook.readthedocs.io/items/Multithreading.html) if the computational part is done with numpy. – Vincent Jul 13 '18 at 13:26
  • @Vincent True, numpy is indeed switching to C. – Mathieu Jul 13 '18 at 13:31
  • @Vincent However, it might need more adaptation of the code, especially to do array-wide operations to obtain mindex instead of this double for loop. – Mathieu Jul 13 '18 at 13:38
  • Since most of the time is probably spent during the IO operations and the numpy functions, I would expect `mp.pool.ThreadPool` to achieve the same performance (maybe even better since `mindex` doesn't have to be serialized). This would require a benchmark though. – Vincent Jul 13 '18 at 13:56
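
For illustration, the `mp.pool.ThreadPool` variant described in the last comment is a near drop-in replacement. A minimal sketch, reusing `f` from the answer above and assuming the heavy work releases the GIL in IO and numpy calls:

from multiprocessing.pool import ThreadPool
import multiprocessing as mp
import os

if __name__ == '__main__':
    N = mp.cpu_count()
    files = os.scandir('path/to/data')  # hypothetical data directory, as above

    # same map API as mp.Pool, but threads share memory, so the returned
    # mindex matrices are not serialized between workers and the parent
    with ThreadPool(processes=N) as p:
        results = p.map(f, [file.path for file in files])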