
I am working on code that loops over multiple large (~28 GB) netCDF files. The files contain several 4D variables [time, east-west, south-north, height] over a domain. The goal is to loop over these files, loop over every location of these variables in the domain, and pull certain variables into a large array. When a file is missing or incomplete, I fill the values with 99.99. Right now I am just testing by looping over 2 daily netCDF files, but for some reason it is taking forever (~14 hours). I am not sure if there is a way to optimize this code. I don't think Python should take this long for this task, so maybe the problem is with my code. My code is below; hopefully it is readable, and any suggestions on how to make it faster are greatly appreciated:

#Domain to loop over
k_space = np.arange(0,37)
j_space = np.arange(80,170)
i_space = np.arange(200,307)

predictors_wrf=[]
names_wrf=[]

counter = 0
cdate = start_date
while cdate <= end_date:
    if cdate.month not in month_keep:
        cdate+=inc
        continue
    yy = cdate.strftime('%Y')        
    mm = cdate.strftime('%m')
    dd = cdate.strftime('%d')
    filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
    for i in i_space:
        for j in j_space:
            for k in k_space:
                    if os.path.isfile(filename):
                        f = nc.Dataset(filename,'r')
                        times = f.variables['Times'][1:]
                        num_lines = times.shape[0]
                        if num_lines == 144:
                            u = f.variables['U'][1:,k,j,i]
                            v = f.variables['V'][1:,k,j,i]
                            wspd = np.sqrt(u**2.+v**2.)
                            w = f.variables['W'][1:,k,j,i]
                            p = f.variables['P'][1:,k,j,i]
                            t = f.variables['T'][1:,k,j,i]
                        if num_lines < 144:
                            print "partial files for WRF: "+ filename
                            u = np.ones((144,))*99.99
                            v = np.ones((144,))*99.99
                            wspd = np.ones((144,))*99.99
                            w = np.ones((144,))*99.99
                            p = np.ones((144,))*99.99
                            t = np.ones((144,))*99.99
                    else:
                        u = np.ones((144,))*99.99
                        v = np.ones((144,))*99.99
                        wspd = np.ones((144,))*99.99
                        w = np.ones((144,))*99.99
                        p = np.ones((144,))*99.99
                        t = np.ones((144,))*99.99
                        counter=counter+1
                    predictors_wrf.append(u)
                    predictors_wrf.append(v)
                    predictors_wrf.append(wspd)
                    predictors_wrf.append(w)
                    predictors_wrf.append(p)
                    predictors_wrf.append(t)
                    u_names = 'u_'+str(k)+'_'+str(j)+'_'+str(i)
                    v_names = 'v_'+str(k)+'_'+str(j)+'_'+str(i)
                    wspd_names = 'wspd_'+str(k)+'_'+str(j)+'_'+str(i)
                    w_names = 'w_'+str(k)+'_'+str(j)+'_'+str(i)
                    p_names = 'p_'+str(k)+'_'+str(j)+'_'+str(i)
                    t_names = 't_'+str(k)+'_'+str(j)+'_'+str(i)
                    names_wrf.append(u_names)
                    names_wrf.append(v_names)
                    names_wrf.append(wspd_names)
                    names_wrf.append(w_names)
                    names_wrf.append(p_names)
                    names_wrf.append(t_names)
    cdate+=inc
HM14
  • You can use multiprocessing to process the files at the same time: split the k, j, i spaces among different processes and let each of them do its own part. – Haifeng Zhang Feb 22 '17 at 04:29
  • What is `nc.Dataset`? Additionally, before you can *improve* speed, you need to know why it is slow. You will need to profile your code and *measure*. – Roland Smith Feb 22 '17 at 04:29
  • It is how I read in netCDF files with Python. I have a statement earlier in the code, not shown here: import netCDF4 as nc – HM14 Feb 22 '17 at 04:31
  • The multi-core suggestion below would help. Also, if you are working in an IPython notebook, writing it to a script you run from the command line can MASSIVELY speed things up. 28 GB is a huge file. If both files are in that size range and you have 3 nested loops with conditions, 14 hours on a single core is not out of this world, no matter how ridiculous it seems. R is much slower than Python, and smaller files have taken 8-12 hours to sort through with less looping. Just be as conservative as you can with redundant operations and fire up more cores! – sconfluentus Feb 22 '17 at 04:57
  • It seems that you go over the file several times: the `f.variables['Times'][1:]` read searches the file for that variable, and it is done on every iteration of the loops. Do it once, not inside each loop. – Dudi b Feb 22 '17 at 06:18

3 Answers


This is a rough first pass at tightening up your for-loops. Since you only need the file's shape once per file, you can move that handling outside the loops, which should cut down on the repeated loading of data that is interrupting processing. I still don't get what counter and inc do, as they don't seem to be updated in the loop. You definitely want to look at the cost of the repeated string concatenation, and at how appending to predictors_wrf and names_wrf performs, as further starting points (a small sketch of both follows the code below).

k_space = np.arange(0,37)
j_space = np.arange(80,170)
i_space = np.arange(200,307)

predictors_wrf=[]
names_wrf=[]

counter = 0
cdate = start_date
while cdate <= end_date:
    if cdate.month not in month_keep:
        cdate+=inc
        continue
    yy = cdate.strftime('%Y')        
    mm = cdate.strftime('%m')
    dd = cdate.strftime('%d')
    filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
    file_exists = os.path.isfile(filename)
    if file_exists:
        f = nc.Dataset(filename,'r')
        times = f.variables['Times'][1:]
        num_lines = times.shape[0]
    for i in i_space:
        for j in j_space:
            for k in k_space:
                    if file_exists:    
                        if num_lines == 144:
                            u = f.variables['U'][1:,k,j,i]
                            v = f.variables['V'][1:,k,j,i]
                            wspd = np.sqrt(u**2.+v**2.)
                            w = f.variables['W'][1:,k,j,i]
                            p = f.variables['P'][1:,k,j,i]
                            t = f.variables['T'][1:,k,j,i]
                        if num_lines < 144:
                            print "partial files for WRF: "+ filename
                            u = np.ones((144,))*99.99
                            v = np.ones((144,))*99.99
                            wspd = np.ones((144,))*99.99
                            w = np.ones((144,))*99.99
                            p = np.ones((144,))*99.99
                            t = np.ones((144,))*99.99
                    else:
                        u = np.ones((144,))*99.99
                        v = np.ones((144,))*99.99
                        wspd = np.ones((144,))*99.99
                        w = np.ones((144,))*99.99
                        p = np.ones((144,))*99.99
                        t = np.ones((144,))*99.99
                        counter=counter+1
                    predictors_wrf.append(u)
                    predictors_wrf.append(v)
                    predictors_wrf.append(wspd)
                    predictors_wrf.append(w)
                    predictors_wrf.append(p)
                    predictors_wrf.append(t)
                    u_names = 'u_'+str(k)+'_'+str(j)+'_'+str(i)
                    v_names = 'v_'+str(k)+'_'+str(j)+'_'+str(i)
                    wspd_names = 'wspd_'+str(k)+'_'+str(j)+'_'+str(i)
                    w_names = 'w_'+str(k)+'_'+str(j)+'_'+str(i)
                    p_names = 'p_'+str(k)+'_'+str(j)+'_'+str(i)
                    t_names = 't_'+str(k)+'_'+str(j)+'_'+str(i)
                    names_wrf.append(u_names)
                    names_wrf.append(v_names)
                    names_wrf.append(wspd_names)
                    names_wrf.append(w_names)
                    names_wrf.append(p_names)
                    names_wrf.append(t_names)
    cdate+=inc
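
A small sketch of those two starting points (the big win is still reducing file reads, but this trims the per-point bookkeeping): build the name suffix once per point with str.format instead of chained + concatenation, and replace the twelve append calls with two extend calls:

point = '{0}_{1}_{2}'.format(k, j, i)
predictors_wrf.extend([u, v, wspd, w, p, t])
names_wrf.extend(['u_' + point, 'v_' + point, 'wspd_' + point,
                  'w_' + point, 'p_' + point, 't_' + point])
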
Selecsosi

For your question, I think multiprocessing will help a lot. I went through your code and have some pieces of advice here.

  1. Use the filenames, rather than the start time, as the iterator in your code.

    Write a function that builds all the file names from the date range and returns them as a list.

    def fileNames(start_date, end_date):
        # Find all filenames.
        cdate = start_date
        fileNameList = [] 
        while cdate <= end_date:
            if cdate.month not in month_keep:
                cdate+=inc
                continue
            yy = cdate.strftime('%Y')        
            mm = cdate.strftime('%m')
            dd = cdate.strftime('%d')
            filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
            fileNameList.append(filename)
            cdate+=inc
    
        return fileNameList
    
  2. Wrap the code that pulls your data and fills missing values with 99.99 into a function; the input to the function is the file name.

    def dataExtraction(filename):
        # Keep the results local so each worker returns data for its own file only.
        predictors_wrf = []
        names_wrf = []
        counter = 0  # counts grid points filled with 99.99 because the file is missing
        file_exists = os.path.isfile(filename)
        if file_exists:
            f = nc.Dataset(filename,'r')
            times = f.variables['Times'][1:]
            num_lines = times.shape[0]
        for i in i_space:
            for j in j_space:
                for k in k_space:
                    if file_exists:
                        if num_lines == 144:
                            u = f.variables['U'][1:,k,j,i]
                            v = f.variables['V'][1:,k,j,i]
                            wspd = np.sqrt(u**2.+v**2.)
                            w = f.variables['W'][1:,k,j,i]
                            p = f.variables['P'][1:,k,j,i]
                            t = f.variables['T'][1:,k,j,i]
                        if num_lines < 144:
                            print "partial files for WRF: "+ filename
                            u = np.ones((144,))*99.99
                            v = np.ones((144,))*99.99
                            wspd = np.ones((144,))*99.99
                            w = np.ones((144,))*99.99
                            p = np.ones((144,))*99.99
                            t = np.ones((144,))*99.99
                    else:
                        u = np.ones((144,))*99.99
                        v = np.ones((144,))*99.99
                        wspd = np.ones((144,))*99.99
                        w = np.ones((144,))*99.99
                        p = np.ones((144,))*99.99
                        t = np.ones((144,))*99.99
                        counter = counter + 1
                    predictors_wrf.append(u)
                    predictors_wrf.append(v)
                    predictors_wrf.append(wspd)
                    predictors_wrf.append(w)
                    predictors_wrf.append(p)
                    predictors_wrf.append(t)
                    u_names = 'u_'+str(k)+'_'+str(j)+'_'+str(i)
                    v_names = 'v_'+str(k)+'_'+str(j)+'_'+str(i)
                    wspd_names = 'wspd_'+str(k)+'_'+str(j)+'_'+str(i)
                    w_names = 'w_'+str(k)+'_'+str(j)+'_'+str(i)
                    p_names = 'p_'+str(k)+'_'+str(j)+'_'+str(i)
                    t_names = 't_'+str(k)+'_'+str(j)+'_'+str(i)
                    names_wrf.append(u_names)
                    names_wrf.append(v_names)
                    names_wrf.append(wspd_names)
                    names_wrf.append(w_names)
                    names_wrf.append(p_names)
                    names_wrf.append(t_names)
        if file_exists:
            f.close()

        return zip(predictors_wrf, names_wrf)
    
  3. Use multiprocessing to do the work. Most computers have more than one CPU core, and multiprocessing helps most when there is a large amount of CPU-bound calculation. In my experience, multiprocessing can cut the time spent on a huge dataset by up to two thirds.

    Update: after testing my code and files again on Feb. 25, 2017, I found that using 8 cores on a huge dataset saved me about 90% of the elapsed time.

    if __name__ == '__main__':
        from multiprocessing import Pool  # better placed with the other imports at the top
        # start_date and end_date need to be datetime objects (import datetime at the top),
        # since fileNames calls strftime on them.
        start_date = datetime.datetime(2017, 1, 1)
        end_date = datetime.datetime(2017, 1, 15)
        file_names = fileNames(start_date, end_date)  # renamed so it does not shadow the function
        p = Pool(4)  # the number of cores you want to use
        results = p.map(dataExtraction, file_names)
        p.close()
        p.join()
    
  4. Finally, be careful with the data structures here: results is a list with one entry per file, each holding the zipped (value, name) pairs returned by dataExtraction, so it still needs to be flattened afterwards (see the sketch below). Hope this helps. Please leave comments if you have any further questions.
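
    A minimal sketch of one way to do that unpacking, assuming dataExtraction returns zip(predictors_wrf, names_wrf) for a single file as above:

    # Sketch only: flatten the per-file results from p.map back into the
    # flat predictors_wrf / names_wrf lists used in the original code.
    predictors_wrf = []
    names_wrf = []
    for file_result in results:             # one entry per file
        for series, name in file_result:    # (values, name) pairs from zip()
            predictors_wrf.append(series)
            names_wrf.append(name)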

Wenlong Liu

I don't have very many suggestions, but a couple of things to note.

Don't open that file so many times

First, you define this filename variable and then inside this loop (deep inside: three for-loops deep), you are checking if the file exists and presumably opening it there (I don't know what nc.Dataset does, but I'm guessing it must open the file and read it):

filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
    for i in i_space:
        for j in j_space:
            for k in k_space:
                    if os.path.isfile(filename):
                        f = nc.Dataset(filename,'r')

This is going to be pretty inefficient. Since the file doesn't change while you loop, you can certainly open it once, before all of your loops.
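
A minimal sketch of that restructuring, keeping your variable names (the inner-loop body is omitted here):

filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
if os.path.isfile(filename):           # check once per file
    f = nc.Dataset(filename, 'r')      # open once per file
    times = f.variables['Times'][1:]
    num_lines = times.shape[0]
    for i in i_space:
        for j in j_space:
            for k in k_space:
                pass                   # slice f.variables[...] here; no isfile/Dataset calls inside
    f.close()                          # close once, after the loops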

Try to Use Fewer for-loops

All of these nested for-loops are compounding the number of operations you need to perform. General suggestion: try to use numpy operations instead.
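
For example, here is a sketch of what replacing the innermost reads with one large read per variable could look like. It assumes the dimension order matches the indexing already used, [time, k, j, i], and that the sub-domain fits in memory (at 144 x 37 x 90 x 107 values per variable, that is a few hundred MB):

# Sketch only: one netCDF read per variable instead of one tiny read per (k, j, i).
u_all = f.variables['U'][1:, 0:37, 80:170, 200:307]
v_all = f.variables['V'][1:, 0:37, 80:170, 200:307]
wspd_all = np.sqrt(u_all**2. + v_all**2.)          # wind speed for every location at once

# Flatten the three spatial dimensions so each column is the time series for one (k, j, i).
n_times = u_all.shape[0]
u_flat = u_all.reshape(n_times, -1)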

Use cProfile

If you want to know why your programs are taking a long time, one of the best ways to find out is to profile them.
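
cProfile is in the standard library; you can run the whole script under it from the command line (python -m cProfile -s cumulative your_script.py), or profile just one file's worth of work, roughly like this (dataExtraction here is only a stand-in for whichever function ends up doing the heavy lifting):

import cProfile
import pstats

cProfile.run('dataExtraction(filename)', 'wrf_profile')   # write stats to a file
stats = pstats.Stats('wrf_profile')
stats.sort_stats('cumulative').print_stats(20)            # 20 most expensive calls by cumulative time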

erewok