
I have a general question about writing a large number of large text files. The content of each text file is based on a dataset and differs from file to file. The basic question is how to do this most efficiently.

More specifically, I want to perform spatially explicit model runs (crop model). The model requires its input files to be in txt format, so if I wish to run the model for a large number of raster cells, I need a text file for each cell (thousands of them). The efficiency problem occurs when writing the weather input based on climate projections: daily time steps for up to 100 years, i.e. up to 36,500 lines (of 8 variables each) extracted from the dataset and written to each text file.

My first attempt was a for loop that iterates over each location (i.e. each text file) and, for each file, loops through every daily climate time step to build the whole string for the climate file, which is then written to the text file (I also tested writing each time step to the file separately, but efficiency was similar).

This approach takes ca. 1-2 min per file on my (somewhat old) machine; for a raster of 70x80 cells that is about 7 days. Of course I could reduce the number of locations and select fewer time steps, but I still wonder whether there is a more efficient way to do this.

Since my research suggests that the for loop which builds each line and writes it to file is the bottleneck, I wonder whether pulling the data into an array or dataframe and then saving it to csv would be quicker? Or what would you suggest as the most suitable approach for this operation?
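To make the alternative concrete, what I imagine is roughly this (a rough sketch with made-up numbers standing in for the data extracted per cell; `Weather_example.txt` is just an illustrative path):

```python
import os
import tempfile

import numpy as np

# Made-up stand-in for one cell's extracted climate series:
# 8 variables over n daily time steps (36,500 in the real case).
n_days = 365
data = np.column_stack([np.full(n_days, i, dtype=float) for i in range(8)])

# One savetxt call writes every row at once, instead of a Python loop per line.
out_path = os.path.join(tempfile.gettempdir(), "Weather_example.txt")
np.savetxt(out_path, data, fmt="%.2f", delimiter="\t")
```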

Thank you in advance!

Best regards, Anton.

Please let me know if I should provide additional code/info etc., as I am new to programming (I have been teaching myself for a month now); apologies if things are a bit messy.

Here is the code:

import glob
import os

import netCDF4
from netCDF4 import Dataset
from cdo import Cdo

cdo = Cdo()

weather_dir = os.path.join(os.getcwd(), "data", "raw", "weather")

precip45l = glob.glob(weather_dir+"/pr*45*d.nc")
tasmax45l = glob.glob(weather_dir+"/tasmax*45*")
tasmin45l = glob.glob(weather_dir+"/tasmin*45*")
evsps45l = glob.glob(weather_dir+"/evsps*45*")

cdo.mergetime(input=precip45l, output= weather_dir+"/precip45.nc")
cdo.mulc("86400", input=weather_dir+"/precip45.nc"
         , output= weather_dir+"/precip45mm.nc" )
precip45 = Dataset(weather_dir+"/precip45mm.nc")

cdo.mergetime(input= tasmax45l, output= weather_dir+"/tasmax45.nc")   
cdo.subc("273.15", input=weather_dir+"/tasmax45.nc"
         , output= weather_dir+"/tasmax45C.nc" )
tasmax45 = Dataset(weather_dir+"/tasmax45C.nc")

cdo.mergetime(input= tasmin45l, output= weather_dir+"/tasmin45.nc")   
cdo.subc("273.15", input=weather_dir+"/tasmin45.nc"
         , output= weather_dir+"/tasmin45C.nc" )
tasmin45 = Dataset(weather_dir+"/tasmin45C.nc")

cdo.mergetime(input= evsps45l, output= weather_dir+"/evsps45.nc")   
cdo.mulc("86400", input=weather_dir+"/evsps45.nc"
         , output= weather_dir+"/evsps45mm.nc" )
evsps45 = Dataset(weather_dir+"/evsps45mm.nc")

datetime_model = netCDF4.num2date(precip45.variables["time"][:]
                                 , "days since 1949-12-1 00:00:00"
                                 )



def create_weather():
    time_length = range(len(datetime_model))
    file_path = os.path.join(os.getcwd(),"data","input" ,"lat")
    for x in lat:
        for y in lon:
            fh = open(os.path.join(file_path+str(x)+"_lon"+str(y),"Weather.txt"), "w")
            fh.write("%% ---------- Weather input time-series for AquaCropOS ---------- %%\n%%Day\tMonth\tYear\tMinTemp\tMaxTemp\tPrecipitation\tReferenceET%%")
            for i in time_length:
                fh.write(
                        "\n"+str(datetime_model[i].day)
                        +"\t"+str(datetime_model[i].month)
                        +"\t"+str(datetime_model[i].year)
                        +"\t"+str(tasmin45.variables["tasmin"][i][x][y])
                        +"\t"+str(tasmax45.variables["tasmax"][i][x][y])
                        +"\t"+str(precip45.variables["pr"][i][x][y])
                        +"\t"+str(evsps45.variables["evspsblpot"][i][x][y])
                        )
            fh.close()

create_weather()      

I used cProfile to check the code:

         21695294 function calls in 137.753 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1  100.772  100.772  137.752  137.752 <ipython-input-25-a234aeb2049c>:1(create_weather)
        1    0.000    0.000  137.753  137.753 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 _bootlocale.py:23(getpreferredencoding)
   876576    0.558    0.000    5.488    0.000 _methods.py:37(_any)
   584384    0.292    0.000    3.154    0.000 _methods.py:40(_all)
        2    0.000    0.000    0.000    0.000 codecs.py:185(__init__)
  2629728    2.130    0.000    3.675    0.000 function_base.py:213(iterable)
  1460960    0.562    0.000    3.935    0.000 numeric.py:424(asarray)
        3    0.000    0.000    0.000    0.000 posixpath.py:39(_get_sep)
        3    0.000    0.000    0.000    0.000 posixpath.py:73(join)
   584384    3.395    0.000    7.891    0.000 utils.py:23(_safecast)
   876576    0.565    0.000    0.565    0.000 utils.py:40(_find_dim)
   292192    1.744    0.000    2.227    0.000 utils.py:423(_out_array_shape)
   292192    9.756    0.000   20.609    0.000 utils.py:88(_StartCountStride)
        2    0.000    0.000    0.000    0.000 {built-in method _locale.nl_langinfo}
        1    0.000    0.000  137.753  137.753 {built-in method builtins.exec}
        3    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
  2629728    1.546    0.000    1.546    0.000 {built-in method builtins.iter}
  1753153    0.263    0.000    0.263    0.000 {built-in method builtins.len}
   292192    0.214    0.000    0.214    0.000 {built-in method builtins.max}
        2    0.001    0.000    0.001    0.000 {built-in method io.open}
  1460960    3.373    0.000    3.373    0.000 {built-in method numpy.core.multiarray.array}
  1168768    2.158    0.000    2.158    0.000 {built-in method numpy.core.multiarray.empty}
        3    0.000    0.000    0.000    0.000 {built-in method posix.fspath}
        1    0.000    0.000    0.000    0.000 {built-in method posix.getcwd}
   584384    1.342    0.000    4.496    0.000 {method 'all' of 'numpy.generic' objects}
  3214112    0.369    0.000    0.369    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        5    0.000    0.000    0.000    0.000 {method 'endswith' of 'str' objects}
   584384    0.347    0.000    0.347    0.000 {method 'indices' of 'slice' objects}
   876576    0.375    0.000    0.375    0.000 {method 'ravel' of 'numpy.ndarray' objects}
  1460960    7.791    0.000    7.791    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        5    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
    73050    0.199    0.000    0.199    0.000 {method 'write' of '_io.TextIOWrapper' objects}
  • Please show some actual code. The approach you are taking sounds like it should be reasonably fast - however it is entirely possible that the issue is *how* you are creating the data lines, rather than writing them. Also consider profiling your code to find the slow points. – match Feb 20 '18 at 09:35
  • Can you also show some examples of the input data? – user3062260 Feb 20 '18 at 09:49
  • I know how personal a project you're working on gets and how much you want to share with us how cool what you're working on is, but this is counterproductive on SO, Show some code and try to abstract away the problem specific details so that you are left with as purely a programmatic problem as can be had. – Veltzer Doron Feb 20 '18 at 09:50
  • Hi will do so immediately. I did not mean to not share on purpose - just thought of the problem as conceptual in nature and wanted to keep things parsimonious. – AntonUrf Feb 20 '18 at 09:57
  • I added the relevant code. I had tried different versions of writing to file (e.g. open(aaaaa).write(aaaaaaa) etc.) but all were similarly slow. – AntonUrf Feb 20 '18 at 10:08
  • Is the time consuming step the output generation or the writing? – Maarten Fabré Feb 20 '18 at 10:38
  • @MaartenFabré how do I find that out? – AntonUrf Feb 20 '18 at 10:41
  • @MaartenFabré I googled a bit and profiled the code. It turns out exec is taking up almost 100% of the time. What does this mean? 1 0.000 0.000 137.753 137.753 {built-in method builtins.exec} – AntonUrf Feb 20 '18 at 11:06

2 Answers


It may help you to test how long each step is taking by doing something like here.

It looks like your code is trying to store lots of information in memory.

You can test this by running the following for each read-in:

import sys

sys.getsizeof(object)

Try and rule these out first.

If it's a memory issue, then try reading the files in chunks rather than reading each whole file at once.

import itertools

with open(file, "r") as text:
    for line in itertools.islice(text, 0, 100):
        # parse line
        pass

Appreciate this isn't a full answer, but maybe it'll get you started.

user3062260
  • Thanks, I'll take a look. Btw, your link does not link to where it's supposed to. – AntonUrf Feb 20 '18 at 11:24
  • I don't think it's a RAM issue though. I monitored RAM usage when running the function and it remains fairly stable at 4GB/12GB. – AntonUrf Feb 20 '18 at 11:26

First you'll need to find out whether it is the data generation or the writing that is taking the time. The easiest way is to separate the program into logical functions and use timeit to time them.
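For example, a minimal version of that pattern (with toy stand-ins for the two steps) could look like:

```python
import timeit

def generate_rows():
    # toy stand-in for the data-generation step
    return [(i, i * 2) for i in range(1000)]

def write_rows(rows):
    # toy stand-in for the writing step (builds the output string)
    return "\n".join(f"{a}\t{b}" for a, b in rows)

rows = generate_rows()
t_generate = timeit.timeit(generate_rows, number=100)
t_write = timeit.timeit(lambda: write_rows(rows), number=100)
print(f"generate: {t_generate:.4f}s  write: {t_write:.4f}s")
```

Comparing the two numbers tells you which step to optimise first.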

calculating the data

There is a lot of repetition in calculating the model, so we can easily abstract that:

import csv
from itertools import product
from pathlib import Path


weather_dir = Path('.') / 'data' / 'raw' / 'weather'

measurements = {
    'precipitation': {
        'glob_pattern': "pr*45*d.nc",
        'value': '86400',
        'input_filename': 'precip45.nc',
        'output_filename': 'precip45mm.nc'
    },
    'tasmax': {
        'glob_pattern': "tasmax*45*",
        'value': '273.15',
        'input_filename': 'tasmax45.nc',
        'output_filename': 'tasmax45C.nc'
    },
    'tasmin': {
        'glob_pattern': "tasmin*45*",
        'value': '273.15',
        'input_filename': 'tasmin45.nc',
        'output_filename': 'tasmin45C.nc'
    },
    'evsps45': {
        'glob_pattern': "evsps*45*",
        'value': '86400',
        'input_filename': 'evsps45.nc',
        'output_filename': 'evsps45mm.nc'
    },
}

def get_measurement(cdo, weather_dir, settings):
    input_files = list(weather_dir.glob(settings['glob_pattern']))
    temp_file = str(weather_dir / settings['input_filename'])
    out_file = str(weather_dir / settings['output_filename'])
    cdo.mergetime(input=input_files, output=temp_file)
    cdo.mulc(  # note: the temperature datasets need cdo.subc here instead
        settings['value'],
        input=temp_file,
        output=out_file,
        )
    return Dataset(out_file)

That part is easily timed like this:

times = {}
data = {}
for key, value in measurements.items():
    times[key] = timeit.timeit(
        'data[key] = get_measurement(cdo, weather_dir, value)',
        number=1,
        globals=globals(),
        )

times['datetime_model'] = timeit.timeit(
        '''data['datetime_model'] = netCDF4.num2date(
                data['precipitation'].variables["time"][:],
                "days since 1949-12-1 00:00:00",
                )''',
        number=1,
        globals=globals(),
        )

By abstracting the calculation like this, it is also possible to check whether the results have already been calculated. If they have, there is perhaps no reason to do the calculation again:

def get_measurement_with_cache(cdo, weather_dir, settings):
    input_files = list(weather_dir.glob(settings['glob_pattern']))
    temp_file = str(weather_dir / settings['input_filename'])
    out_path = weather_dir / settings['output_filename']
    if not out_path.exists():
        # You might want to include some of the model parameters in the
        # filename to distinguish runs with different parameters
        cdo.mergetime(input=input_files, output=temp_file)
        cdo.mulc(  # note: the temperature datasets need cdo.subc here instead
            settings['value'],
            input=temp_file,
            output=str(out_path),
            )
    return Dataset(str(out_path))

writing the dataset

This can be done more easily with the csv module:

output_dir = Path('.') / "data" / "input" / "lat"
def write_output(data, output_dir):
    time_length = range(len(datetime_model))

    for x, y in product(lat, lon):
        output_file = output_dir / f'{x}_lon{y}' / "Weather.txt"  # f-strings are a Python 3.6 feature
        with output_file.open('w') as fh:
            fh.write('%% ---------- Weather input time-series for AquaCropOS ---------- %%\n'
                     '%%Day\tMonth\tYear\tMinTemp\tMaxTemp\tPrecipitation\tReferenceET%%\n')
            writer = csv.writer(fh, delimiter='\t')
            for i in time_length:
                row = (
                    datetime_model[i].day,
                    datetime_model[i].month,
                    datetime_model[i].year,
                    tasmin45.variables["tasmin"][i][x][y],
                    tasmax45.variables["tasmax"][i][x][y],
                    precip45.variables["pr"][i][x][y],
                    evsps45.variables["evspsblpot"][i][x][y],
                    )
                writer.writerow(row)

That can be timed like this:

times['writing'] = timeit.timeit(
        'write_output(data, output_dir)',
        number=1,
        globals=globals(),
        )
Maarten Fabré