I have a general question about writing a large number of large text files. The content of each text file is extracted from a dataset and differs per file. The basic question is how to do this most efficiently.
More specifically, I want to perform spatially explicit model runs with a crop model. The model requires its input files in txt format, so to run the model for a large number of raster cells I need a text file for each cell (thousands of them). The efficiency problem occurs when writing the weather input based on climate projections: daily time steps for up to 100 years, i.e. 36,500 lines (each with 8 variables) extracted from the dataset and written to every text file.
My first attempt was a for loop that iterates over each location (i.e. each text file) and, for each file, over each daily climate time step, building up the whole string for the climate file and then writing it to the text file (I also tested writing each time step to the file individually, but performance was similar).
This approach takes ca. 1–2 minutes per file on my (somewhat old) machine, so ca. 7 days for a 70x80-cell raster. Of course I could scale down the number of locations and select fewer time steps, but I still wonder whether there is a more efficient way to do this.
From my research I believe the for loop that concatenates and writes each line is the bottleneck, so I wonder whether pulling the data into an array or DataFrame and then saving it to CSV would be quicker. Or what would you suggest as the most suitable approach for this operation?
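To make the array idea concrete, this is roughly what I have in mind: a minimal sketch with random stand-in data (the column names and value ranges are made up), where `np.savetxt` writes one cell's whole 36,500-row table in a single buffered call instead of 36,500 individual `write` calls:

```python
import numpy as np

# Hypothetical stand-in data: 36,500 daily rows x 7 columns for one raster cell.
n_days = 36500
table = np.column_stack([
    np.random.randint(1, 29, n_days),       # day
    np.random.randint(1, 13, n_days),       # month
    np.random.randint(1950, 2050, n_days),  # year
    np.random.uniform(-5, 15, n_days),      # MinTemp
    np.random.uniform(5, 35, n_days),       # MaxTemp
    np.random.uniform(0, 20, n_days),       # Precipitation
    np.random.uniform(0, 8, n_days),        # ReferenceET
])

header = ("%% ---------- Weather input time-series for AquaCropOS ---------- %%\n"
          "%%Day\tMonth\tYear\tMinTemp\tMaxTemp\tPrecipitation\tReferenceET%%")

# One buffered call writes the whole table; comments="" keeps the header verbatim.
np.savetxt("Weather.txt", table, fmt="%g", delimiter="\t",
           header=header, comments="")
```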
Thank you in advance!
Best regards, Anton.
Here is the code. Please let me know if I should provide additional code/info etc., as I am new to programming and have been teaching myself for a month now; apologies if things are a bit messy.
import glob
import os

import netCDF4
from cdo import Cdo
from netCDF4 import Dataset

cdo = Cdo()

weather_dir = os.path.join(os.getcwd(), "data", "raw", "weather")

precip45l = glob.glob(weather_dir + "/pr*45*d.nc")
tasmax45l = glob.glob(weather_dir + "/tasmax*45*")
tasmin45l = glob.glob(weather_dir + "/tasmin*45*")
evsps45l = glob.glob(weather_dir + "/evsps*45*")

# Merge the time slices of each variable, then convert units
# (fluxes * 86400 -> per day, temperatures - 273.15 -> degC).
cdo.mergetime(input=precip45l, output=weather_dir + "/precip45.nc")
cdo.mulc("86400", input=weather_dir + "/precip45.nc",
         output=weather_dir + "/precip45mm.nc")
precip45 = Dataset(weather_dir + "/precip45mm.nc")

cdo.mergetime(input=tasmax45l, output=weather_dir + "/tasmax45.nc")
cdo.subc("273.15", input=weather_dir + "/tasmax45.nc",
         output=weather_dir + "/tasmax45C.nc")
tasmax45 = Dataset(weather_dir + "/tasmax45C.nc")

cdo.mergetime(input=tasmin45l, output=weather_dir + "/tasmin45.nc")
cdo.subc("273.15", input=weather_dir + "/tasmin45.nc",
         output=weather_dir + "/tasmin45C.nc")
tasmin45 = Dataset(weather_dir + "/tasmin45C.nc")

cdo.mergetime(input=evsps45l, output=weather_dir + "/evsps45.nc")
cdo.mulc("86400", input=weather_dir + "/evsps45.nc",
         output=weather_dir + "/evsps45mm.nc")
evsps45 = Dataset(weather_dir + "/evsps45mm.nc")

datetime_model = netCDF4.num2date(precip45.variables["time"][:],
                                  "days since 1949-12-1 00:00:00")

def create_weather():
    time_length = range(len(datetime_model))
    file_path = os.path.join(os.getcwd(), "data", "input", "lat")
    for x in lat:        # lat and lon are index lists defined elsewhere
        for y in lon:
            fh = open(os.path.join(file_path + str(x) + "_lon" + str(y),
                                   "Weather.txt"), "w")
            fh.write("%% ---------- Weather input time-series for AquaCropOS ---------- %%\n"
                     "%%Day\tMonth\tYear\tMinTemp\tMaxTemp\tPrecipitation\tReferenceET%%")
            for i in time_length:
                fh.write("\n" + str(datetime_model[i].day)
                         + "\t" + str(datetime_model[i].month)
                         + "\t" + str(datetime_model[i].year)
                         + "\t" + str(tasmin45.variables["tasmin"][i][x][y])
                         + "\t" + str(tasmax45.variables["tasmax"][i][x][y])
                         + "\t" + str(precip45.variables["pr"][i][x][y])
                         + "\t" + str(evsps45.variables["evspsblpot"][i][x][y]))
            fh.close()   # I originally had fh.close (no parentheses), which never closed the file
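As a self-contained sketch of the alternative I am considering: here NumPy arrays stand in for the netCDF variables (same time x lat x lon layout, made-up sizes and values), each variable is sliced once per cell with `var[:, x, y]` instead of being indexed once per timestep, and each file is written in a single call. With real `netCDF4` variables the same slicing syntax applies, so this is the structure I would test:

```python
import datetime
import os
import tempfile

import numpy as np

# Stand-ins for the netCDF variables: time x lat x lon arrays of random data.
n_time, n_lat, n_lon = 365, 3, 3
rng = np.random.default_rng(0)
tasmin = rng.uniform(-5, 15, (n_time, n_lat, n_lon))
tasmax = rng.uniform(5, 35, (n_time, n_lat, n_lon))
pr = rng.uniform(0, 20, (n_time, n_lat, n_lon))
et = rng.uniform(0, 8, (n_time, n_lat, n_lon))
dates = [datetime.date(2000, 1, 1) + datetime.timedelta(days=i)
         for i in range(n_time)]

header = ("%% ---------- Weather input time-series for AquaCropOS ---------- %%\n"
          "%%Day\tMonth\tYear\tMinTemp\tMaxTemp\tPrecipitation\tReferenceET%%")

out_dir = tempfile.mkdtemp()

def create_weather_fast():
    # The date columns are identical for every cell, so format them once.
    days = [(d.day, d.month, d.year) for d in dates]
    for x in range(n_lat):
        for y in range(n_lon):
            # One slice per variable per cell instead of one read per timestep.
            rows = zip(days, tasmin[:, x, y], tasmax[:, x, y],
                       pr[:, x, y], et[:, x, y])
            lines = ["%d\t%d\t%d\t%.2f\t%.2f\t%.2f\t%.2f"
                     % (d, m, yr, tn, tx, p, e)
                     for (d, m, yr), tn, tx, p, e in rows]
            path = os.path.join(out_dir, "lat%d_lon%d_Weather.txt" % (x, y))
            # Join all lines in memory and write the file in one call.
            with open(path, "w") as fh:
                fh.write(header + "\n" + "\n".join(lines) + "\n")

create_weather_fast()
```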
I used cProfile to check the code:
21695294 function calls in 137.753 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 100.772 100.772 137.752 137.752 <ipython-input-25-a234aeb2049c>:1(create_weather)
1 0.000 0.000 137.753 137.753 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 _bootlocale.py:23(getpreferredencoding)
876576 0.558 0.000 5.488 0.000 _methods.py:37(_any)
584384 0.292 0.000 3.154 0.000 _methods.py:40(_all)
2 0.000 0.000 0.000 0.000 codecs.py:185(__init__)
2629728 2.130 0.000 3.675 0.000 function_base.py:213(iterable)
1460960 0.562 0.000 3.935 0.000 numeric.py:424(asarray)
3 0.000 0.000 0.000 0.000 posixpath.py:39(_get_sep)
3 0.000 0.000 0.000 0.000 posixpath.py:73(join)
584384 3.395 0.000 7.891 0.000 utils.py:23(_safecast)
876576 0.565 0.000 0.565 0.000 utils.py:40(_find_dim)
292192 1.744 0.000 2.227 0.000 utils.py:423(_out_array_shape)
292192 9.756 0.000 20.609 0.000 utils.py:88(_StartCountStride)
2 0.000 0.000 0.000 0.000 {built-in method _locale.nl_langinfo}
1 0.000 0.000 137.753 137.753 {built-in method builtins.exec}
3 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
2629728 1.546 0.000 1.546 0.000 {built-in method builtins.iter}
1753153 0.263 0.000 0.263 0.000 {built-in method builtins.len}
292192 0.214 0.000 0.214 0.000 {built-in method builtins.max}
2 0.001 0.000 0.001 0.000 {built-in method io.open}
1460960 3.373 0.000 3.373 0.000 {built-in method numpy.core.multiarray.array}
1168768 2.158 0.000 2.158 0.000 {built-in method numpy.core.multiarray.empty}
3 0.000 0.000 0.000 0.000 {built-in method posix.fspath}
1 0.000 0.000 0.000 0.000 {built-in method posix.getcwd}
584384 1.342 0.000 4.496 0.000 {method 'all' of 'numpy.generic' objects}
3214112 0.369 0.000 0.369 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
5 0.000 0.000 0.000 0.000 {method 'endswith' of 'str' objects}
584384 0.347 0.000 0.347 0.000 {method 'indices' of 'slice' objects}
876576 0.375 0.000 0.375 0.000 {method 'ravel' of 'numpy.ndarray' objects}
1460960 7.791 0.000 7.791 0.000 {method 'reduce' of 'numpy.ufunc' objects}
5 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects}
73050 0.199 0.000 0.199 0.000 {method 'write' of '_io.TextIOWrapper' objects}