Optimize Netcdf to Python code

Question

I have a python to sql script that reads netcdf file and inserts climatic data to a postgresql table, one row at the time. This of course takes forever, and now I would like to figure out how I can optimize this code. I have been thinking about making a huge list, and then use the copy command. However, I am unsure how one would work that out. Another way might be to write to a csv file and then copy this csv file to the postgres database using the COPY command in Postgresql. I guess that would be quicker than inserting one row at a time.

If you have any suggestions on how this could be optimized, then I would really appreciate it. The netcdf file is available here (need to register though): http://badc.nerc.ac.uk/browse/badc/cru/data/cru_ts/cru_ts_3.21/data/pre

# NetCDF to PostGreSQL database
# CRU-TS 3.21 precipitation and temperature data. From NetCDF to database table
# Requires Python2.6, Postgresql, Psycopg2, Scipy
# Tested using Vista 64bit.

# Import modules
import psycopg2, time, datetime
from scipy.io import netcdf

# Establish connection
db1 = psycopg2.connect("host=192.168.1.162 dbname=dbname user=username password=password")
cur = db1.cursor()
### Create Table
print str(time.ctime())+ " Creating precip table."
cur.execute("DROP TABLE IF EXISTS precip;")
cur.execute("CREATE TABLE precip (gid serial PRIMARY KEY not null, year int, month int, lon decimal, lat decimal, pre decimal);")

### Read netcdf file
f = netcdf.netcdf_file('/home/username/output/project_v2/inputdata/precipitation/cru_ts3.21.1901.2012.pre.dat.nc', 'r')
##
### Create lathash
print str(time.ctime())+ " Looping through lat coords."
temp = f.variables['lat'].data.tolist()
lathash = {}
for entry in temp:
    print str(entry)
    lathash[temp.index(entry)] = entry
##
### Create lonhash
print str(time.ctime())+ " Looping through long coords."
temp = f.variables['lon'].data.tolist()
lonhash = {}
for entry in temp:
    print str(entry)
    lonhash[temp.index(entry)] = entry
##
### Loop through every observation. Set timedimension and lat and long observations.
for _month in xrange(1344):

    if _month < 528:
        print(str(_month))
        print("Not yet")
    else:
        thisyear = int((_month)/12+1901)
        thismonth = ((_month) % 12)+1
        thisdate = datetime.date(thisyear,thismonth, 1)
        print(str(thisdate))
        _time = int(_month)
        for _lon in xrange(720):
            for _lat in xrange(360):
                data = [int(thisyear), int(thismonth), lonhash[_lon], lathash[_lat], f.variables[('pre')].data[_time, _lat, _lon]]
                cur.execute("INSERT INTO precip (year, month, lon, lat, pre) VALUES "+str(tuple(data))+";")


db1.commit()
cur.execute("CREATE INDEX idx_precip ON precip USING btree(year, month, lon, lat, pre);")
cur.execute("ALTER TABLE precip ADD COLUMN geom geometry;")
cur.execute("UPDATE precip SET geom = ST_SetSRID(ST_Point(lon,lat), 4326);")
cur.execute("CREATE INDEX idx_precip_geom ON precip USING gist(geom);")


db1.commit()
cur.close()
db1.close()            
print str(time.ctime())+ " Done!"

In addition to Craig's answer: With the latest PostGIS version you can use the geography type and use that directly in the table definition and load your lon/lat pair into a point on the INSERT. The geography type is also better for any SRID 4326 data as it is geodetic-aware (handy when calculating distances, for instance). Your index idx_precip should only index on year and month; you will never query for the precipitation value by year, month and location. If you want to query on precipitation value by month, make a separate index for that. In general: define your query, then your index. — Patrick, Jun 12 '14 at 09:04
That netcdf file requires a login, but to be honest, I was just curious, and have nothing useful to add to Craig's answer. — John Powell, Jun 12 '14 at 10:59

score 2 · Answer 1 · edited May 23 '17 at 12:05

Use psycopg2's copy_from.

It expects a file-like object, but that can be your own class that reads and processes the input file and returns it on demand via the read() and readlines() methods.

If you're not confident doing that, you could - as you said - generate a CSV tempfile and then COPY that. For very best performance you'd generate the CSV (Python's csv module is useful) then copy it to the server and use server-side COPY thetable FROM '/local/path/to/file', thus avoiding any network overhead.

Most of the time it's easier to use copy ... from stdin via something like psql's \copy or psycopg2's copy_from, and plenty fast enough. Especially if you couple it with producer/consumer feeding via Python's multiprocessing module (not as complicated as it sounds) so your code to parse the input isn't stuck waiting while the database writes rows.

For some more advice on speeding up bulk loading see How to speed up insertion performance in PostgreSQL - but I can see you're already doing at least some of that right, like creating indexes at the end and batching work into transactions.

score 0 · Answer 2 · edited May 23 '17 at 12:21

I had a similar demand, and I rewrote the Numpy array into a PostgreSQL binary input file format. The main drawback is that all columns of the target table need to be inserted, which gets tricky if you need to encode your geometry WKB, however you can use a temporary unlogged table to load the netCDF file into, then select that data into another table with the proper geometry type.

Details here: https://stackoverflow.com/a/8150329/327026

Optimize Netcdf to Python code

2 Answers2