
Bit of a puzzler here.

I have a database table that has a set of statistics as columns: a set of moving averages that are all very similar.

It was really easy to set up a function to iterate over my data set a bunch of times, and update rows in the database a bunch of times.

BUT it's ridiculously slow, and a big long script that does just one set of updates, one time, would take a lot of time to write and be a pain to maintain.

I was wondering if there is a way to somehow stay OOP but also not do so much redundant seeking / number crunching.

I was wondering if there might be an easier way... SSIS would have let me pull the data out once, merge it at the end, and throw it back into the database. But I'm not sure how I'd do that with Python...

So I defined a function in Python that computes an n-day moving average:

    def moving_average(conn, cur, dates, days, stock_id, col_name):
        if len(dates) < days:
            print "WARNING: Cannot do a " + str(days) + " day moving average for stock id " + str(stock_id) + "!  Not enough data!"
        else:
            # one window per row that still has `days` rows ahead of it
            for x in range(0, len(dates) - days + 1):
                row_id = dates[x][0]
                numerator = 0
                for i in range(0, days):
                    numerator = numerator + dates[x + i][3]  # sum the whole window, not the same row repeatedly
                moving_average_for_period = round(numerator / float(days), 2)
                # update once per row, after the window has been summed
                sql = "UPDATE historical_prices SET " + col_name + " = %s WHERE id = %s;"
                cur.execute(sql, (moving_average_for_period, row_id))
            conn.commit()

So far so good.

Then I call:

    # now do the moving averages
    moving_average(conn, cur, dates, 5, stock_id, 'five_day_ma')
    moving_average(conn, cur, dates, 10, stock_id, 'ten_day_ma')
    moving_average(conn, cur, dates, 15, stock_id, 'fifteen_day_ma')
    moving_average(conn, cur, dates, 30, stock_id, 'thirty_day_ma')
    moving_average(conn, cur, dates, 60, stock_id, 'sixty_day_ma')
    moving_average(conn, cur, dates, 120, stock_id, 'onetwenty_day_ma')

See my problem? I do my I/O 6x.

Thanks in advance for any help.

Cheers,

Dan

  • It's slow not because you're doing this six times, but because each `moving_average` call issues row-at-a-time UPDATEs against the table. Two ways you can approach this: 1. load `dates` into the database and then write a single query to update the data; 2. load the entire table into memory, operate on it, and then save it back in one go (a sketch of this follows these comments). The approach you take depends on the relative sizes of `dates` and your `historical_prices` table. – univerio May 09 '14 at 19:11
  • If you decide to do it in SQL, here's a nice writeup of how to do a moving average in PostgreSQL (a window-function sketch also follows below): http://stackoverflow.com/questions/14749450/moving-average-in-postgresql – univerio May 09 '14 at 19:15
  • The reason I didn't do it in SQL is that at the last shop I was at, we'd have tables with literally millions of rows, and if I did an SQL UPDATE it might grind on for HOURS, but if I used SSIS it would take 5 minutes -- it would literally grab the data with a SELECT, update it in memory, and send it back to the database. My guess is SSIS wasn't using UPDATE queries to do that? I would have thought Python would be as fast (or faster), but it seems like I need to think in a different paradigm. – user3621274 May 10 '14 at 14:50
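
Here is a minimal sketch of univerio's second suggestion: read the rows once, compute all six averages in memory, and push the results back in one batch. It assumes psycopg2, the row layout from the question (id at index 0, closing price at index 3), and the same forward-looking window as the question's code; `executemany` still sends one statement per row, but the data is read and crunched only once, inside a single transaction.

    def all_moving_averages(conn, cur, dates):
        # windows and target columns, mirroring the six calls in the question
        windows = [('five_day_ma', 5), ('ten_day_ma', 10), ('fifteen_day_ma', 15),
                   ('thirty_day_ma', 30), ('sixty_day_ma', 60), ('onetwenty_day_ma', 120)]
        prices = [row[3] for row in dates]  # closing price assumed to be at index 3
        params = []
        for x in range(len(dates)):
            row_values = []
            for _, days in windows:
                if x + days <= len(prices):
                    # average of the `days` rows starting at x, as in the question
                    row_values.append(round(sum(prices[x:x + days]) / float(days), 2))
                else:
                    row_values.append(None)  # not enough data left: store NULL
            params.append(tuple(row_values) + (dates[x][0],))  # row id assumed at index 0
        set_clause = ", ".join(name + " = %s" for name, _ in windows)
        sql = "UPDATE historical_prices SET " + set_clause + " WHERE id = %s;"
        cur.executemany(sql, params)  # one parameterized statement, one pass over the data
        conn.commit()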
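
And a hedged sketch of the pure-SQL route from the linked writeup: a window function computes a trailing five-day average server-side in a single statement, so no rows ever travel to Python. The ordering column (`price_date`), the price column (`close_price`), and the `stock_id` filter are assumptions about the schema; window functions aren't allowed directly in SET, hence the subquery join.

    sql = """
        UPDATE historical_prices AS h
        SET five_day_ma = s.ma
        FROM (
            SELECT id,
                   round(CAST(avg(close_price) OVER (
                       ORDER BY price_date
                       ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
                   ) AS numeric), 2) AS ma
            FROM historical_prices
            WHERE stock_id = %s
        ) AS s
        WHERE h.id = s.id;
    """
    cur.execute(sql, (stock_id,))
    conn.commit()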

1 Answer


While pure SQL is often faster than one-at-a-time updates from Python (or Java, et al.), a single UPDATE across a large table can still slow down dramatically as the table grows, even when only a relatively small number of rows is being updated.

If you do go the pure-SQL route (by which I mean issuing an UPDATE from Python that touches a large number of rows) -- and I would definitely recommend that approach in this case -- I would recommend chunking it using LIMIT and OFFSET so that each UPDATE is much smaller. Postgres often handles that much better. (I have had an UPDATE that wouldn't complete after 5+ hours as a single statement finish in about 30 minutes when I ran it, via Python, in chunks of 5,000 rows.) A sketch of the pattern follows.
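
Postgres's UPDATE has no LIMIT clause of its own, so one common way to chunk it (a sketch under that assumption; the SET clause below is a no-op stand-in for the real one) is to page through ids with LIMIT and OFFSET in a subquery, committing after each batch:

    CHUNK = 5000  # batch size; tune to taste
    offset = 0
    while True:
        cur.execute("""
            UPDATE historical_prices
            SET five_day_ma = five_day_ma  -- no-op stand-in for the real SET clause
            WHERE id IN (
                SELECT id FROM historical_prices
                ORDER BY id
                LIMIT %s OFFSET %s
            );
        """, (CHUNK, offset))
        if cur.rowcount == 0:
            break  # the subquery ran out of rows
        conn.commit()  # commit per chunk so no single transaction balloons
        offset += CHUNK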

Of course, if your tables are small enough, you may not run into that issue, but it's something to keep in mind.

– khampson