
I have a database with roughly 30 million entries, which is a lot, and I don't expect anything but trouble working with result sets that large.

But using py-postgresql and its .prepare() statement, I was hoping I could fetch entries on a "yield" basis and thus avoid filling up my memory with nothing but the results from the database, which apparently I can't?

This is what I've got so far:

import postgresql

user = 'test'
passwd = 'test'

# Open the connection and prepare the statement
db = postgresql.open('pq://' + user + ':' + passwd + '@192.168.1.1/mydb')
results = db.prepare("SELECT time FROM mytable")

uniqueue_days = []
with db.xact():
    for row in results():
        if not row['time'] in uniqueue_days:
            uniqueue_days.append(row['time'])

print(uniqueue_days)

Before even getting to `if not row['time'] in uniqueue_days:` I run out of memory, which isn't so strange considering results() probably fetches all rows before looping through them. Is there a way to get the postgresql library to "page" or batch the results in, say, 60k rows per round, or perhaps even rework the query so more of the work is done on the database side?

Thanks in advance!

Edit: I should mention that the dates in the database are Unix timestamps, and I intend to convert them into %Y-%m-%d format prior to adding them to the uniqueue_days list.
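
Roughly like this, assuming the time column holds integer epoch seconds:

from datetime import datetime, timezone

ts = 1391212800  # example Unix timestamp taken from the time column
day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%Y-%m-%d')  # '2014-02-01'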

Torxed

2 Answers

3

If you were using the better-supported psycopg2 extension, you could loop over a named cursor, or call fetchone/fetchmany on it, to get a few rows at a time, since psycopg2 backs a named cursor with a server-side portal.
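
A minimal sketch of that approach (the connection parameters are copied from the question; itersize just controls how many rows each network round trip pulls from the server-side portal):

import psycopg2

conn = psycopg2.connect(host='192.168.1.1', dbname='mydb',
                        user='test', password='test')

# A *named* cursor is backed by a server-side portal, so rows are
# streamed in batches instead of being loaded into memory all at once.
cur = conn.cursor(name='mytable_scan')
cur.itersize = 60000          # rows fetched per round trip

cur.execute("SELECT time FROM mytable")

unique_days = set()           # a set avoids the O(n) membership test
for (t,) in cur:              # iterating fetches itersize rows at a time
    unique_days.add(t)

cur.close()
conn.close()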

If py-postgresql doesn't support something similar, you could always explicitly DECLARE a cursor on the database side and FETCH rows from it progressively. I don't see anything in the documentation that suggests py-postgresql can do this for you automatically at the protocol level like psycopg2 does.
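
That manual approach might look something like this (sketched with psycopg2 again for concreteness, since I can't vouch for the exact py-postgresql calls; the 60k batch size matches the figure in the question):

import psycopg2

conn = psycopg2.connect(host='192.168.1.1', dbname='mydb',
                        user='test', password='test')
cur = conn.cursor()

# DECLARE only works inside a transaction; psycopg2 opens one implicitly.
cur.execute("DECLARE mycur CURSOR FOR SELECT time FROM mytable")

unique_days = set()
while True:
    cur.execute("FETCH FORWARD 60000 FROM mycur")   # pull the next batch
    rows = cur.fetchall()
    if not rows:
        break
    for (t,) in rows:
        unique_days.add(t)

cur.execute("CLOSE mycur")
conn.commit()
conn.close()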

Usually you can switch between database drivers pretty easily, but py-postgresql doesn't seem to follow the Python DB-API, so trying psycopg2 out will take a few more code changes than usual. I still recommend it, though.

Craig Ringer
  • I'll consider changing the db library for sure if there are advantages to it (and by the sound of it, there might be). However, that will be down the road because of timeline issues on this particular project. But I would just like to show my gratitude for your answer, since it is helpful and will be used! – Torxed Feb 06 '14 at 08:38
1

You could let the database do all the heavy lifting. For example, instead of reading all the data into Python and then calculating the unique dates, why not try something like this:

SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES FROM mytable;

If you want to strictly enforce a sort order on the returned unique dates, then do the following:

SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES
FROM mytable
ORDER BY 1;
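
From the Python side, running that through the same connection the question already opens could look roughly like this (the prepare-then-call pattern and the row['...'] access are taken from the question's own code):

import postgresql

db = postgresql.open('pq://test:test@192.168.1.1/mydb')

# The DISTINCT/DATE work now happens inside PostgreSQL, so only one
# row per distinct day ever reaches Python.
get_days = db.prepare("SELECT DISTINCT DATE(to_timestamp(time)) AS unique_dates "
                      "FROM mytable ORDER BY 1")
uniqueue_days = [row['unique_dates'] for row in get_days()]
print(uniqueue_days)

Since this returns at most one row per calendar day, materialising the whole result in a list is no longer a memory problem.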

Useful references for the functions used above:

Date/Time Functions and Operators

Data Type Formatting Functions

If you would like to read the data in chunks, you could use the dates you get from the above query to subset your results further down the line. For example:

'SELECT * FROM mytable WHERE time BETWEEN ' + UNIQUE_DATES[i] + ' AND ' + UNIQUE_DATES[j]

where UNIQUE_DATES[i] and UNIQUE_DATES[j] are values you would pass in from Python.
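
A slightly safer variant of the same idea, using $1/$2 parameter placeholders instead of string concatenation (sketch only: db is the connection from the question, and the bounds are assumed to already be Unix timestamps):

# Prepare once, then call with different bounds for each chunk.
get_chunk = db.prepare("SELECT * FROM mytable WHERE time BETWEEN $1 AND $2")

lower = 1391212800   # example bounds as Unix timestamps: 2014-02-01 ...
upper = 1391299200   # ... to 2014-02-02
for row in get_chunk(lower, upper):
    print(row['time'])   # do the real per-row work here

Parameter binding also lets PostgreSQL reuse the prepared plan across chunks.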

I will leave it to you to figure out how to convert the dates back into Unix timestamps.

  • Thanks, the first example gave me the result I needed straight off the bat. The second one I already had in mind, but I hoped for a better solution. It took a minute to calculate the results internally in PostgreSQL, which is way better than I'd hoped. So thank you! – Torxed Feb 05 '14 at 09:57
  • You may want to consider adding an index on the `time` field, given that you have millions of rows and you are going to use that field to subset your data (see the sketch after these comments). – Feb 05 '14 at 10:01
  • I've thought about it, but I'm kinda new to how PostgreSQL indexes things and how you set one up. It's on my todo-list for sure to speed up operations! – Torxed Feb 05 '14 at 10:06
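
For reference, creating that index is a single statement. A minimal sketch using the connection already opened in the question (the index name mytable_time_idx is made up, and I'm assuming py-postgresql's db.execute accepts arbitrary SQL):

# One-time setup: an index on "time" lets PostgreSQL answer the
# BETWEEN subsetting above without scanning all 30 million rows.
db.execute("CREATE INDEX mytable_time_idx ON mytable (time)")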