
I am trying to pull a 1.7G file into a pandas dataframe from a Greenplum postgres data source. The psycopg2 driver takes 8 or so minutes to load it. Using the pandas "chunksize" parameter does not help, as the psycopg2 driver selects all data into memory and then hands it off to pandas, using a lot more than 2G of RAM.

To get around this, I'm trying to use a named cursor, but all the examples I've found then loop through it row by row, and that just seems slow. But the main problem appears to be that my SQL just stops working in the named query for some unknown reason.

Goals

  • load the data as quickly as possible without doing any "unnatural acts"
  • use SQLAlchemy if possible (for consistency with the rest of the code)
  • have the results in a pandas dataframe for fast in-memory processing (alternatives?)
  • have a "pythonic" (elegant) solution. I'd love to do this with a context manager but haven't gotten that far yet.

    # Named Cursor Chunky Access Test
    import pandas as pd
    import psycopg2
    import psycopg2.extras

    # Connect to database - works
    conn_chunky = psycopg2.connect(
        database=database, user=username, password=password, host=hostname)
    # Open named cursor - appears to work
    cursor_chunky = conn_chunky.cursor(
        'buffered_fetch', cursor_factory=psycopg2.extras.DictCursor)
    cursor_chunky.itersize = 100000

    # This is where the problem occurs - the SQL works just fine in all other tests, returns 3.5M records
    result = cursor_chunky.execute(sql_query)
    # result returns None (normal behavior) but result is not iterable

    df = pd.DataFrame(result.fetchall())
    

The pandas call returns AttributeError: 'NoneType' object has no attribute 'fetchall'. The failure seems to be due to the named cursor being used. I have tried fetchone, fetchmany, etc. Note the goal here is to let the server chunk and serve up the data in large chunks, so that there is a balance of bandwidth and CPU usage. Looping through df = df.append(row) is just plain fugly.
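
For reference, psycopg2's cursor.execute() always returns None; the rows have to come from the cursor object itself, either by iterating it or by calling fetchmany() on it. A minimal sketch of that pattern (reusing conn_chunky, cursor_chunky and sql_query from above; the batch size is just an example) would be:

    # Sketch: pull rows from the named (server-side) cursor in batches,
    # then build the DataFrame once from the collected chunks.
    cursor_chunky.execute(sql_query)            # always returns None by design
    chunks = []
    while True:
        rows = cursor_chunky.fetchmany(100000)  # one server round-trip per batch
        if not rows:
            break
        chunks.append(pd.DataFrame(rows))       # column names could be taken from cursor_chunky.description
    df = pd.concat(chunks, ignore_index=True)
    cursor_chunky.close()
    conn_chunky.close()

psycopg2 connections and cursors can also be used as context managers (with psycopg2.connect(...) as conn: and with conn.cursor('buffered_fetch') as cur:), which would cover the "pythonic" goal above. Whether this actually keeps memory and run time down on a 3.5M-row result is exactly what I'm trying to find out.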


Added standard client-side chunking code per request:

nrows = 3652504
size = nrows // 1000  # chunksize must be an integer
idx = 0
first_loop = True
for dfx in pd.read_sql(iso_cmdb_base, engine, coerce_float=False, chunksize=size):
    if first_loop:
        df = dfx
        first_loop = False
    else:
        df = df.append(dfx, ignore_index=True)

1 Answer


UPDATE:

# Chunked access
import time

import pandas as pd
from sqlalchemy import create_engine

start = time.time()
engine = create_engine(conn_str)
size = 10**4
df = pd.concat((x for x in pd.read_sql(iso_cmdb_base, engine, coerce_float=False, chunksize=size)),
               ignore_index=True)
print('time:', (time.time() - start)/60, 'minutes or ', time.time() - start, 'seconds')
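
If server-side chunking with the psycopg2 driver is still wanted, one SQLAlchemy-level option (a sketch only; behaviour depends on the SQLAlchemy and psycopg2 versions in play) is the stream_results execution option, which has the driver read from a server-side cursor so each chunk comes off the wire as read_sql iterates. Reusing engine, size and iso_cmdb_base from above:

# Sketch: stream results through a server-side cursor instead of buffering
# the whole result set client-side before pandas sees it.
with engine.connect().execution_options(stream_results=True) as conn:
    df = pd.concat(pd.read_sql(iso_cmdb_base, conn, coerce_float=False, chunksize=size),
                   ignore_index=True)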

OLD answer:

I'd try to read the data from PostgreSQL using the built-in pandas method read_sql():

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user@localhost:5432/dbname')

df = pd.read_sql(sql_query, engine)
  • That's exactly what I'm doing on my other queries, but if you notice, I create a named cursor, which is neither a connection nor a query. I haven't tried feeding the dataframe a named cursor yet, nor have I found a good example of this. – Harvey Dec 19 '16 at 21:32
  • @Harvey, I don't quite understand why you need a named cursor - how is it going to help? – MaxU - stand with Ukraine Dec 19 '16 at 21:36
  • Yeah, that barfed up a hairball: DatabaseError: Execution failed on sql '': argument 1 must be a string or unicode object There were bits everywhere. It was horrible. :) – Harvey Dec 19 '16 at 21:37
  • The named cursor allows server-side chunking of the data. For some reason that escapes me, the 1.7G database (once in pandas) uses ~16GB of RAM. Other drivers such as pg8000 use 13GB RAM and take ~20 minutes. The idea was to get the server chunking and the pandas processing time interleaved such that they are both busy at the same time. But first I have to get it working. – Harvey Dec 19 '16 at 21:40
  • @Harvey, i've updated my answer - can you check whether it helps? – MaxU - stand with Ukraine Dec 19 '16 at 21:54
  • That ran time: 6.05 minutes or 363 seconds and used ~6G RAM (or half what my version uses and 20% faster). Which is great (and thanks for the new trick there!), but I'd like to get named cursors/server-side chunking working, or at least know what I'm doing wrong. – Harvey Dec 19 '16 at 22:13
  • The standard behavior is the psycopg2 driver gets all the data in the first fetch, buffers it until complete, then hands it off to pandas, effectively using 2X memory at minimum and taking much longer than it should. – Harvey Dec 20 '16 at 13:36
  • @Harvey, I don't think psycopg2 is that stupid. If you read in chunks, it should fetch one chunk, pass it to pandas, read the next chunk, and so on. – MaxU - stand with Ukraine Dec 20 '16 at 14:40
  • See the Server Side Cursor code at https://wiki.postgresql.org/wiki/Using_psycopg2_with_PostgreSQL for confirmation. I've found this same detail in several other posts. It's not a bug but a design feature. – Harvey Dec 20 '16 at 15:29