
I have data as below:

ID  date    Net  Total     Class
11  201706  XN   0.607500  P
53  201709  M9   0.989722  V
68  201709  FM   3.736944  P
61  201701  ZK   1.121388  B
17  201705  F    1.969722  V

This is a huge table (0.5 billion records) in PostgreSQL, and I need to pull a subset of it into Python as below:

sql_subset = """select * from table_name
                where date in (201709, 201710, 201711, 201712)
                  and Class = 'V';"""

df_sub = pd.read_sql(sql_subset, cnxn)

Using read_sql gives a memory error. The subsetted query returns around 0.58 million records, and I am looking for a way to import it into Python.
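
One commonly suggested first step is read_sql's chunksize parameter, which returns an iterator of DataFrames instead of one big frame. A minimal sketch, assuming the sql_subset query and cnxn connection above (the chunk size of 50,000 is arbitrary):

import pandas as pd

parts = []
# Only one chunk of rows is materialized as a DataFrame at a time; each
# chunk can be filtered or aggregated here before being kept, so the
# final concatenated result stays small.
for chunk in pd.read_sql(sql_subset, cnxn, chunksize=50000):
    parts.append(chunk)
df_sub = pd.concat(parts, ignore_index=True)

Note that with some drivers (e.g. psycopg2's default client-side cursor) the full result set may still be buffered on the client before chunking; the server-side cursor sketch below the comments avoids that.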

Things I have considered:

  1. HDF5 and PyTables - I am not sure how to pull data from PostgreSQL into HDF5/PyTables using Python (see the HDF5 sketch after this list).
  2. Splitting into a CSV for each month and then combining them into a pandas DataFrame - but wouldn't that still give a memory error?

Can someone please point me in the right direction for solving this?
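
For option 1, a minimal sketch of streaming the query result into an HDF5 file via pandas' HDFStore, assuming the sql_subset query and cnxn connection above (the file name subset.h5 and the chunk size are placeholders; this requires the PyTables package to be installed):

import pandas as pd

with pd.HDFStore("subset.h5", mode="w") as store:
    # Stream the result in chunks and append each one to an on-disk
    # table, so the full result never has to fit in memory at once.
    for chunk in pd.read_sql(sql_subset, cnxn, chunksize=50000):
        store.append("df_sub", chunk, data_columns=["date", "Class"], index=False)

# Later, read back only what is needed, e.g. one month at a time:
# df_month = pd.read_hdf("subset.h5", "df_sub", where="date == 201709")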

  • Would using "LIMIT" and "OFFSET" be an option? Do you need to work on the whole dataset at the same time, or would a smaller portion be sufficient? – Aleksei Maide Jan 17 '18 at 08:00
  • Possible duplicate of https://stackoverflow.com/questions/18107953/how-to-create-a-large-pandas-dataframe-from-an-sql-query-without-running-out-of – Hari Jan 17 '18 at 08:04
  • @AlekseiMaide, I cannot use chunksize or offset; the most I can filter is to take it month-wise and then aggregate the data. – Shuvayan Das Jan 17 '18 at 08:26
  • @Hari, the question might be a duplicate as you mentioned, but the other one doesn't have a proper answer either. Thanks for pointing it out, though. – Shuvayan Das Jan 17 '18 at 08:27
  • Check server side cursor: http://initd.org/psycopg/docs/usage.html#server-side-cursors – Clodoaldo Neto Jan 17 '18 at 11:09
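
A minimal sketch of the server-side cursor approach from the last comment, assuming a psycopg2 connection (the DSN, cursor name, and batch size are placeholders):

import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder connection details

# Giving the cursor a name makes psycopg2 create a server-side cursor:
# PostgreSQL keeps the result set and streams it in batches of itersize
# rows, instead of the driver buffering every row on the client at once.
with conn.cursor(name="sub_cursor") as cur:
    cur.itersize = 50000
    cur.execute(sql_subset)
    rows = cur.fetchall()  # rows arrive from the server in batches
    cols = [desc[0] for desc in cur.description]
df_sub = pd.DataFrame(rows, columns=cols)

fetchall() still builds the final result in memory, but the rows arrive from the server in itersize batches rather than the driver holding the entire raw result at once.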

0 Answers