
I have data as below:

ID  date    Net  Total     Class
11  201706  XN   0.607500  P
53  201709  M9   0.989722  V
68  201709  FM   3.736944  P
61  201701  ZK   1.121388  B
17  201705  F    1.969722  V

This is a huge table (0.5 billion records) in PostgreSQL, and I need to pull a subset of it into Python as below:

sql_subset = """select * from table_name
                where date in (201709, 201710, 201711, 201712)
                  and Class = 'V';"""

df_sub = pd.read_sql(sql_subset, cnxn)

Using read_sql gives a memory error. The subsetted query returns around 0.58 million records, and I am looking for a way to import it into Python.
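
One commonly suggested first step is read_sql's chunksize parameter, which returns an iterator of DataFrames instead of one big frame. A minimal sketch, assuming the sql_subset query and cnxn connection above (the chunk size of 50,000 is arbitrary):

import pandas as pd

parts = []
# Only one chunk of rows is materialized as a DataFrame at a time; each
# chunk can be filtered or aggregated here before being kept, so the
# final concatenated result stays small.
for chunk in pd.read_sql(sql_subset, cnxn, chunksize=50000):
    parts.append(chunk)
df_sub = pd.concat(parts, ignore_index=True)

Note that with some drivers (e.g. psycopg2's default client-side cursor) the full result set may still be buffered on the client before chunking; the server-side cursor sketch below the comments avoids that.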

Things I have considered:

  1. HDF5 and PyTables - I am not sure how to pull data from PostgreSQL into HDF5/PyTables using Python (see the HDF5 sketch after this list).
  2. Splitting into a CSV for each month and then combining them into a pandas DataFrame - but wouldn't that still give a memory error?

Can someone please point me in the right direction for solving this?
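
For option 1, a minimal sketch of streaming the query result into an HDF5 file via pandas' HDFStore, assuming the sql_subset query and cnxn connection above (the file name subset.h5 and the chunk size are placeholders; this requires the PyTables package to be installed):

import pandas as pd

with pd.HDFStore("subset.h5", mode="w") as store:
    # Stream the result in chunks and append each one to an on-disk
    # table, so the full result never has to fit in memory at once.
    for chunk in pd.read_sql(sql_subset, cnxn, chunksize=50000):
        store.append("df_sub", chunk, data_columns=["date", "Class"], index=False)

# Later, read back only what is needed, e.g. one month at a time:
# df_month = pd.read_hdf("subset.h5", "df_sub", where="date == 201709")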

  • Would using "LIMIT" and "OFFSET" be an option? Do you need to work on the whole dataset at the same time, or would a smaller portion be sufficient? – Aleksei Maide Jan 17 '18 at 08:00
  • Possible duplicate of https://stackoverflow.com/questions/18107953/how-to-create-a-large-pandas-dataframe-from-an-sql-query-without-running-out-of – Hari Jan 17 '18 at 08:04
  • @AlekseiMaide, I cannot use chunksize or offset; the most I can filter is to take it month-wise and then aggregate the data. – Shuvayan Das Jan 17 '18 at 08:26
  • @Hari, the question might be a duplicate as you mentioned, but the other one doesn't have a proper answer either. Thanks for pointing it out, though. – Shuvayan Das Jan 17 '18 at 08:27
  • Check server side cursor: http://initd.org/psycopg/docs/usage.html#server-side-cursors – Clodoaldo Neto Jan 17 '18 at 11:09
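
A minimal sketch of the server-side cursor approach from the last comment, assuming a psycopg2 connection (the DSN, cursor name, and batch size are placeholders):

import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder connection details

# Giving the cursor a name makes psycopg2 create a server-side cursor:
# PostgreSQL keeps the result set and streams it in batches of itersize
# rows, instead of the driver buffering every row on the client at once.
with conn.cursor(name="sub_cursor") as cur:
    cur.itersize = 50000
    cur.execute(sql_subset)
    rows = cur.fetchall()  # rows arrive from the server in batches
    cols = [desc[0] for desc in cur.description]
df_sub = pd.DataFrame(rows, columns=cols)

fetchall() still builds the final result in memory, but the rows arrive from the server in itersize batches rather than the driver holding the entire raw result at once.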

0 Answers