I have data as below:
ID date Net Total Class
11 201706 XN 0.607500 P
53 201709 M9 0.989722 V
68 201709 FM 3.736944 P
61 201701 ZK 1.121388 B
17 201705 F 1.969722 V
This is a huge table (0.5 billion records) in PosteGreSQL and I need to pull a subset of it into Python as below:
sql_subset = str("""select * from table_name where date in
(201709,201710,201711,201712) and Class = 'V';""")
df_sub = pd.read_sql(sql_subset,cnxn)
Using read_sql gives memory error. This subsetted query is around 0.58 million records and I am looking at a way to import this into python.
Things I have considered:
- HDF5 and PyTables - Not sure how to pull data from PostgreSQL into HDF5/PyTables using Python.
Splitting into csv for each month and then combining them into a pandas dataframe - but wouldn't that still give memory error.
Can someone please point me in the right direction for solving this?