
I need to join all the PostgreSQL tables and convert them into a Python dictionary. There are 72 tables in the database, and the total number of columns is greater than 1,600.

I wrote a simple Python script that joins several tables, but it fails to join all of them because of a memory error: all of the memory is consumed while the script runs, even though I run it on a new virtual server with 128 GB of RAM and 8 CPUs. It fails during the lambda function execution.

How could the following code be improved so that it can join all of the tables?

from functools import reduce

from sqlalchemy import create_engine
import pandas as pd

auth = 'user:pass'
engine = create_engine('postgresql://' + auth + '@host.com:5432/db')

sql_tables = ['table0', 'table1', 'table3', ..., 'table72']

# Load every table into its own DataFrame
df_arr = [pd.read_sql_query('select * from "' + table + '"', con=engine)
          for table in sql_tables]

# Outer-join all DataFrames on USER_ID, one pair at a time
df_join = reduce(lambda left, right: pd.merge(left, right, how='outer', on=['USER_ID']), df_arr)

# Replace NaN with 'no_data' and convert the result to a dictionary
raw_dict = df_join.where(pd.notnull(df_join), 'no_data').to_dict()

print(df_join)
print(raw_dict)
print(len(df_arr))

Is it ok to use Pandas for my purpose? Are there better solutions?

The ultimate goal is to denormalize DB data to be able to index it into Elasticsearch as documents, one document per user.

  • How many columns do you have in all 72 tables? If your resulting DF will have <= 250 columns then you can and should join them in PostgreSQL - you will not experience memory errors. – MaxU - stand with Ukraine Sep 14 '16 at 20:40
  • Why don't you tell us your ultimate objective so we can help you with a better solution? – Neil McGuigan Sep 14 '16 at 20:50
  • @MaxU, I don't remember the exact count but I'm sure the number of columns is greater than 250. – srgbnd Sep 14 '16 at 21:36
  • @NeilMcGuigan I want to denormalize DB data to be able to index it into Elasticsearch as documents, one document per user https://www.elastic.co/guide/en/elasticsearch/guide/current/denormalization.html – srgbnd Sep 14 '16 at 21:41
  • The easiest thing to do is to use a larger machine. 128MB of RAM is not much. Launch something with 4GB and see if this runs (I doubt it will though). You would most likely be better off pulling the data out one row at a time, and probably from one table at a time too. A 72 table join is going to consume a whole lot of memory and swap space on your postgres machine -- and then take up a lot of space in Python with its representation. – Jonathan Vanasco Sep 15 '16 at 15:25
  • @JonathanVanasco I use a server with **128GB** – srgbnd Sep 15 '16 at 16:25
  • Sorry, I misread that. That is a huge join, you need to reformat the data in steps. – Jonathan Vanasco Sep 15 '16 at 16:37

2 Answers


Why don't you create a PostgreSQL function instead of a script?

Here is some advice that could help you avoid the memory error:

  • You can use a WITH clause, which makes better use of your memory.
  • You can create some physical staging tables that store the information from different groups of tables in your database. These physical tables avoid holding everything in memory at once. After that, all you have to do is join those staging tables; you can create a function for it (see the sketch after this list).
  • You can create a data warehouse by denormalizing the tables you need.
  • Last but not least: make sure you are using indexes appropriately.
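
To make the staging-table idea concrete, here is a minimal sketch driven from Python with SQLAlchemy, so it stays in line with the code in the question. The grouping of the source tables and the staging-table names are made up for illustration; it also assumes every table has a "USER_ID" column (as in the question) and that the remaining column names do not collide within a group.

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:pass@host.com:5432/db')

# Hypothetical grouping of the 72 source tables into smaller batches;
# each group is materialized as one physical staging table inside PostgreSQL,
# so the joins run server-side and nothing has to fit in Python's memory yet.
table_groups = {
    'staging_group_a': ['table0', 'table1', 'table2'],
    'staging_group_b': ['table3', 'table4', 'table5'],
}

with engine.begin() as conn:
    for staging_name, tables in table_groups.items():
        joins = ' '.join('FULL OUTER JOIN "{}" USING ("USER_ID")'.format(t)
                         for t in tables[1:])
        conn.execute(text(
            'CREATE TABLE {} AS SELECT * FROM "{}" {}'.format(
                staging_name, tables[0], joins)))

    # Finally join only the handful of staging tables (this part could also
    # live in a PostgreSQL function, as suggested above).
    conn.execute(text(
        'CREATE TABLE staging_all AS '
        'SELECT * FROM staging_group_a '
        'FULL OUTER JOIN staging_group_b USING ("USER_ID")'))

The final staging_all table can then be read back with pd.read_sql_query, possibly with a chunksize, instead of merging 72 DataFrames in pandas.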
Luis Teijon

I'm not certain this will help, but you can try pd.concat:

raw_dict = pd.concat([d.set_index('USER_ID') for d in df_arr], axis=1)

Or, to get a bit more distinction between the source tables:

raw_dict = pd.concat([d.set_index('USER_ID') for d in df_arr], axis=1, keys=sql_tables)
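
On the off chance it helps further, here is a short follow-up sketch (reusing pd and df_arr from your script, and assuming the concatenation above succeeds, USER_ID values are unique, and column names don't collide across tables) that turns the result into one dictionary per user, which is your end goal:

# One dict per USER_ID, with missing values filled as in the question.
df_all = pd.concat([d.set_index('USER_ID') for d in df_arr], axis=1)
per_user = df_all.where(pd.notnull(df_all), 'no_data').to_dict(orient='index')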

If this isn't helpful, let me know and I'll delete it.

piRSquared
  • I get the following error during `pd.concat` execution: https://gist.github.com/SergeyBondarenko/f78444f709cffec3e6777d3a16e38f2f – srgbnd Sep 15 '16 at 09:31