I've got a pandas dataframe that's about 12,000 rows, and a Postgres table that's about 8 million rows.

I'd like to run a left join on the pandas dataframe against the Postgres table.

Three ways I can think of doing this:

  1. Load the full Postgres table into a pandas dataframe, then `pd.merge` the two dataframes. Disadvantages: the notebook is tied up for ages loading 8 million rows, and the full table takes up a lot of memory.
  2. Convert the pandas dataframe into a temporary Postgres table, run the join in Postgres using psycopg2, capture the results into a dataframe, then delete the temporary table (a rough sketch of this is below). Disadvantages: fiddly.
  3. Iterate over the pandas dataframe row by row, issuing a separate query for each row and writing each result back individually. Disadvantages: not very pandas-y, and is this even practical?

I'm leaning towards (2), but is there any other approach I haven't thought of, or any other advantages/disadvantages I should consider?
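
For concreteness, here's a rough sketch of what I have in mind for (2), using SQLAlchemy on top of psycopg2. The connection string is a placeholder, and the names `key`, `some_col`, and `big_table` are stand-ins for my real join key, columns, and table:

```python
import pandas as pd
import sqlalchemy as sa

# Placeholder connection string; swap in real credentials.
engine = sa.create_engine("postgresql+psycopg2://user:pass@localhost:5432/mydb")

# Stand-in for the real ~12,000-row dataframe; "key" is the assumed join column.
small_df = pd.DataFrame({"key": [1, 2, 3], "extra": ["a", "b", "c"]})

with engine.begin() as conn:
    # Stage the small dataframe as a throwaway table on the Postgres side.
    small_df.to_sql("small_df_staging", conn, if_exists="replace", index=False)

    # Do the left join in Postgres, where the 8-million-row table lives,
    # and pull only the joined result back into pandas.
    result = pd.read_sql(
        """
        SELECT s.*, b.some_col
        FROM small_df_staging AS s
        LEFT JOIN big_table AS b ON b.key = s.key
        """,
        conn,
    )

    # Throw away the staging table.
    conn.execute(sa.text("DROP TABLE small_df_staging"))
```

Note that `to_sql` creates an ordinary table rather than a true `TEMPORARY` table, hence the explicit `DROP` at the end.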

Richard
  • You definitely don't want to try the third option. I've done row-by-row iteration in pandas myself and it is extremely slow. A workaround to make (1) faster would be to use an online notebook (e.g. Google Colab or Kaggle). – Javier Lopez Tomas Nov 12 '19 at 15:30
  • #2 for sure. You don't have to manually delete the table if you create it in a transaction and use `CREATE TEMPORARY TABLE ... ON COMMIT DROP`. – Jeremy Nov 12 '19 at 18:10
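
For reference, a minimal sketch of the transaction-scoped temporary table Jeremy describes, using raw psycopg2; the DSN, join key, and table/column names (`key`, `some_col`, `big_table`) are hypothetical:

```python
import pandas as pd
import psycopg2
from psycopg2.extras import execute_values

# Hypothetical DSN; swap in real connection details.
conn = psycopg2.connect("dbname=mydb user=me")
small_df = pd.DataFrame({"key": [1, 2, 3]})  # stand-in for the real dataframe

with conn, conn.cursor() as cur:
    # The temp table exists only for this transaction, so there is
    # nothing to clean up afterwards.
    cur.execute("CREATE TEMPORARY TABLE small_df_tmp (key integer) ON COMMIT DROP")
    execute_values(
        cur,
        "INSERT INTO small_df_tmp (key) VALUES %s",
        list(small_df.itertuples(index=False, name=None)),
    )
    cur.execute(
        """
        SELECT s.key, b.some_col
        FROM small_df_tmp AS s
        LEFT JOIN big_table AS b ON b.key = s.key
        """
    )
    result = pd.DataFrame(
        cur.fetchall(), columns=[col.name for col in cur.description]
    )
conn.close()
```

Because the temporary table is dropped automatically at commit, the whole round trip is a single transaction with no manual cleanup step.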
