I've got a pandas dataframe that's about 12,000 rows, and a Postgres table that's about 8 million rows.
I'd like to run a left join on the pandas dataframe against the Postgres table.
Three ways I can think of doing this:
1. Load the full Postgres table into a pandas dataframe, then `pd.merge` the two dataframes. Disadvantages: the notebook gets tied up for ages loading 8 million rows, and it takes up lots of memory. (Rough sketch below.)
2. Convert the pandas dataframe into a temporary Postgres table, run the join in Postgres using psycopg2, capture the results to a dataframe, delete the temporary table. Disadvantages: fiddly. (Rough sketch below.)
3. Iterate over the pandas dataframe row by row, querying and writing each row individually. Disadvantages: not very pandas-y, and is this even possible? (Rough sketch below.)
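
Rough sketch of (1), just for concreteness. The connection URL, table name (`big_table`), and join column (`key`) are all placeholders, and `df` stands in for my 12,000-row frame:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost:5432/mydb")  # placeholder URL
df = pd.DataFrame({"key": [1, 2, 3]})  # stand-in for the 12,000-row frame

# Pull all 8 million rows into local memory, then merge in pandas.
big = pd.read_sql_table("big_table", engine)
result = df.merge(big, on="key", how="left")
```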
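
Rough sketch of (2). I've used a SQLAlchemy engine rather than raw psycopg2 here because `to_sql` wants one; the table and column names are again made up:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost:5432/mydb")  # placeholder URL
df = pd.DataFrame({"key": [1, 2, 3]})  # stand-in for the 12,000-row frame

# Ship the small frame to Postgres; ~12,000 rows uploads quickly.
df.to_sql("small_tmp", engine, if_exists="replace", index=False)

# Run the join server-side and pull back only the joined result.
result = pd.read_sql_query(
    "SELECT s.*, b.* FROM small_tmp s LEFT JOIN big_table b ON b.key = s.key",
    engine,
)

# Clean up the temporary table.
with engine.begin() as conn:
    conn.exec_driver_sql("DROP TABLE small_tmp")
```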
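
And a rough sketch of (3), mostly to show what I mean by it; the DSN and names are placeholders. It's one round trip per row, so roughly 12,000 queries, which is part of why it feels wrong:

```python
import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=mydb user=user")  # placeholder DSN
cur = conn.cursor()
df = pd.DataFrame({"key": [1, 2, 3]})  # stand-in for the 12,000-row frame

matches = []
for _, row in df.iterrows():
    cur.execute("SELECT * FROM big_table WHERE key = %s", (row["key"],))
    matches.append(cur.fetchone())  # None when no match, as in a left join

cur.close()
conn.close()
```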
I'm leaning towards (2), but is there any other approach I haven't thought of, or any other advantages/disadvantages I should consider?