
I am adding a single column to a Postgres table with 100+ columns via Django (a new migration). How can I update a column in a PostgreSQL table with the data from a pandas DataFrame? The pseudo-code for the Postgres UPDATE would be:

UPDATE wide_table wt
SET z = df.z
FROM df
WHERE wt.date = 'todays_date'

The reason for doing it this way is that I am computing the column in the DataFrame using a CSV that is in S3 (this is df.z). The docs for Postgres UPDATE are straightforward to use, but I am unsure how to do this via Django, SQLAlchemy, pyodbc, or the like.

I apologize if this is a bit convoluted. A small and incomplete example would be:

Wide Table (pre-update column z)

identifier    |      x       |      y      |      z       |      date       
foo           |      2       |      1      |     0.0      |      ...           
bar           |      2       |      8      |     0.0      |      ...      
baz           |      3       |      7      |     0.0      |      ...      
foo           |      2       |      8      |     0.0      |      ...      
foo           |      1       |      5      |     0.0      |      ...      
baz           |      2       |      8      |     0.0      |      ...      
bar           |      9       |      3      |     0.0      |      ...      
baz           |      2       |      3      |     0.0      |      ...      

Example Python snippet

import pandas as pd

def apply_function(identifier):
    # Maps baz -> 15.0, bar -> 19.6, foo -> 10.0 for a single date
    df = pd.read_csv("s3_file_path/date_file_name.csv")
    # Compute 'z' based on the identifier and the S3 CSV
    return z

postgres_query = "SELECT identifier FROM wide_table"
df = pd.read_sql(sql=postgres_query, con=engine)
df['z'] = df.identifier.apply(apply_function)

# Python / SQL Update Logic here to update Postgres Column
???

Wide Table (post-update column z)

identifier    |      x       |      y      |      z        |      date 
foo           |      2       |      1      |     10.0      |      ...     
bar           |      2       |      8      |     19.6      |      ... 
baz           |      3       |      7      |     15.0      |      ... 
foo           |      2       |      8      |     10.0      |      ... 
foo           |      1       |      5      |     10.0      |      ... 
baz           |      2       |      8      |     15.0      |      ... 
bar           |      9       |      3      |     19.6      |      ... 
baz           |      2       |      3      |     15.0      |      ... 

NOTE: The values in z will change daily, so simply creating another table to hold these z values is not a great solution. I'd also really prefer to avoid deleting all of the data and adding it back.

Scott Skiles

2 Answers


Ran into a similar problem, and the currently accepted solution was too slow for me. My table had 500k+ rows and I needed to update 100k+ of them. After lengthy research and trial and error, I arrived at an efficient and correct solution.

The idea is to use psycopg2 as the writer and to stage the data in a temp table. df is your pandas DataFrame containing the values you want to set.

import psycopg2

conn = psycopg2.connect("dbname='db' user='user' host='localhost' password='test'")
cur = conn.cursor()

# Stage the (id, z) pairs in a temp table that is dropped automatically on commit
rows = zip(df.id, df.z)
cur.execute("""CREATE TEMP TABLE codelist(id INTEGER, z INTEGER) ON COMMIT DROP""")
cur.executemany("""INSERT INTO codelist (id, z) VALUES(%s, %s)""", rows)

# Single set-based UPDATE joining the target table against the temp table
cur.execute("""
    UPDATE table_name
    SET z = codelist.z
    FROM codelist
    WHERE codelist.id = table_name.id;
    """)

print(cur.rowcount)  # number of rows updated
conn.commit()
cur.close()
conn.close()
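
For very large DataFrames (1M+ rows, as asked in the comments below), the executemany insert into the temp table can become the bottleneck, since it issues one INSERT per row. A sketch of a faster load that replaces the executemany step above, assuming the same codelist temp table and an open cursor cur, is to stream the rows in with COPY via psycopg2's copy_from:

import io

# Serialize just the two columns to an in-memory CSV buffer (no header, no index)
buf = io.StringIO()
df[['id', 'z']].to_csv(buf, index=False, header=False)
buf.seek(0)

# COPY loads all rows in one round trip instead of one INSERT per row
cur.copy_from(buf, 'codelist', sep=',', columns=('id', 'z'))

The set-based UPDATE ... FROM codelist then runs unchanged.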
olive_tree
    Nice solution! Do you know why this is faster than my solution above? – Scott Skiles Jul 03 '19 at 15:17
  • For one, in your soln you are touching rows that don't get updated as part of your data prep. In my case, that is a lot of rows. Probably the bigger optimization here is relying on SQL to build the to-be-set data rather than using python. Also your soln keeps reading csv on every fn call. – olive_tree Jul 03 '19 at 18:50
  • Ah thanks. Yes, that `read_csv` in the function is a very silly mistake for sure. I'll be sure to try out your solution if I need this again in the future. – Scott Skiles Jul 03 '19 at 23:54
  • Is this fast for huge amounts of rows, 1M+? If I use executemany, I find it does one UPDATE per row I think, which for me is really slow. Thanks. – Bo Peng Aug 21 '20 at 17:49

I managed to cobble together a solution myself: zip the id and z values, then execute a generic SQL UPDATE statement using UPDATE ... FROM (VALUES ...).

Data Prep

sql_query= "SELECT id, a FROM wide_table"
df = pd.read_sql(sql=sql_query, con=engine)
df['z'] = df.a.apply(apply_function)

zipped_vals = zip(df.id, df.z)
tuple_to_str= str(tuple(zipped_vals))
entries_to_update = tuple_to_str[1:len(tuple_to_str)-1] # remove first and last paren in tuple

SQL Query Solution:

# Update column z by matching id between the SQL table and the pandas DataFrame
update_sql_query = f"""UPDATE wide_table t SET z = v.z
                        FROM (VALUES {entries_to_update}) AS v (id, z)
                        WHERE t.id = v.id;"""

with engine.begin() as conn:
    conn.execute(update_sql_query)

# Re-read to verify the update
pd.read_sql(sql=sql_query, con=engine)
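
If you'd rather not build the VALUES list by hand (string formatting only works cleanly for numeric data and is open to quoting problems), psycopg2's execute_values can interpolate the rows safely. A sketch, assuming a raw psycopg2 connection conn instead of the SQLAlchemy engine:

from psycopg2.extras import execute_values

# execute_values expands the single %s placeholder into a properly escaped,
# multi-row VALUES list, batching rows per statement (page_size, default 100)
with conn.cursor() as cur:
    execute_values(
        cur,
        """UPDATE wide_table AS t SET z = v.z
           FROM (VALUES %s) AS v (id, z)
           WHERE t.id = v.id""",
        [(int(i), float(z)) for i, z in zip(df.id, df.z)],  # native types for psycopg2
    )
conn.commit()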

Answer on updating PostgreSQL table column from values

PostgreSQL UPDATE docs: https://www.postgresql.org/docs/current/sql-update.html

Scott Skiles