I'm new to SQL/Redshift and am looking for the best way to merge two tables.
I have a large(ish) table in Redshift (around 2k columns by 50k rows). The table has a datetime sortkey. I'm iteratively pulling the sortkey and one other column into a python/pandas dataframe, performing some (fairly complicated) operations that generate new columns. I then convert this pandas dataframe to another Redshift table, and I would like to merge it with the original table so that only the new columns are appended (though it wouldn't really matter if the whole thing was updated). Both tables have the same sortkey, so this should be fairly straightforward, right? Just basically adding some new columns? (Please excuse my naivety.)
table1
datetime, rainfall, windspeed, cloudcover
2000-01-01,5,5,5
2000-01-02,7,5,5
2000-01-03,1,5,5
2000-01-04,0,5,5
pandasdf1
datetime, rainfall, rainfall_movingAverage, other_calculation
2000-01-01,5,5,NaN
2000-01-02,7,6,4.56
2000-01-03,1,4.3,7.53
2000-01-04,0,3.75,3.55
Here is what I would like to achieve:
datetime, rainfall, windspeed, cloudcover,rainfall_movingAverage, other_calculation
2000-01-01,5,5,5,5,NaN
2000-01-02,7,5,5,6,4.56
2000-01-03,1,5,5,4.3,7.53
2000-01-04,0,5,5,3.75,3.55
In pandas this is very straightforward and can be done in several ways; one way would be:
result = pd.concat([table1, pandasdf1], axis=1, sort=True)
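For what it's worth, a key-based merge may be more robust here than a positional concat, since it joins on the shared datetime column rather than on row position. A minimal sketch, assuming datetime is a regular column in both frames (the overlapping rainfall column is dropped from the right-hand frame to avoid duplication):

    import pandas as pd

    # join on the shared datetime key; drop the duplicated rainfall column first
    result = table1.merge(
        pandasdf1.drop(columns=['rainfall']),
        on='datetime',
        how='left',
    )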
However, the size of the dataframes is making pandas crash, and the data will get much larger than it already is, so I've had to migrate it to Redshift. I've tried to merge using this query:
SELECT * FROM table1
FULL OUTER JOIN pandasdf ON (table1.datetime = pandasdf.datetime)
This seems to work (it doesn't crash or return an error, at least), but the original table is not updated. I can't seem to find the additional syntax to actually update the original table. Just to note, I am using psycopg2 from python to interact with Redshift:
import psycopg2

SQL = """
SELECT * FROM table1
FULL OUTER JOIN pandasdf ON (table1.datetime = pandasdf.datetime)
"""

def merge_redshift_tables(SQL):
    """Merge the left and right tables."""
    success = False
    conn = None
    try:
        # establish a connection to Redshift
        conn = psycopg2.connect(dbname='mydb', host='myIP', port='myport',
                                user='myusername', password='mypassword')
        # make a cursor object
        cur = conn.cursor()
        cur.execute(SQL)
        success = True
    except psycopg2.Error as e:
        print(e)
    finally:
        if conn is not None:
            conn.close()
    return success
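I suspect part of the issue is that a bare SELECT doesn't modify anything server-side; it only produces a result set, which the code above never reads. Here is a minimal sketch of pulling the joined rows back just to inspect what the join produces (same placeholder credentials as above):

    import pandas as pd
    import psycopg2

    conn = psycopg2.connect(dbname='mydb', host='myIP', port='myport',
                            user='myusername', password='mypassword')
    try:
        # read the joined result set into pandas to see what the join returns
        joined = pd.read_sql(SQL, conn)
        print(joined.head())
    finally:
        conn.close()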
If anyone can help me get this step to work, that would be a great first step. However, I'm not sure this is the best way to do this kind of operation for each of the ~2000 columns, so if anyone could share some wisdom about best practice for that, I would also really appreciate it. I had roughly planned to distribute the work across a number of compute nodes working in parallel (a rough sketch of that plan follows), but it depends on Redshift's ability to merge all of these new columns smoothly (which I'm aware may be an issue). Any advice on best practice in this area is very welcome.
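For context, here is a rough sketch of that plan; fetch_column and the moving-average computation are hypothetical stand-ins for my actual operations, and the credentials are placeholders as above:

    from concurrent.futures import ProcessPoolExecutor

    import pandas as pd
    import psycopg2

    def fetch_column(col):
        """Hypothetical helper: pull the sortkey plus one raw column."""
        conn = psycopg2.connect(dbname='mydb', host='myIP', port='myport',
                                user='myusername', password='mypassword')
        try:
            return pd.read_sql(
                'SELECT datetime, {} FROM table1'.format(col), conn
            ).set_index('datetime')
        finally:
            conn.close()

    def derive(col):
        """Worker: compute a derived series for one column (stand-in calculation)."""
        df = fetch_column(col)
        out = pd.DataFrame(index=df.index)
        out[col + '_movingAverage'] = df[col].rolling(3).mean()
        return out

    if __name__ == '__main__':
        cols = ['rainfall', 'windspeed', 'cloudcover']  # ...up to ~2000 columns
        # one worker per column; concatenate the narrow results on the shared sortkey
        with ProcessPoolExecutor() as pool:
            derived = pd.concat(pool.map(derive, cols), axis=1)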
Many thanks
EDIT

The following seems to run with no errors, suggesting that a new table with the desired columns is created successfully:
SELECT t1.*, t2.new_col
INTO TABLE combined FROM table1 t1
LEFT JOIN pandasdf1 t2 ON t1.datetime = t2.datetime;
However, when I query it, an error is returned suggesting there is no new table:
import pandas as pd
import psycopg2

def get_col(table, col='*'):
    """Get all data from a column from a table."""
    coldata = None
    conn = None
    try:
        # establish a connection to Redshift
        conn = psycopg2.connect(dbname='mydb', host='myIP', port='myport',
                                user='myusername', password='mypassword')
        coldata = (pd.read_sql("select {} FROM {}".format(col, table), conn)
                   .set_index('gmt_reportedtime')
                   .dropna())
    except psycopg2.Error as e:
        print(e)
    finally:
        if conn is not None:
            conn.close()
    return coldata
check = get_col('combined')
returns:
pandas.io.sql.DatabaseError: Execution failed on sql 'select * FROM combined': relation "combined" does not exist
EDIT
I've now fixed this! Within the python statement, the changes needed to be committed:
conn.commit()
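For completeness, here is a minimal sketch of the corrected execute path (same placeholder credentials as above), with the commit in place:

    import psycopg2

    def run_statement(sql):
        """Execute a statement against Redshift and persist its effects."""
        conn = None
        try:
            conn = psycopg2.connect(dbname='mydb', host='myIP', port='myport',
                                    user='myusername', password='mypassword')
            cur = conn.cursor()
            cur.execute(sql)
            conn.commit()  # without this, the new table is rolled back when conn closes
            return True
        except psycopg2.Error as e:
            print(e)
            return False
        finally:
            if conn is not None:
                conn.close()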