
I'm new to SQL/Redshift and am looking for the best way to merge two tables.

I have a large(ish) table in Redshift (around 2k columns by 50k rows). The table has a datetime sortkey. I'm iteratively pulling the sortkey and one other column into a Python/pandas dataframe and performing some (fairly complicated) operations that generate new columns. I then convert this pandas dataframe to another Redshift table, and I would like to merge it with the original table so that only the new columns are appended (though it wouldn't really matter if the whole thing was updated). Both tables have the same sortkey, so this should be fairly straightforward, right? Just basically adding some new columns? (Please excuse my naivety.)

table1

datetime, rainfall, windspeed, cloudcover
2000-01-01,5,5,5
2000-01-02,7,5,5
2000-01-03,1,5,5
2000-01-04,0,5,5

pandasdf1

datetime, rainfall, rainfall_movingAverage, other_calculation
2000-01-01,5,5,NaN
2000-01-02,7,6,4.56
2000-01-03,1,4.3,7.53
2000-01-04,0,3.75,3.55

Here is what I would like to achieve:

datetime, rainfall, windspeed, cloudcover,rainfall_movingAverage, other_calculation
2000-01-01,5,5,5,5,NaN
2000-01-02,7,5,5,6,4.56
2000-01-03,1,5,5,4.3,7.53
2000-01-04,0,5,5,3.75,3.55

In pandas this is very straightforward and can be done in several ways, one way to do it would be:

result = pd.concat([table1, pandasdf1], axis=1, sort=True)
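Another way, joining on the shared `datetime` key rather than on positional index (a minimal sketch using the sample data above; the `drop` of the duplicated `rainfall` column is just to avoid suffixed duplicates):

```python
import pandas as pd

table1 = pd.DataFrame({
    "datetime": ["2000-01-01", "2000-01-02"],
    "rainfall": [5, 7], "windspeed": [5, 5], "cloudcover": [5, 5]})
pandasdf1 = pd.DataFrame({
    "datetime": ["2000-01-01", "2000-01-02"],
    "rainfall": [5, 7],
    "rainfall_movingAverage": [5.0, 6.0],
    "other_calculation": [float("nan"), 4.56]})

# Merge on the shared key; drop the column both frames share
# so only the genuinely new columns are appended.
result = table1.merge(pandasdf1.drop(columns="rainfall"),
                      on="datetime", how="left")
print(list(result.columns))
```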

However, the size of the dataframes is making pandas crash, and the data will get much larger than it already is, so I've had to migrate it to Redshift. I've tried to merge using this command:

SELECT * FROM table1
FULL OUTER JOIN pandasdf ON (table1.datetime = pandasdf.datetime)

This seems to work (at least it doesn't crash or return an error), but the original table is not updated. I can't seem to find the additional syntax to update the original table. Of note: I am using a Python SQL engine to interact with Redshift.

import psycopg2

SQL = """
SELECT * FROM table1
FULL OUTER JOIN pandasdf ON (table1.datetime = pandasdf.datetime)
"""

def merge_redshift_tables(SQL):
    """Merge the left and right tables."""
    success = False
    conn = None
    try:
        # Establish a connection to Redshift
        conn = psycopg2.connect(dbname='mydb', host='myIP', port='myport',
                                user='myusername', password='mypassword')
        # Make a cursor object and run the query
        cur = conn.cursor()
        cur.execute(SQL)
        success = True
    except psycopg2.Error as e:
        print(e)
    finally:
        if conn is not None:
            conn.close()

    return success

If anyone can help me get this step working, that would be a great first step. However, I'm not sure this is the best way to do this kind of operation for each of the ~2000 columns, so if anyone could share some wisdom about best practice for that, I would also really appreciate it. I had roughly planned to distribute the work across a number of compute nodes working in parallel; however, that depends on Redshift's ability to merge all of these new columns smoothly (which I'm aware may be an issue). Any advice on best practice in this area is very welcome.

Many thanks

##### EDIT

The following seems to run with no errors suggesting that a new table with the desired columns is created successfully:

SELECT t1.*, t2.new_col
INTO TABLE combined FROM table1 t1
LEFT JOIN pandasdf1 t2 ON t1.datetime = t2.datetime;

However, when I query it, I get an error suggesting there is no new table:

import pandas as pd
import psycopg2

def get_col(table, col='*'):
    """Get all data from a column of a table."""
    coldata = None
    conn = None
    try:
        # Establish a connection to Redshift
        conn = psycopg2.connect(dbname='mydb', host='myIP', port='myport',
                                user='myusername', password='mypassword')
        coldata = pd.read_sql("select {} FROM {}".format(col, table), conn).set_index('gmt_reportedtime').dropna()
    except psycopg2.Error as e:
        print(e)
    finally:
        if conn is not None:
            conn.close()
    return coldata

check = get_col('combined')

returns:

pandas.io.sql.DatabaseError: Execution failed on sql 'select * FROM combined': relation "combined" does not exist
##### EDIT

I've now fixed this! Within the Python code, the changes needed to be committed:

conn.commit()
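Since psycopg2 follows the DB-API 2.0 interface, the commit rule can be illustrated with the standard library's `sqlite3` module (same interface, no Redshift cluster needed); the behaviour with psycopg2 is analogous. Closing a connection without committing discards pending changes:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# Without commit: the INSERT is rolled back when the connection closes.
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE t (x INTEGER)")
conn.commit()                      # persist the table itself
conn.execute("INSERT INTO t VALUES (1)")
conn.close()                       # no commit -> insert is lost

conn = sqlite3.connect(path)
rows_without_commit = conn.execute("SELECT count(*) FROM t").fetchone()[0]
conn.close()

# With commit: the INSERT survives.
conn = sqlite3.connect(path)
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()
conn.close()

conn = sqlite3.connect(path)
rows_with_commit = conn.execute("SELECT count(*) FROM t").fetchone()[0]
conn.close()

print(rows_without_commit, rows_with_commit)
```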
  • why so many columns? can / should you pivot cols to rows? – Jon Scott Mar 05 '19 at 16:02
  • It's just the size of the data. Just to give some background, I'm more used to working with dataframes than databases; I'm getting a sense that they are fundamentally different from each other? – user3062260 Mar 05 '19 at 16:29
  • not fundamentally, but it is usual to normalise tables especially when you have col1 col2 col3 col4 etc - you would create a new column and pivot – Jon Scott Mar 05 '19 at 16:54
  • can you explain a bit what you mean by normalise? – user3062260 Mar 05 '19 at 17:18
  • can you share the names of the cols on your table? and i can use that as an example. – Jon Scott Mar 05 '19 at 17:45
  • Maybe just use the example data I posted, if that's possible? The header names are not very different - or did you mean the data? It's just rainfall in mm for each day, windspeed is just average windspeed for the day, etc. – user3062260 Mar 06 '19 at 11:46
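The wide-to-long reshape ("pivot cols to rows") suggested in the comments above can be sketched in pandas with `melt`; this is a hypothetical illustration on the sample data, not the asker's actual schema:

```python
import pandas as pd

wide = pd.DataFrame({
    "datetime": ["2000-01-01", "2000-01-02"],
    "rainfall": [5, 7], "windspeed": [5, 5], "cloudcover": [5, 5]})

# One row per (datetime, measurement) pair instead of one column
# per measurement: new columns become new rows, not schema changes.
tidy = wide.melt(id_vars="datetime",
                 var_name="measurement", value_name="value")
print(len(tidy))
```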

1 Answer


You say that "the original table is not updated" with this command:

SELECT * FROM table1
FULL OUTER JOIN pandasdf ON (table1.datetime = pandasdf.datetime)

A SELECT command in SQL returns data. It does not update data.

If you wish to create a new, combined table, you could use:

SELECT *
INTO TABLE combined
FROM table1
JOIN pandasdf ON (table1.datetime = pandasdf.datetime)

See: SELECT INTO - Amazon Redshift

You need to use a new table because the "original" table1 is only defined as having the original 4 columns. While you could modify the table, add the columns and then run an UPDATE command, creating a new table is a much better idea (and more efficient for Amazon Redshift).
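The same create-a-new-table pattern can be demonstrated end to end with the stdlib `sqlite3` module as a stand-in (no cluster needed); the SQL shape carries over, since Redshift also supports `CREATE TABLE ... AS SELECT`. This is a sketch on the question's sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (datetime TEXT, rainfall REAL, windspeed REAL, cloudcover REAL);
CREATE TABLE pandasdf1 (datetime TEXT, rainfall REAL, rainfall_movingAverage REAL, other_calculation REAL);
INSERT INTO table1 VALUES ('2000-01-01',5,5,5),('2000-01-02',7,5,5);
INSERT INTO pandasdf1 VALUES ('2000-01-01',5,5,NULL),('2000-01-02',7,6,4.56);

-- The join result lands in a brand-new table; the originals are untouched.
CREATE TABLE combined AS
SELECT t1.*, t2.rainfall_movingAverage, t2.other_calculation
FROM table1 t1
LEFT JOIN pandasdf1 t2 ON t1.datetime = t2.datetime;
""")
conn.commit()  # don't forget to commit before reading from elsewhere

rows = conn.execute("SELECT * FROM combined ORDER BY datetime").fetchall()
print(rows)
conn.close()
```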

John Rotenstein
  • Thanks for your help, this seems like what I need to do, but it's giving me an error: pandas.io.sql.DatabaseError: Execution failed on sql 'select * FROM combined': relation "combined" does not exist. Is there a way to create the table dynamically at the same time as populating it? – user3062260 Mar 06 '19 at 11:42
  • the above error occurs when I try and get the 'combined' table back. The above command returns this error: psycopg2.ProgrammingError: relation "combined" does not exist – user3062260 Mar 06 '19 at 12:02
  • Are you committing or auto-committing your changes? See: [Python psycopg2 not inserting into postgresql table](https://stackoverflow.com/a/18237335/174777) – John Rotenstein Mar 06 '19 at 21:34
  • Thanks for your help again; there was no commit statement in the Python code, which was causing the error. Either commit or autocommit makes it work. – user3062260 Mar 08 '19 at 09:57