
I'm having trouble with a Python Teradata (tdodbc) query: I need to loop the same query over different variables and merge the results. I received good direction in another post and ended up here. My issue now is that the dataframe only contains the query results for the final variable in the loop, "state5". Unfortunately we have 5 states, each in its own database with the same schema. I can run the query for one state, but I want to loop over the variables so I can run it for all 5 states and return the appended results. This was easy using SAS macro variables (%macro/%mend), but I need to bring the data into Python for EDA and data science.

import teradata as td  # tdodbc is the ODBC backend used by this module
import pandas as pd

udaExec = td.UdaExec(appConfigFile="udaexec.ini")
with udaExec.connect("${dataSourceName}") as session:


    state_dataframes = []
    STATES = ["state1", "state2", "state3", "state4", "state5"]

    for state in STATES:

    query1 = """database my_db_{};"""

    query2 = """      
        select top 10
        '{}' as state
        ,a.*
        from table_a
        """

    session.execute(query1.format(state))
    session.execute(query2.format(state))

    state_dataframes.append(pd.read_sql(query2, session))
    all_states_df = pd.concat(state_dataframes)
    Seems like instead of doing a separate `session.execute(query2.format(state))` you'd want to put the `.format(state)` inside the `read_sql`. Also indent the `for` loop body down to the `append` and leave the `concat` at the same level as `for`. You could also eliminate the list of dataframes and just `append` each `read_sql` to a result dataframe. – Fred Feb 21 '20 at 16:42
  • Thank you Fred! I'm a beginner in pandas and Python. Can you show me the code for the last recommendation, eliminating the list of dataframes? – CGermain Feb 21 '20 at 16:46
  • I'm also running into an issue if I use volatile tables in the loop: it says the volatile table is already in use. It would seem like I need to close the connection after each iteration? – CGermain Feb 21 '20 at 18:00
  • You don't have to close the connection; you can explicitly DROP a volatile table or just DELETE the contents for each iteration. Or perhaps better, you could also accumulate data in a volatile table and then do one read_sql to get it into a dataframe; that would probably be fastest overall. It occurs to me now that DataFrame.append does not happen in-place, so the state_dataframes list with one append at the end is probably better. But for completeness: the idea was to start with an empty dataframe and have each iteration do `df = df.append(pd.read_sql(query2.format(state), session))`. – Fred Feb 22 '20 at 23:32
  • Thanks @Fred. I was having issues with dropping the tables and receiving an ET or Null after DDL error. I was able to get it fixed and will post the answer. – CGermain Feb 23 '20 at 00:54
  • @Fred, you *do* want to use list of dataframes. What you recommend is the ill-advised [quadratic copy](https://stackoverflow.com/a/36489724/1422451) running `DataFrame.append` in a `for` loop. Or better is to use one `read_sql` call. – Parfait Feb 23 '20 at 18:57
  • @Parfait - agreed. The list is much better than having DataFrame.append inside the loop. And creating the full result set in the database and using one `read_sql` would be better still (see the sketch after these comments). – Fred Feb 23 '20 at 23:30
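For illustration, here is a minimal sketch of the approach Fred describes: accumulate all five states into one volatile table inside the loop, then pull everything into a dataframe with a single `read_sql`. This is untested; the accumulator table `all_states` and the fully qualified `my_db_<state>.table_a` references are my assumptions (equivalent to the question's `database` statement plus unqualified names), and `table_a` is the question's placeholder.

import teradata as td
import pandas as pd

udaExec = td.UdaExec(appConfigFile="udaexec.ini")

with udaExec.connect("${dataSourceName}") as session:
    STATES = ["state1", "state2", "state3", "state4", "state5"]

    # Hypothetical accumulator: same shape as table_a plus a state column,
    # created empty from the first state's schema.
    session.execute("""
        create volatile table all_states as
        (
        select cast('' as varchar(10)) as state
        ,a.*
        from my_db_state1.table_a a
        )
        with no data
        on commit preserve rows;
        """)

    for state in STATES:
        # INSERT/SELECT each state's rows into the accumulator.
        session.execute("""
            insert into all_states
            select top 10
            '{0}' as state
            ,a.*
            from my_db_{0}.table_a a;
            """.format(state))

    # One read_sql call brings the whole result set over at once.
    all_states_df = pd.read_sql("select * from all_states", session)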

1 Answer


I was finally able to get this to work, although it may not be the most elegant way to do it. I did try to run the drop tables as a single variable, "query5", but was receiving a DDL error. Once I separated each drop table into its own `session.execute`, it worked.

import teradata as td
import pandas as pd

udaExec = td.UdaExec(appConfigFile="udaexec.ini")

with udaExec.connect("${dataSourceName}") as session:

    state_dataframes = []
    STATES = ["state1", "state2", "state3", "state4", "state5"]

    for state in STATES:

        # Switch the session to this state's database.
        query1 = """database my_db_{};"""

        # Stage this state's rows in a volatile table; with limited spool
        # space, breaking the query apart like this keeps it runnable.
        query2 = """
        create set volatile table v_table
        ,no fallback, no before journal, no after journal as
        (
        select top 10
        '{}' as state
        ,t.*
        from table t
        )
        with data
        primary index (dw_key)
        on commit preserve rows;
        """

        # Second staging step, copying the first volatile table.
        query3 = """
        create set volatile table v_table_2
        ,no fallback, no before journal, no after journal as
        (
        select t.*
        from v_table t
        )
        with data
        primary index (dw_key)
        on commit preserve rows;
        """

        # Final select that read_sql will execute.
        query4 = """
        select t.*
        from v_table_2 t
        """

        session.execute(query1.format(state))
        session.execute(query2.format(state))
        session.execute(query3)
        state_dataframes.append(pd.read_sql(query4, session))
        # Each DROP must run in its own execute; combining them in one
        # statement string raised a DDL error.
        session.execute("DROP TABLE v_table")
        session.execute("DROP TABLE v_table_2")

    all_states_df = pd.concat(state_dataframes)

Edit for clarity: correcting the query in the question only required proper indentation; a minimal version is sketched below. In my Teradata environment I have limited spool space, which requires building many volatile tables to break queries apart. Since I spent a good amount of time solving this, I added the volatile-table approach to the answer to help others who may run into this scenario.
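For reference, an untested sketch of that indentation-only fix, following Fred's comments on the question: the `.format(state)` moves inside `read_sql`, the loop body is indented under the `for`, and the `concat` runs once after the loop. The database and table names are the question's placeholders.

import teradata as td
import pandas as pd

udaExec = td.UdaExec(appConfigFile="udaexec.ini")

with udaExec.connect("${dataSourceName}") as session:
    state_dataframes = []
    STATES = ["state1", "state2", "state3", "state4", "state5"]

    query1 = """database my_db_{};"""
    query2 = """
        select top 10
        '{}' as state
        ,a.*
        from table_a a
        """

    for state in STATES:
        # Everything that depends on the current state runs inside the loop.
        session.execute(query1.format(state))
        state_dataframes.append(pd.read_sql(query2.format(state), session))

    # Combine once, at the same level as the for loop.
    all_states_df = pd.concat(state_dataframes)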

  • OK. Yes, DROP has to be executed on its own. You don't need to `session.execute(query4)` since it is being run as part of the `read_sql` anyway. And in this (possibly "sanitized") example there's no need to copy one Volatile table to another with a different Primary index (a simplified sketch follows these comments). – Fred Feb 23 '20 at 18:42
  • I'm not familiar with Teradata, but please explain to current and future readers the need to use volatile tables. Wouldn't proper indentation of the `for` loop resolve your original question above? – Parfait Feb 23 '20 at 19:04
  • You're correct. I added vol tables as some users may have spool space limitations and need to break up queries in teradata. I spent a good amount of time solving for this so thought it would be helpful to others. The first query will work with proper indentation. I will edit the answer for that clarity. – CGermain Feb 23 '20 at 19:19
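Following up on Fred's comment, here is a simplified, untested sketch of the answer's loop body: one volatile table per iteration, read directly by `read_sql`, then dropped. `table_a`, `dw_key`, and the database names are placeholders carried over from the question and answer; this assumes the surrounding `with` block and `state_dataframes` list from the answer.

    for state in STATES:
        session.execute("database my_db_{};".format(state))
        # Stage the state's rows once; no second copy is needed.
        session.execute("""
            create set volatile table v_table
            ,no fallback, no before journal, no after journal as
            (
            select top 10
            '{}' as state
            ,t.*
            from table_a t
            )
            with data
            primary index (dw_key)
            on commit preserve rows;
            """.format(state))
        # read_sql runs the SELECT itself; no separate execute is needed.
        state_dataframes.append(pd.read_sql("select * from v_table", session))
        session.execute("DROP TABLE v_table")

    all_states_df = pd.concat(state_dataframes)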