1

I am trying to parameterize some parts of a SQL Query using the below dictionary:

query_params = dict(
        {'target':'status',
         'date_from':'201712',
         'date_to':'201805',
         'drform_target':'NPA'
      })

sql_data_sample = str("""select *
                                 from table_name
                                     where dt = %(date_to)s
                                     and %(target)s in (%(drform_target)s)

                        ----------------------------------------------------
                        union all
                        ----------------------------------------------------

                        (select *,
                                 from table_name
                                     where dt  = %(date_from)s
                                     and %(target)s in ('ACT')
                                     order by random() limit 50000);""")

df_data_sample = pd.read_sql(sql_data_sample,con = cnxn,params = query_params)

However this returns a dataframe with no records at all. I am not sure what the error is since no error is being thrown.

df_data_sample.shape
Out[7]: (0, 1211)

The final PostgreSql query would be:

select *
        from table_name
            where dt = '201805'
            and status in ('NPA')

----------------------------------------------------
union all
----------------------------------------------------
(select *
        from table_name
            where dt  = '201712'
            and status in ('ACT')
            order by random() limit 50000);-- This part of random() is only for running it on my local and not on server.

Below is a small sample of data for replication. The original data has more than a million records and 1211 columns

service_change_3m   service_change_6m   dt  grp_m2          status
0                   -2                  201805  $50-$75     NPA
0                    0                  201805  < $25       NPA
0                   -1                  201805  $175-$200   ACT
0                    0                  201712  $150-$175   ACT
0                    0                  201712  $125-$150   ACT
-1                   1                  201805  $50-$75     NPA

Can someone please help me with this?

UPDATE: Based on suggestion by @shmee.. I am finally using :

target = 'status'
query_params = dict(
        {
         'date_from':'201712',
         'date_to':'201805',
         'drform_target':'NPA'
      })

sql_data_sample = str("""select *
                                 from table_name
                                     where dt = %(date_to)s
                                     and {0} in (%(drform_target)s)

                        ----------------------------------------------------
                        union all
                        ----------------------------------------------------

                        (select *,
                                 from table_name
                                     where dt  = %(date_from)s
                                     and {0} in ('ACT')
                                     order by random() limit 50000);""").format(target)

df_data_sample = pd.read_sql(sql_data_sample,con = cnxn,params = query_params)
Shuvayan Das
  • 1,198
  • 3
  • 20
  • 40
  • your queries are different. the first query has `status in ('ACTIVE')` and your second query has `status in ('ACT')` – aydow Jul 18 '18 at 05:14
  • Hello @aydow.. Sorry, typo.. Updated the question, thanks!! – Shuvayan Das Jul 18 '18 at 05:17
  • no worries. does the query work with a literal string using `pandas`? – aydow Jul 18 '18 at 05:22
  • nope. It is the same empty df even if I enclose the values in the dict in " ". – Shuvayan Das Jul 18 '18 at 05:33
  • i mean have you tried using `select * from table_name where dt = '201805' and status in ('NPA') ---------------------------------------------------- union all ---------------------------------------------------- (select * from table_name where dt = '201712' and status in ('ACT') order by random() limit 50000)` as `sql_data_sample` and not included the `query_params`? – aydow Jul 18 '18 at 05:37
  • oh.. my bad.. yes that is working.. But the table name and dates and target can change and hence there is a need to parameterize these and hence I am trying this approach of passing these through a dict. – Shuvayan Das Jul 18 '18 at 05:39
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/176212/discussion-between-aydow-and-shuvayan-das). – aydow Jul 18 '18 at 05:47
  • The table name for your SELECTs is not actually `table_name`, is it? The other question you pinged me on suggests that it rather is `public.churn_data`. Typo here on SO or do you actually have that in your code? – shmee Jul 18 '18 at 08:04
  • Hello @shmee.. Yes the table_name is just a placeholder for whatever table is to be used there.so it will be public.churn_data in place of table_name. – Shuvayan Das Jul 18 '18 at 08:16
  • Well, this is quite weird. The fact that your dataframe `df_data_sample` has a size matching the number of columns in your table suggests that the query was executed and the cursor does have a description. That would mean that the actual query string is not syntactically broken, but logically. – shmee Jul 18 '18 at 09:04
  • Oh! Wait ...`and %(target)s in ('ACT')` ... You try to set a column name using a parameter! As mentioned in the other question, that does not work. The query now checks if the literal `'status'` is in `('ACT')` resp `('NPA')`. This is always false, hence your result is empty. Try preformatting your query using the `psycopg2.sql` objects `SQL` and `Identifier` as outlined in my EDIT2 in [the other answer](https://stackoverflow.com/a/51374267/4134674). – shmee Jul 18 '18 at 09:09
  • @shmee.. Update the question with current solution which I am using. The column name has to be a parameter since it can change based on the dataset being used so I removed it from the dict and placed it under the variable. – Shuvayan Das Jul 18 '18 at 09:42

1 Answers1

2

Yes, I am quite confident that your issue results from trying to set column names in your query via parameter binding (and %(target)s in ('ACT')) as mentioned in the comments.

This results in your query restricting the result set to records where 'status' in ('ACT') (i.e. Is the string 'status' an element of a list containing only the string 'ACT'?). This is, of course, false, hence no record gets selected and you get an empty result.

This should work as expected:

import psycopg2.sql

col_name = 'status'
table_name = 'public.churn_data'
query_params = {'date_from':'201712',
                'date_to':'201805',
                'drform_target':'NPA'
               }

sql_data_sample = """select * 
                     from {0} 
                     where dt = %(date_to)s 
                     and {1} in (%(drform_target)s)
                     ----------------------------------------------------
                     union all
                     ----------------------------------------------------
                     (select * 
                      from {0} 
                      where dt  = %(date_from)s 
                      and {1} in ('ACT') 
                      order by random() limit 50000);"""

sql_data_sample = sql.SQL(sql_data_sample).format(sql.Identifier(table_name), 
                                                  sql.Identifier(col_name))

df_data_sample = pd.read_sql(sql_data_sample,con = cnxn,params = query_params)
shmee
  • 4,721
  • 2
  • 18
  • 27