1

I am working on a machine learning project with big data from a hadoop based data warehouse. Therefore, I am using lots of different SQL statements during the day. Sometimes I am modifying the data according to some features and afterwards, I need to feed those variables of dataframe into SQL statement to collect other results from database.

Until now, I've done research on the web but in all related answers you have to indicate each variable as ? or %s in the query. But that method is not feasible for me as sometimes I need to feed more than hundreds of variables so I can not be like

select * from table where id in (?,?,?,.....,?)

My construction to run SQL queries is as follows:

from impala.dbapi import connect
from impala.util import as_pandas
conn = connect(host='0.0.0.0', port=0, auth_mechanism="blank")  
impala_cursor = conn.cursor()

And to run the query

sql= '''SELECT name, id

        FROM table

        WHERE id in (---HERE COMES THE VARIABLES FROM PANDAS DATAFRAME---)
        '''
impala_cursor.execute(sql)
df_var = as_pandas(impala_cursor)

Any help is welcome. Thanks in advance!

Hichem BOUSSETTA
  • 1,791
  • 1
  • 21
  • 27
naïveRSA
  • 113
  • 7
  • 1
    You definitely should use a parametrized query. You can build the raw query string in advance, based on the length of the pd sequence, using the appropriate palaceholder for your driver's paramstyle. Then you pass your pd sequence as parameters to `execute()`. Eases the pain with the correct quoting of the values and adds protection agains SQL injection, although that might not be important if you handle data from trusted sources only. – shmee May 20 '19 at 11:09

0 Answers0