
New to Python data science.

I have a SQL Server extract: I connect via `pyodbc.connect` and read the data with `pd.read_sql(...SQL query...)` from SQL Server.

My intention is to use a list or vector (examples below) in the SQL query's WHERE condition. How do I do that? It helps us avoid fetching millions of rows into memory.

I would like to know how to pass both a number list and a string list (the two have different use cases).

1st WHERE condition (string):

raw_data2 = {'age1': ['ten','twenty']}
df2 = pd.DataFrame(raw_data2, columns = ['age1'])

2nd WHERE condition (number):

raw_data2 = {'age_num': [10,20,30]}
df3 = pd.DataFrame(raw_data2, columns = ['age_num'])

Thank you for your help; this will reduce our fetch time by 80%.

Murali

1 Answer


Consider using pandas' `read_sql` and passing parameters to avoid type handling. Additionally, save everything in a dictionary of dataframes with keys corresponding to the original raw_data keys, to avoid flooding the global environment with many separate dataframes:

import pandas as pd
# conn: an open pyodbc connection (created earlier with pyodbc.connect)

raw_data = {'age1': ['ten', 'twenty'],
            'age_num': [10, 20, 30]}

df_dict = {}
for k, v in raw_data.items():
    # BUILD PREPARED STATEMENT WITH PARAM PLACEHOLDERS
    where = '{col} IN ({prm})'.format(col=k, prm=", ".join(['?' for _ in v]))
    sql = 'SELECT * FROM mytable WHERE {}'.format(where)
    print(sql)

    # IMPORT INTO DATAFRAME
    df_dict[k] = pd.read_sql(sql, conn, params=v)

# OUTPUT TOP ROWS OF EACH DF ELEM
df_dict['age1'].head()
df_dict['age_num'].head()
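To make the placeholder construction concrete, here is that step in isolation (no database needed; `mytable` and the column name follow the example above):

```python
# Build one "?" placeholder per value, so pyodbc binds each list item safely
params = [10, 20, 30]
placeholders = ", ".join(['?' for _ in params])   # "?, ?, ?"
sql = 'SELECT * FROM mytable WHERE {col} IN ({prm})'.format(col='age_num',
                                                            prm=placeholders)
print(sql)  # SELECT * FROM mytable WHERE age_num IN (?, ?, ?)
```

The number of `?` placeholders always matches `len(params)`, which is exactly what `read_sql(..., params=v)` expects.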

For separate dataframe objects:

def build_query(my_dict):
    # NOTE: intended for a single-key dict; with multiple keys,
    # only the query for the last key is returned
    for k, v in my_dict.items():
        # BUILD PREPARED STATEMENT WITH PARAM PLACEHOLDERS IN WHERE CLAUSE
        where = '{col} IN ({prm})'.format(col=k, prm=", ".join(['?' for _ in v]))
        sql = 'SELECT * FROM mytable WHERE {}'.format(where)

    return sql

raw_data2 = {'age1': ['ten','twenty']}
# ASSIGNS QUERY
sql = build_query(raw_data2)
# IMPORT TO DATAFRAME PASSING PARAM VALUES
df2 = pd.read_sql(sql, conn, params = raw_data2['age1'])

raw_data3 = {'age_num': [10,20,30]}
# ASSIGNS QUERY
sql = build_query(raw_data3)
# IMPORT TO DATAFRAME PASSING PARAM VALUES
df3 = pd.read_sql(sql, conn, params = raw_data3['age_num'])
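If the values already live in a dataframe column rather than a hard-coded dict, pass the column as a plain Python list. A sketch, assuming a hypothetical dataframe `site_df` with a `site_seqno` column (names illustrative, not from the original code):

```python
import pandas as pd

# Hypothetical dataframe holding the filter values
site_df = pd.DataFrame({'site_seqno': [98, 200, 113, 64]})

values = site_df['site_seqno'].tolist()   # convert Series to a list for pyodbc
where = 'site_seqno IN ({})'.format(", ".join(['?' for _ in values]))
sql = 'SELECT * FROM mytable WHERE {}'.format(where)
print(sql)  # SELECT * FROM mytable WHERE site_seqno IN (?, ?, ?, ?)

# df = pd.read_sql(sql, conn, params=values)   # conn: open pyodbc connection
```

No comma-joining of the raw values is needed; parameterization handles quoting and types for both strings and numbers.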
Parfait
  • This looks complicated. I am new and can you provide step by step process? 1st I would like to create a string to pass in where ....example: 10,20,30 etc for numbers. Other function for string....'ten', 'twenty', etc....then 2nd step...I like to pass that sring in where condition – Murali Apr 06 '18 at 21:49
  • The code comments indicates the steps. We are building a parameterized SQL query that is passed into `read_sql` with placeholders, `?`, equal to the number of items in list. We do not have to worry about strings/numbers with [parameterization](https://stackoverflow.com/questions/775296/python-mysql-parameterized-queries/775399)! Also, the loop print out the query: `print(sql)`. – Parfait Apr 07 '18 at 02:35
  • Please try solution again as I used wrong placeholder. Pyodbc uses `?` Report any errors. I also edit to show how you build separate dfs like you originally had. Of course adjust *mytable* in SQL. – Parfait Apr 07 '18 at 02:43
  • Seems like it is working. Let me do one more test...Thank you – Murali Apr 07 '18 at 03:32
  • It is working. In real life, I have object with all those values in dataframe. I don't need to hard code. However, how I make thoses with commas? (like 98,100 etc). This is my object 0 98 1 200 2 113 3 98 4 113 5 64 6 113 7 30 8 113 9 281 Name: site_seqno, dtype: int64 – Murali Apr 07 '18 at 04:16
  • No worry...I figured out. All is good. It is working now. Thank you so much and I learned lot eventhough I am new. – Murali Apr 07 '18 at 05:53