
My question is about ETL best practices and how to make this process simpler and more efficient.

I have some data that I pull from an API and then transform. Let's call the result "df_upc" and assume it's a Pandas dataframe of UPCs; I'll omit the earlier transformation steps since they aren't relevant here. This dataframe is pushed to SQL Server via Pandas' to_sql method, creating/replacing a table, and then a SELECT statement in SQL joins that table with some other data to get current inventory levels. The SQL Server table that holds the inventory levels contains millions of SKUs.

I don't love having to create/replace a table every time I want to join some data from the server to my local file, and would love a better way to read it in. I've tried querying the data without the join, but that loads all the millions of rows before joining them in Python and ends up being extremely slow. I'm looking for a way to do this within the Python script so the process can remain automated. It works fine for now; I just feel like there's a better way. Thanks for any help/suggestions!

import pandas as pd
from sqlalchemy import create_engine

engine1 = create_engine('mssql+pyodbc://@SERVER/DATABASE?trusted_connection=yes&driver=SQL+Server')
engine2 = create_engine('mssql+pyodbc://@SAME_SERVER/ANOTHER_DATABASE?trusted_connection=yes&driver=SQL+Server')

# df_upc is the dataframe of UPCs produced by the earlier (omitted) transformation
df_upc.to_sql('myTable', engine1, if_exists='replace', index=False)

# query has been stripped down for this example;
# read_sql_query already returns a DataFrame, so no pd.DataFrame() wrapper is needed
df_joined = pd.read_sql_query(
    '''SELECT s.qty, s.upc, m.sku
       FROM StyleInventory s
       INNER JOIN marketing.dbo.myTable m ON s.upc = m.upc;''',
    engine2,
)

df_joined.to_excel("joined.xlsx", index=False)
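One alternative suggested in the comments is to inline the UPCs into the query itself with a VALUES table constructor, so no permanent table needs to be created or replaced. A minimal sketch, assuming the StyleInventory table and engine2 from above; the helper name build_upc_join_query is hypothetical, and production code should bind the UPCs as parameters rather than interpolating them into the SQL string:

```python
import pandas as pd

def build_upc_join_query(upcs):
    # Inline the UPC list as a derived table:
    # INNER JOIN (VALUES ('0001'), ('0002'), ...) AS m (upc)
    values = ", ".join(f"('{u}')" for u in upcs)
    return (
        "SELECT s.qty, s.upc\n"
        "FROM StyleInventory s\n"
        f"INNER JOIN (VALUES {values}) AS m (upc) ON s.upc = m.upc;"
    )

df_upc = pd.DataFrame({"upc": ["0001", "0002", "0003"]})  # stand-in for the real df_upc
sql = build_upc_join_query(df_upc["upc"])
# df_joined = pd.read_sql_query(sql, engine2)  # needs a live SQL Server connection
```

Because the filter runs server-side, only the matching inventory rows come back over the wire, instead of all the millions of rows.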
    Related answer [here](https://stackoverflow.com/a/67561917/2144390). That example uses a TVP for an UPDATE, but the SP could have just as easily done a SELECT. – Gord Thompson May 18 '21 at 21:00
  • How large is the `myTable` (rows and columns)? – van May 19 '21 at 02:16
  • This [question/answer](https://stackoverflow.com/a/67420720/99594) or [Use temp table with SQLAlchemy](https://stackoverflow.com/a/44295473/99594) might give you alternative solution using [CTE](https://learn.microsoft.com/en-us/sql/t-sql/queries/with-common-table-expression-transact-sql?view=sql-server-ver15) – van May 19 '21 at 02:38
  • @van, about 4000 rows, often only UPC and maybe a SKU column. – BHOC May 19 '21 at 12:14
  • I see. I still would look at the first link in my comment above using `VALUES` construct. Which RDBMS are you using? – van May 19 '21 at 13:28
  • @van it's SQL Server. After further consideration, a temp table would be the best way to go. But how do I push the data to SQL Server from a Pandas Dataframe and then perform joins on the temp table? – BHOC May 25 '21 at 19:02
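A minimal sketch of the temp-table approach raised in the last comment. The demo below uses an in-memory SQLite database so it runs standalone; with SQL Server the same pattern applies with a `#temp` table instead of `CREATE TEMP TABLE`. The crucial point is that the temp table and the join must happen on the same connection, which `engine.begin()` guarantees here. All table names and data are illustrative:

```python
import pandas as pd
import sqlalchemy

# In-memory SQLite stands in for SQL Server so the demo runs standalone
engine = sqlalchemy.create_engine("sqlite://")
df_upc = pd.DataFrame({"upc": ["0001", "0003"]})  # stand-in for the real df_upc

with engine.begin() as conn:
    # Illustrative inventory table (in the question this already exists)
    conn.exec_driver_sql("CREATE TABLE StyleInventory (upc TEXT, qty INTEGER);")
    conn.exec_driver_sql(
        "INSERT INTO StyleInventory (upc, qty) VALUES (?, ?)",
        [("0001", 5), ("0002", 7), ("0003", 9)],
    )

    # Temp table scoped to this connection (on SQL Server: CREATE TABLE #upc ...)
    conn.exec_driver_sql("CREATE TEMP TABLE tmp_upc (upc TEXT PRIMARY KEY);")
    conn.exec_driver_sql(
        "INSERT INTO tmp_upc (upc) VALUES (?)",
        [(u,) for u in df_upc["upc"]],
    )

    # Join happens server-side; only matching rows are pulled into Python
    df_joined = pd.read_sql_query(
        "SELECT s.qty, s.upc FROM StyleInventory s "
        "INNER JOIN tmp_upc m ON s.upc = m.upc;",
        conn,
    )
```

The temp table is dropped automatically when the connection closes, so nothing permanent is created or replaced on the server.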

0 Answers