My question is about ETL best practices and how to make this process simpler and more efficient.
I have some data that I pull from an API and then transform. Let's call the result "df_upc" and assume it's a Pandas dataframe of UPCs; I'll omit the earlier transformation steps since they aren't relevant here. This data is pushed to SQL Server via the Pandas to_sql method, creating/replacing a table, and then a SELECT statement in SQL joins that table against some other data to get current inventory levels. The SQL Server table that holds the inventory levels contains millions of SKUs.
I don't love having to create/replace a table on the server every time I want to join some server data to my local file, and would love a better way to read it in. I've tried querying the data without the join, but that loads all of the millions of rows before joining them in Python and ends up being extremely slow. I'm looking for a way to do this within the Python script so the process can stay automated. It works fine for now; I just feel like there's a better way. Thanks for any help/suggestions!
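For reference, the slow approach I tried looked roughly like this (just a sketch, using the same simplified table/column names and engine as in the full example below; the problem is that it pulls the entire inventory table into Pandas before joining):

# Slow alternative: read the whole StyleInventory table (millions of rows)
# into memory, then join it to df_upc locally in Pandas
df_inventory = pd.read_sql_query('SELECT qty, upc FROM StyleInventory;', engine2)
df_joined = df_upc.merge(df_inventory, on='upc', how='inner')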
import pandas as pd
import sqlalchemy

# Windows (trusted) authentication; both databases live on the same server
engine1 = sqlalchemy.create_engine('mssql+pyodbc://@SERVER/DATABASE?trusted_connection=yes&driver=SQL+Server')
engine2 = sqlalchemy.create_engine('mssql+pyodbc://@SAME_SERVER/ANOTHER_DATABASE?trusted_connection=yes&driver=SQL+Server')

# Push the transformed UPC dataframe to the server, replacing the table each run
df_upc.to_sql('myTable', engine1, if_exists='replace', index=False)

# Query has been stripped down for this example.
# read_sql_query already returns a DataFrame, so no extra pd.DataFrame() wrap is needed
df_joined = pd.read_sql_query(
    '''SELECT s.qty, s.upc, m.sku
       FROM StyleInventory s
       INNER JOIN marketing.dbo.myTable m ON s.upc = m.upc;''', engine2)

df_joined.to_excel("joined.xlsx", index=False)