I'm working with a pandas DataFrame that is created from a SQL query involving a join operation on three tables using pd.read_sql()
. The DataFrame has about 1 million rows. When I try to query rows based on the value of one column (doi
), it takes about 0.1 to 0.2 seconds for each query.
Interestingly, if I export the DataFrame to a CSV file and then load it back into memory, the same query operation is 5 to 10 times faster.
More interestingly, if I only query the main table (the resulting DataFrame has the same number of rows and two short string columns less), the resulting DataFrame is as fast as the CSV one. Even keeping only one column (doi
) in the DataFrame obtained from the join operation doesn't improve the performance, and it's still slower than the single-table DataFrame with the same number of rows and more columns.
Could you help me understand this phenomenon and show me how to solve the performance issue without exporting to CSV?
Some clarifications:
I am referring to the difference in the query performance of DataFrames generated by SQL query with/without table joins, not the performance of joining DataFrames.
My codes to read data from the SQL Server database and query the resulting DataFrame:
conn = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}', server='XXX', database='XXX', uid='XXX', trusted_connection='yes')
query_string = """
SELECT
t1.*,
t2.short_string_column1,
t3.short_string_column2
FROM
t1
LEFT JOIN
t2
ON
t1.doi = t2.doi
LEFT JOIN
t3
ON
t1.doi = t3.doi
"""
# ~1M rows
df = pd.read_sql(query_string, conn)
# ~20K dois
doi_list = {some interested dois to query}
# this is very slow
for doi in doi_list:
# I achieved satisfying performance on the single-table DataFrame
# but I needed more fields to do my work, so I have to join tables.
required_intermediate_results = df[df.doi.values == doi]
......
# this is 5 to 10 times faster
df.to_csv('df.csv', index=False)
df2 = pd.read_csv('df.csv')
for doi in doi_list:
# I achieved satisfying performance on the single-table DataFrame
# but I needed more fields to do my work, so I have to join tables.
required_intermediate_results = df2[df2.doi.values == doi]
......