
I'm working with a pandas DataFrame created with pd.read_sql() from a SQL query that joins three tables. The DataFrame has about 1 million rows. When I query rows based on the value of one column (doi), each query takes about 0.1 to 0.2 seconds.

Interestingly, if I export the DataFrame to a CSV file and then load it back into memory, the same query operation is 5 to 10 times faster.

More interestingly, if I query only the main table (the resulting DataFrame has the same number of rows, just two fewer short string columns), querying it is as fast as the CSV-loaded one. Even keeping only one column (doi) of the DataFrame obtained from the join doesn't improve performance: it is still slower than the single-table DataFrame with the same number of rows and more columns.

Could you help me understand this phenomenon and show me how to solve the performance issue without exporting to CSV?

Some clarifications:

I am referring to the difference in query performance between DataFrames generated by a SQL query with and without table joins, not the performance of joining DataFrames in pandas.

My code to read data from the SQL Server database and query the resulting DataFrame:

conn = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}', server='XXX', database='XXX', uid='XXX', trusted_connection='yes')

query_string = """
    SELECT 
        t1.*,
        t2.short_string_column1,
        t3.short_string_column2
    FROM 
        t1
    LEFT JOIN 
        t2
    ON 
        t1.doi = t2.doi
    LEFT JOIN 
        t3
    ON 
        t1.doi = t3.doi
    """
# ~1M rows
df = pd.read_sql(query_string, conn)

# ~20K dois
doi_list = {some DOIs of interest to query}

# this is very slow
for doi in doi_list:
    # I achieved satisfactory performance on the single-table DataFrame,
    # but I need more fields for my work, so I have to join the tables.
    required_intermediate_results = df[df.doi.values == doi]
    ......

# this is 5 to 10 times faster
df.to_csv('df.csv', index=False)
df2 = pd.read_csv('df.csv')

for doi in doi_list:
    # I achieved satisfactory performance on the single-table DataFrame,
    # but I need more fields for my work, so I have to join the tables.
    required_intermediate_results = df2[df2.doi.values == doi]
    ......
Tom Leung

1 Answer

TL;DR on the phenomenon: the experiment below shows that pd.read_sql() needs proportionally more memory than pd.read_csv(), and the difference in memory usage grows with dataset size. Since the difference easily runs into the gigabytes for datasets with more than 1 million rows, it can be large enough to affect the performance of subsequent operations when using pd.read_sql().

Since both DataFrames use exactly the same amount of memory once dtype parity is ensured, and since the memory overhead of instantiating the DB engine is independent of the dataset size being loaded, the results corroborate the hypothesis that pd.read_sql() is proportionally more memory-intensive than pd.read_csv().

Recommendation to OP: Given the identified behaviour of pd.read_sql(), probably the easiest solution is to save the data as a .csv once and load it from there afterwards. You could divide your program into two separate scripts: the first loads the data and stores it, the second performs the querying.
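
A minimal sketch of that two-script split, reusing the query from the question (the cache file name and the contents of doi_list are placeholders):

# --- Script 1: load from SQL Server and cache to disk (run once) ---
import pandas as pd
import pyodbc

conn = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}', server='XXX',
                      database='XXX', uid='XXX', trusted_connection='yes')

query_string = """
    SELECT t1.*, t2.short_string_column1, t3.short_string_column2
    FROM t1
    LEFT JOIN t2 ON t1.doi = t2.doi
    LEFT JOIN t3 ON t1.doi = t3.doi
    """

# materialise the three-table join once and cache it on disk
df = pd.read_sql(query_string, conn)
df.to_csv('joined_cache.csv', index=False)


# --- Script 2: reload the cache and run the per-doi queries (run as often as needed) ---
import pandas as pd

df2 = pd.read_csv('joined_cache.csv')

doi_list = set()  # fill with the ~20K DOIs of interest

for doi in doi_list:
    required_intermediate_results = df2[df2.doi.values == doi]
    # ... downstream processing ...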

Approach:

I used the SO answer here to figure out how much memory the Python process uses, running the following two pieces of code in a Jupyter notebook.

import os
import psutil
import sys

import psycopg2
import pandas as pd
import numpy as np

dbname="postgres"
user="postgres"
password="root"
host="localhost"
port="5432"

connection_string = f"dbname={dbname} user={user} password={password} host={host} port={port}"
conn = psycopg2.connect(connection_string)

df = pd.read_sql("SELECT * FROM staging_spread", con=conn)

process = psutil.Process()
print(df.shape)  # (195976, 6)
print(process.memory_info().rss/(1024*1024))  # 246.49 MB
print(sys.getsizeof(df)/(1024*1024))  # 30.1 MB

And after restarting the kernel:

import os
import psutil
import sys

import psycopg2
import pandas as pd
import numpy as np

df2 = pd.read_csv("example.csv", parse_dates=['crawl_ts', 'market_ts'])

process = psutil.Process()
print(df2.shape)  # (195976, 6)
print(process.memory_info().rss/(1024*1024))  # 112.9 MB
print(sys.getsizeof(df2)/(1024*1024))  # 30.1 MB

The dataset I loaded contains almost 200,000 entries. I repeated the same experiment using 25k, 50k, 75k, and 100k entries; the total memory required in each case is shown in the table below (a sketch of one such measurement run follows the table):

| N Rows | Total Memory pd.read_sql() (MB) | Total Memory pd.read_csv() (MB) | Abs Diff (MB) | Diff Increase (MB) |
|-------:|--------------------------------:|--------------------------------:|--------------:|-------------------:|
|  25000 |                          108.56 |                           91.27 |         17.29 |                n/a |
|  50000 |                          130.26 |                           94.60 |         35.66 |              18.37 |
|  75000 |                          151.30 |                           98.54 |         52.76 |              17.10 |
| 100000 |                          172.77 |                          101.68 |         71.09 |              18.33 |
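
As a sketch of how each measurement run could look (I'm assuming the row subsets are produced with a LIMIT clause, and that each run happens in a fresh interpreter so the RSS numbers stay comparable):

# measure_sql_memory.py -- run once per row count in a fresh process,
# e.g. `python measure_sql_memory.py 25000`
import sys

import psutil
import psycopg2
import pandas as pd

n_rows = int(sys.argv[1])

conn = psycopg2.connect("dbname=postgres user=postgres password=root host=localhost port=5432")

# LIMIT is an assumption about how the 25k/50k/75k/100k subsets were built
df = pd.read_sql(f"SELECT * FROM staging_spread LIMIT {n_rows}", con=conn)

process = psutil.Process()
print(df.shape)
print(round(process.memory_info().rss / (1024 * 1024), 2), "MB total process memory")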

From this table, we can see that for every extra 25,000 rows, the pd.read_sql() approach requires roughly 18 MB more than the pd.read_csv() approach. Extrapolating to 1 million rows, pd.read_sql() would require about 720 MB more than pd.read_csv() for this dataset. The dataset I used only contained 6 columns; the difference will of course grow further for datasets with more columns!
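
The back-of-the-envelope arithmetic behind the 720 MB figure:

# ~18 MB of extra memory per additional 25,000 rows when using pd.read_sql()
extra_mb_per_25k = 18
chunks_in_1m_rows = 1_000_000 / 25_000          # 40 blocks of 25k rows
print(chunks_in_1m_rows * extra_mb_per_25k)     # ~720 MB more than pd.read_csv()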

Additional Info: df.info() for both datasets showing dataset parity

df.info() for df (created using pd.read_sql())

print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195976 entries, 0 to 195975
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   crawl_ts         195976 non-null  datetime64[ns]
 1   crawler_version  195976 non-null  object        
 2   ticker           195976 non-null  object        
 3   market_ts        195976 non-null  datetime64[ns]
 4   bid              195976 non-null  float64       
 5   ask              195976 non-null  float64       
dtypes: datetime64[ns](2), float64(2), object(2)
memory usage: 9.0+ MB
None

df.info() for df2 (created using pd.read_csv())

print(df2.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195976 entries, 0 to 195975
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   crawl_ts         195976 non-null  datetime64[ns]
 1   crawler_version  195976 non-null  object        
 2   ticker           195976 non-null  object        
 3   market_ts        195976 non-null  datetime64[ns]
 4   bid              195976 non-null  float64       
 5   ask              195976 non-null  float64       
dtypes: datetime64[ns](2), float64(2), object(2)
memory usage: 9.0+ MB
None
Simon David
  • @topsail I've just updated my answer based on your comments. First, datasets in both approaches require the same amount of memory when assuring dtype parity. Second, I ran the same analysis for different dataset sizes and we can see that the difference increases proportionally - this corroborates the previously implicit assumption that more data requires more overhead when using `pd.read_sql()` – Simon David May 16 '23 at 05:05
  • @topsail I've added the `df.info()` for both datasets; in short the data revolves around typical stock market data with bid and ask information and some timestamps. – Simon David May 17 '23 at 05:32
  • hmmm. Well when I have time I will investigate too. It's interesting. But I feel I should point out that saving time by 1) querying the database and loading a dataframe, then 2) saving the dataframe to csv, then 3) reloading the dataframe from csv ... well this does not actually eliminate the time spent querying the database (step 1 still has to happen). But it could be useful if you were repeating step 1 a lot for some reason -- you could think of it as a type of caching. Also it's unclear if this explains the reason for the "slower" query ... more memory doesn't necessarily mean 5-10x longer. – topsail May 17 '23 at 13:00