
Issue

I'm trying to read a table from an MS SQL Server using Python, specifically SQLAlchemy, pymssql, and pandas.read_sql. I want to execute the query, put the results into a pandas DataFrame, and continue on with my day at a reasonable speed, but a rather simple query (output = 100 MB) is taking almost 5-10 minutes over a LAN connection using Ethernet cables - no WiFi in sight.

The resulting DataFrame with 3 columns and 3214086 entries took a whopping 423 seconds to load. Here is its info():

Int64Index: 3214086 entries, 0 to 3214085
Data columns (total 3 columns):
DateTime    datetime64[ns]
TagName     object
Value       float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 98.1+ MB
None

If my math is correct, 100 MB in 423 seconds is about 230 kB/s, which over an Ethernet connection is just painfully slow. The bottleneck must be somewhere in the SQL server itself.
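
Just to double-check the arithmetic (a rough estimate, using the ~98 MB figure reported by info() rather than the rounded 100 MB):

# Rough throughput of the transfer described above.
payload_mb = 98.1              # memory usage reported by df.info()
elapsed_s = 423                # time taken by pd.read_sql

print(round(payload_mb * 1024 / elapsed_s))   # ~237 kB/s, far below LAN speeds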


Query Info

The SQL server itself is a Wonderware setup, which may or may not have something to do with it. In the following query I'm asking for the values of 6 different tags over a timespan of 1 month, with a resolution of 5000 ms, or 5 seconds.

Here is what the query itself looks like:

sql_query = '''
SELECT DateTime, TagName, Value

    FROM Runtime.dbo.AnalogHistory

    WHERE
        DateTime BETWEEN '2014-05-26 00:00' AND '2014-06-26 00:00'
    AND 
        TagName IN (
            'Tag1.ActualValue', 
            'Tag2.ActualValue', 
            'Tag3.ActualValue', 
            'Tag4.ActualValue', 
            'Tag5.ActualValue', 
            'Tag6.ActualValue')
    AND
        wwRetrievalMode = 'Cyclic'
    AND
        wwResolution = 5000
'''
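
As a quick sanity check (assuming one cyclic sample per tag every 5 seconds over the 31-day window), the row count is roughly what the query asks for:

# Expected rows: 6 tags, one cyclic sample every 5 seconds, for 31 days.
tags = 6
days = 31                      # 2014-05-26 through 2014-06-26
seconds_per_day = 24 * 60 * 60
resolution_s = 5

expected_rows = tags * days * seconds_per_day // resolution_s
print(expected_rows)           # 3214080, right next to the 3214086 rows returned
                               # (the small difference is likely the inclusive end timestamp)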

And finally, the executeQuery function:

import pandas as pd
import pymssql
import sqlalchemy

def executeQuery(sql_query):
    connection_string = 'mssql+pymssql://user:password@server'
    engine = sqlalchemy.create_engine(connection_string)

    df = pd.read_sql(sql_query, engine)

    if 'DateTime' in df.columns:
        df.DateTime = pd.to_datetime(df.DateTime, infer_datetime_format=True)

    return df

Question

What's going on here? Is this a limitation of one of the packages I'm using, or can I speed things up with what I have?

This question seems related, in that the OP was trying to write to an SQL server. There it's stated that it's faster to use 'BULK INSERT' via a .csv file instead of going through pandas, but that is not a viable option for reading from an SQL server. Am I supposed to keep a directory of .csv files covering all the date/times and all the tags!? That would negate the point of the SQL server in the first place, wouldn't it?

cbcoutinho
  • Can you also post which indices are defined for the `Runtime.dbo.AnalogHistory` table and the execution plan for this query? – MaxU - stand with Ukraine Jun 20 '16 at 16:12
  • I don't know how to do that; is that an option I give to the query, or something I determine through MS Management Studio? (A minimal sketch for listing the indexes follows these comments.) – cbcoutinho Jun 20 '16 at 16:14
  • Please check on SO or ask Google how to find indexes and the execution plan. Besides that, you would want to know how many rows there are in this table in total and how big it is... You are fetching a whopping 3.2M rows for just one month, one `wwResolution` and 6 tags, so your table might contain much more, and if your select statement ends up doing an FTS (Full Table Scan) - that means MS SQL will have to read the whole table first, apply the filters and then return the data to the client. You can see that in the query execution plan. – MaxU - stand with Ukraine Jun 20 '16 at 16:42
  • Besides that, did you compare how long it takes to run this query from your client machine using the MS client tools against (Python + SQLAlchemy + Pandas)? – MaxU - stand with Ukraine Jun 20 '16 at 16:43
  • @MaxU, Running the same query on another desktop with MSSQLMS installed had a similar runtime, within 15 sec of what I saw using the Python workflow on my Linux box. The table I'm calling from is indeed enormous, which could be the reason for the traffic. On top of that, the execution plan says that 100% of the cost is in the final 'Remote Query' step, where it 'Sends a sql query to another than the current sql server'. So far I only know how to access the SQL server from desktops in my office - I will attempt to access the server itself and see if I can execute this any faster. – cbcoutinho Jun 20 '16 at 17:07
  • So basically you were able to confirm that the slowest part is MS SQL and/or the network. I think that when you executed exactly the same query a few times, the vast majority of the data pages needed for the result set were already in the MS SQL Server buffer pool, and because of that `100% of the cost is in the final 'Remote Query' step, where it 'Sends a sql query to another than the current sql server'`; otherwise it would have had to read the data from disk, which is usually the slowest part (unless you have an __extremely__ slow network). So, in any case, you would have to check the execution plan... – MaxU - stand with Ukraine Jun 20 '16 at 17:51
  • I found the execution plan related to the remote query, and placed it into a gist [here](https://gist.github.com/cbcoutinho/f78c3867e2f7dba692bb241b4210a67b). I don't see it saying anything about a Full Table Scan, which I imagine would be pretty inefficient. Anything else? – cbcoutinho Jun 20 '16 at 20:01
  • `EstimateIO="0", EstimatedPagesCached="102048"` - as expected, all your data has been read from cache. This looks to me like the query will be processed by another MS SQL server - `NodeId="1", PhysicalOp="Remote Query"` - but I'm just guessing, as I don't know MS SQL, so you'd better open a new question asking how to improve this query, including this information, and tag it with the `sql-server` tag. – MaxU - stand with Ukraine Jun 20 '16 at 20:40
  • Thanks MaxU, I appreciate the help. I'll create a new question. – cbcoutinho Jun 20 '16 at 20:42
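
A minimal sketch for pulling the index list with the tools already in use here (assuming read access to the catalog views; the result may come back empty if AnalogHistory is really a view over the historian's remote storage, as the 'Remote Query' step suggests). The execution plan itself is easiest to inspect via 'Display Estimated Execution Plan' in Management Studio.

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mssql+pymssql://user:password@server')

# List any indexes defined on the table; empty output would mean
# AnalogHistory is not an ordinary indexed table on this server.
index_query = '''
SELECT i.name, i.type_desc, i.is_unique

    FROM Runtime.sys.indexes AS i

    WHERE i.object_id = OBJECT_ID('Runtime.dbo.AnalogHistory')
'''
print(pd.read_sql(index_query, engine))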

1 Answer


I guess you are calling your executeQuery() function in a loop - quite a few times - and each time you are re-creating the SQLAlchemy engine, which makes it slow.

So try to create your DB connection once and use it multiple times:

import pandas as pd
import pymssql
import sqlalchemy

connection_string = 'mssql+pymssql://user:password@server'
engine = sqlalchemy.create_engine(connection_string)

def executeQuery(sql_query, engine=engine):
    df = pd.read_sql(sql_query, engine)

    # I'm not sure that you really need it ...
    if 'DateTime' in df.columns:
        df.DateTime = pd.to_datetime(df.DateTime, infer_datetime_format=True)

    return df
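
For example (the second query string is hypothetical, just for illustration), repeated calls now share the one engine created at import time:

# Both calls reuse the module-level engine; only the first pays any
# connection-setup cost.
df_data = executeQuery(sql_query)   # the query from the question
df_size = executeQuery('SELECT COUNT(*) AS total_rows FROM Runtime.dbo.AnalogHistory')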

PS: if your DateTime column on the MS SQL side is of the DateTime data type, SQLAlchemy should map it to the Python datetime type, so in the end it should already come back as np.datetime64 ...
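
Alternatively (a sketch, untested against this Wonderware setup), you could let read_sql do the conversion while fetching, via its parse_dates argument, and drop the explicit to_datetime call:

def executeQuery(sql_query, engine=engine):
    # parse_dates converts the DateTime column as the rows are read.
    return pd.read_sql(sql_query, engine, parse_dates=['DateTime'])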

If you are calling it once or very few times, then I would do the following:

  • Check the execution plan (used indexes, data skewness, etc.)
  • Execute this query on the MS SQL server itself and measure the execution time
  • Execute this query from your client machine over the network and measure the execution time

After that you'll see where you're losing the most time (a rough sketch for the client-side timing follows below).
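
For example (a sketch, assuming the executeQuery() and sql_query defined earlier; the server-side number is easiest to read off in Management Studio):

import time

start = time.perf_counter()
df = executeQuery(sql_query)            # same query string as in the question
elapsed = time.perf_counter() - start

# Wall-clock time covering the query, the network transfer and the DataFrame build.
print('{} rows in {:.1f} s'.format(len(df), elapsed))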

MaxU - stand with Ukraine
  • Taking the connection statement out of the function makes sense; however, that doesn't seem to be improving things on my side. Secondly, I haven't been able to get the 'parse_dates' option to work yet. – cbcoutinho Jun 20 '16 at 16:17