Issue
I'm trying to read a table from an MS SQL Server database using Python, specifically SQLAlchemy, pymssql, and pandas.read_sql. I want to execute the query, put the results into a pandas DataFrame, and continue on with my day at a reasonable speed, but a rather simple query (output = 100 MB) takes 5-10 minutes over a LAN connection using Ethernet cables - no WiFi in sight.
The resulting DataFrame, with 3 columns and 3,214,086 entries, took a whopping 423 seconds. Here is its info():
Int64Index: 3214086 entries, 0 to 3214085
Data columns (total 3 columns):
DateTime datetime64[ns]
TagName object
Value float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 98.1+ MB
None
If my math is correct, 100 MB in 423 seconds is about 230 kB/s, which I think is just painfully slow over an Ethernet connection. The bottleneck must be in the SQL server itself.
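For reference, here is the back-of-the-envelope throughput calculation behind that number, using the row count and timing reported above:

# Rough effective throughput, based on the figures from info() and the measured time
size_mb = 98.1      # memory usage reported by info()
rows = 3214086      # number of rows in the DataFrame
seconds = 423       # wall-clock time for pd.read_sql

print(size_mb * 1000 / seconds)   # ~232 kB/s
print(rows / seconds)             # ~7600 rows/s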
Query Info
The SQL server itself is a Wonderware setup, which may or may not have something to do with it. In the following query I'm asking for the values of 6 different tags over a timespan of 1 month, where the resolution is 5000 ms, or 5 seconds.
Here is what the query itself looks like:
sql_query = '''
SELECT DateTime, TagName, Value
FROM Runtime.dbo.AnalogHistory
WHERE
DateTime BETWEEN '2014-05-26 00:00' AND '2014-06-26 00:00'
AND
TagName IN (
'Tag1.ActualValue',
'Tag2.ActualValue',
'Tag3.ActualValue',
'Tag4.ActualValue',
'Tag5.ActualValue',
'Tag6.ActualValue')
AND
wwRetrievalMode = 'Cyclic'
AND
wwResolution = 5000
'''
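To check whether the time is spent in the server/driver or in pandas itself, I could time the raw fetch with pymssql alone - a rough sketch, using the same placeholder credentials as in the connection string below:

import time
import pymssql

# Time the bare DB-API fetch, bypassing SQLAlchemy and pandas entirely
conn = pymssql.connect(server='server', user='user', password='password')
cursor = conn.cursor()

start = time.time()
cursor.execute(sql_query)
rows = cursor.fetchall()   # pull every row over the wire
print(len(rows), 'rows in', time.time() - start, 'seconds')

conn.close()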
And finally, the executeQuery function:
import pandas as pd
import pymssql
import sqlalchemy
def executeQuery(sql_query):
    connection_string = 'mssql+pymssql://user:password@server'
    engine = sqlalchemy.create_engine(connection_string)
    df = pd.read_sql(sql_query, engine)
    if 'DateTime' in df.columns:
        df.DateTime = pd.to_datetime(df.DateTime, infer_datetime_format=True)
    return df
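One variant I'm considering is to reuse a single engine, let pandas parse DateTime during the read via parse_dates, and stream the result in chunks - a sketch only, where executeQueryChunked and the chunk size are my own guesses, not something I've benchmarked:

import pandas as pd
import sqlalchemy

# Create the engine once instead of on every call
engine = sqlalchemy.create_engine('mssql+pymssql://user:password@server')

def executeQueryChunked(sql_query, chunksize=100000):
    # parse_dates converts DateTime while reading, so the separate
    # pd.to_datetime pass afterwards is no longer needed
    chunks = pd.read_sql(sql_query, engine,
                         parse_dates=['DateTime'],
                         chunksize=chunksize)
    return pd.concat(chunks, ignore_index=True)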
Question
What's going on here? Is this a limitation within one of the packages I'm using, or can I speed things up with what I have?
This question seems related, in that the OP was trying to write to a SQL server. There it's stated that it's faster to use 'BULK INSERT' with a .csv file than to go through pandas, but that isn't a viable option for reading from a SQL server. Am I supposed to keep a directory of .csv files for every date/time and every tag!? That would negate the use of the SQL server in the first place, wouldn't it?