1

I currently use pyodbc to read in as pandas dataframe, then I convert it to dask dataframe. Is there anyway to read in as Dask dataframe directly?

Here's the code I'm currently using:

import pandas as pd
import numpy as np
from dask.dataframe import from_pandas

def conn_sql_server(file_path):
    
    #Connect to SQL Server
    conn = pyodbc.connect('Driver= {SQL Server Native Client 11.0};'
                          'Server= Server1;'
                          'Database = Database1;'
                          'Trusted_Connection=yes;')
    
    #run query and ouput the result to df
    query = open(file_path, 'r') 
    df = pd.read_sql_query(query.read(), conn, chunksize=10**4)
    
    chunks =[]
    for chunk in df: 
        chunks.append(chunk)
    
    df_comb = pd.concat(chunks)
    
    query.close() 
    
    return df_comb 

#load in as pandas dataframe
data = conn_sql_server('.\input\data pull.sql')

#Convert to dask dataframe
dd = from_pandas(data, npartitions=3)

I tried to use dd.read_sql_query with pyodbc package or sqlalchemy package. Both returned an AttributeError: module 'dask.dataframe' has no attribute 'read_sql_query'

(1) pyodbc:

import dask.dataframe as dd   

def conn_sql_server(file_path):
    
    #Connect to SQL Server
    conn = pyodbc.connect('Driver= {SQL Server Native Client 11.0};'
                          'Server= Server1;'
                          'Database = Database1;'
                          'Trusted_Connection=yes;')
    
    #run query and ouput the result to df
    query = open(file_path, 'r') 
    df = dd.read_sql_query(query.read(), conn)
    
    query.close() 
    
    return df 

data = conn_sql_server('.\input\data pull.sql')

enter image description here

AttributeError: module 'dask.dataframe' has no attribute 'read_sql_query'

(2) sqlalchemy:

from sqlalchemy import create_engine

Server= 'Server1'
Database = 'Database1'
Driver= 'SQL Server Native Client 11.0'
uri = f'mssql://{Server}/{Database}?driver={Driver}'

query = open('.\input\data pull.sql', 'r')
dd.read_sql_query((query.read(), uri))

enter image description here

As Nick suggested, I've upgrade dask to the latest version using python -m pip install dask distributed --upgrade. I also checked all of the functions listed in the dask.dataframe module using the following script. Found out that there is only read_sql_table, no read_sql_query

from inspect import getmembers, isfunction
import dask.dataframe as dd
getmembers(dd, isfunction)
Nick.Mc
  • 18,304
  • 6
  • 61
  • 91
Jiamei
  • 405
  • 3
  • 14
  • 1
    A quick look at the doco says you can use `dask.dataframe.read_sql_query` https://docs.dask.org/en/stable/generated/dask.dataframe.read_sql_query.html#dask.dataframe.read_sql_query – Nick.Mc May 29 '22 at 12:06
  • @Nick.McDermaid I imported dask.dataframe as dd and changed pd.read_sql_query() into dd.read_sql_query, but got an AttributeError: module 'dask.dataframe' has no attribute 'read_sql_query'. – Jiamei May 30 '22 at 22:30
  • I also tried to establish the connection using sqlalchemy and use dd.read_sql_query, also get the same AttrivuteError. – Jiamei May 30 '22 at 23:36
  • Sounds like you're not the only person with this problem https://stackoverflow.com/questions/71682006/dask-dataframe-has-no-attribute-read-sql-query I'm just going by what I see in the docs – Nick.Mc May 31 '22 at 00:21
  • @Nick.McDermaid Yep, I've went through each individual post about this issue and unfortunately, I still haven't found an answer. – Jiamei May 31 '22 at 03:25
  • Can you perhaps post whatever version of dask you are using? – Nick.Mc May 31 '22 at 03:38
  • I installed the dask[complete] version just two days ago, I assume it's the most up-to-date version. – Jiamei May 31 '22 at 03:52
  • The SO I posted said this: _I was able to reproduce this with dask=2022.1.0, and I was able to fix it by upgrading to dask=2022.3.0_ so maybe you should go through all the posts again. You really need to know what version you have to help troubleshoot this. I suggest you find out and post it – Nick.Mc May 31 '22 at 03:59
  • This versioning nightmare has spawned it's own set of library management tools (i.e. pip, conda etc.) – Nick.Mc May 31 '22 at 04:04
  • I just upgraded to 2022.5.2 version, still returning the same error. – Jiamei May 31 '22 at 04:38
  • I'm just walking through troubelshooting steps here until someone arrives that knows what they're talking about. Perhaps you could use this to list the atrributes in the module and see if it has a different name or is case sensitive or something. https://stackoverflow.com/questions/139180/how-to-list-all-functions-in-a-module – Nick.Mc May 31 '22 at 04:44
  • This is such a great idea! Thanks for bring that up, Nick! Anyways, I just checked, but there is only read_sql_table, but no read_sql_query. However, read_sql_table reads in the whole table instead of the query, so I don't think it works for my case – Jiamei May 31 '22 at 05:04
  • As a test, does read_sql_table work? – Nick.Mc May 31 '22 at 05:08
  • I tried to load in a table, it worked. – Jiamei May 31 '22 at 05:25
  • I suggest you raise a Git issue here https://github.com/dask/dask/issues make sure you quote the version and reproduce the function list and reference the doc page. In fact probably reference this SO – Nick.Mc May 31 '22 at 05:31
  • oh and it really helps to convert your pictures to text, as then they are searchable. – Nick.Mc May 31 '22 at 08:39
  • OK, will do! Thanks for all of your help! – Jiamei May 31 '22 at 16:46
  • Also... when you raise the git issue post it back in here. Now everyone searching for that error message has all the info. – Nick.Mc May 31 '22 at 23:18

0 Answers0