Reading an SQL query into a Dask DataFrame

Question

I'm trying to create a function that takes an SQL SELECT query as a parameter and uses dask to read its results into a dask DataFrame with the dask.dataframe.read_sql_query function. I am new to dask and to SQLAlchemy. I first tried this:

import dask.dataframe as dd

query = "SELECT name, age, date_of_birth from customer"
df = dd.read_sql_query(sql=query, con=con_string, index_col="name", npartitions=10)

As you probably already know, this won't work because the sql parameter has to be an SQLAlchemy selectable and more importantly, TextClause isn't supported.

I then wrapped the query behind a select like this:

import dask.dataframe as dd
from sqlalchemy import sql

query = "SELECT name, age, date_of_birth from customer"
sa_query = sql.select(sql.text(query))
df = dd.read_sql_query(sql=sa_query, con=con_string, index_col="name")

This fails too, with a very weird error that I have been trying to solve. The problem is that dask needs to infer the types of the columns, and it does so by reading the first head_rows rows in the table (5 rows by default) and inferring the types from them. This line in the dask codebase adds a LIMIT ? to the query, which ends up being

SELECT name, age, date_of_birth from customer LIMIT param_1

The param_1 doesn't get substituted at all with the right value, 5 in this case. It then fails on the next line, https://github.com/dask/dask/blob/main/dask/dataframe/io/sql.py#L119, which evaluates the SQL expression.

sqlalchemy.exc.ProgrammingError: (mariadb.ProgrammingError) You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'SELECT name, age, date_of_birth from customer 
 LIMIT ?' at line 1
[SQL: SELECT SELECT name, age, date_of_birth from customer 
 LIMIT ?]
[parameters: (5,)]
(Background on this error at: https://sqlalche.me/e/14/f405)

I can't understand why param_1 wasn't substituted with the value of head_rows. One can see from the error message that it detects there's a parameter that needs to be used for the substitution but for some reason it doesn't actually substitute it.

Perhaps, I didn't correctly create the SQLAlchemy selectable?

I can simply use pandas.read_sql and create a dask dataframe from the resulting pandas dataframe but that defeats the purpose of using dask in the first place.

I have the following constraints:

I cannot change the function to accept a ready-made sqlalchemy selectable. This feature will be added to a private library used at my company, and various projects using this library do not use sqlalchemy.
Passing meta to the custom function is not an option because it would require the caller to create it. However, passing a meta attribute to read_sql_query and setting head_rows=0 is completely OK as long as there's an efficient way to retrieve or create it.
While dask-sql might work for this case, using it is not an option, unfortunately.

How can I go about correctly reading an SQL query into dask dataframe?

Could you elaborate why passing `meta` is not an option? If pandas approach works, it should be possible to run a small query to set up meta and then pass it to dask... — SultanOrazbayev, May 25 '22 at 16:10
@SultanOrazbayev Oh I meant passing `meta` to the custom function is not an option because it would require the caller to create it. Passing `meta` to `read_sql_query` is completely ok if there's a way to retrieve it efficiently. I realise I should edit my question to reflect that. Your suggestion is brilliant by the way. I can't believe I didn't think of that. What kind of query can I run to set up the `meta`. Something like adding a limit to the original sql query? — mkab, May 26 '22 at 20:09
actually, that's what dask does behind the scenes (execute a small query using pandas to figure out meta). By doing it manually however we can remove one potential source of error (re: param_1 substitution you mentioned). — SultanOrazbayev, May 27 '22 at 02:10

score 2 · Answer 1 · answered May 27 '22 at 03:58

The crux of the problem is this line:

sa_query = sql.select(sql.text(query))

What is happening is that sql.select wraps text that is already a complete SELECT statement, so the rendered query starts with SELECT SELECT and the database rejects it.

Let's first create a test database:

# create a test database (using https://stackoverflow.com/a/64898284/10693596)
from sqlite3 import connect

from dask.datasets import timeseries

con = "test.sql"  # same file that the "sqlite:///test.sql" URI below points to
db = connect(con)

# create a pandas df and store (timestamp is dropped to make sure
# that the index is numeric)
df = (
    timeseries(start="2000-01-01", end="2000-01-02", freq="1h", seed=0)
    .compute()
    .reset_index(drop=True)
)
df.to_sql("ticks", db, if_exists="replace")

Next, let's try to get things working with pandas without sqlalchemy:

from pandas import read_sql_query

con = "sqlite:///test.sql"
query = "SELECT * FROM ticks LIMIT 3"
meta = read_sql_query(sql=query, con=con).set_index("index")

print(meta)
#          id    name         x         y
# index
# 0       998  Ingrid  0.760997 -0.381459
# 1      1056  Ingrid  0.506099  0.816477
# 2      1056   Laura  0.316556  0.046963
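
When the query arrives as an arbitrary string, one possible way to fetch a small sample without editing the query text is to wrap it in a subquery and put the LIMIT on the outer statement. The sketch below is only an illustration: it uses the standard-library sqlite3 module with an in-memory table, and wrap_with_limit and the sample_source alias are hypothetical names, not part of pandas or dask. Whether the wrapped form is accepted depends on the database and on the query itself.

```python
import sqlite3


def wrap_with_limit(query: str, n_rows: int = 5) -> str:
    """Wrap a SELECT in a subquery and cap the row count (hypothetical helper)."""
    # Drop surrounding whitespace and a trailing semicolon so the query nests cleanly.
    inner = query.strip().rstrip(";").strip()
    return f"SELECT * FROM ({inner}) AS sample_source LIMIT {int(n_rows)}"


# In-memory stand-in for the real database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (id INTEGER, name TEXT, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO ticks VALUES (?, ?, ?, ?)",
    [(i, f"name{i}", i * 0.5, -i * 0.5) for i in range(10)],
)

sample_query = wrap_with_limit("SELECT id, name, x FROM ticks;", n_rows=3)
cursor = conn.execute(sample_query)
sample_columns = [col[0] for col in cursor.description]  # column names of the sample
sample_rows = cursor.fetchall()  # at most n_rows rows
```

The same wrapped string could be handed to pandas.read_sql_query to build a small frame from which meta is derived, which keeps the caller's query untouched.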

Now, let's add sqlalchemy functions:

from pandas import read_sql_query
from sqlalchemy.sql import text, select

con = "sqlite:///test.sql"
query = "SELECT * FROM ticks LIMIT 3"
sa_query = select(text(query))
meta = read_sql_query(sql=sa_query, con=con).set_index("index")
# OperationalError: (sqlite3.OperationalError) near "SELECT": syntax error
# [SQL: SELECT SELECT * FROM ticks LIMIT 3]
# (Background on this error at: https://sqlalche.me/e/14/e3q8)
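
The error text points at the duplicated keyword rather than at the LIMIT placeholder. As a rough check, the sketch below uses the standard-library sqlite3 module and an in-memory table (both assumptions for illustration; the question uses MariaDB) to show that a bound LIMIT ? parameter is filled in by the driver at execution time, while the doubled SELECT is rejected by the parser.

```python
import sqlite3

# In-memory stand-in for the real database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO ticks VALUES (?, ?)", [(i, f"name{i}") for i in range(10)]
)

# A bound LIMIT parameter works: the driver supplies the value 5 at execution time.
rows = conn.execute("SELECT * FROM ticks LIMIT ?", (5,)).fetchall()

# The duplicated keyword is a syntax error regardless of the parameter.
try:
    conn.execute("SELECT SELECT * FROM ticks LIMIT ?", (5,))
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)
```

This is consistent with the traceback in the question, which lists the parameters as (5,) next to a statement that begins with SELECT SELECT.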

Note the SELECT SELECT, which results from running sqlalchemy.select on a string that is already a complete query. How can this be fixed? In general, I don't think there's a safe and robust way of transforming arbitrary SQL queries into their sqlalchemy equivalent, but if this is for an application where you know that users will only run SELECT statements, you can manually strip the leading SELECT from the query before passing it to sqlalchemy.select:

from dask.dataframe import read_sql_query
from sqlalchemy.sql import select, text

con = "sqlite:///test.sql"
query = "SELECT * FROM ticks"


def _remove_leading_select_from_query(query):
    if query.startswith("SELECT "):
        return query.replace("SELECT ", "", 1)
    else:
        return query


sa_query = select(text(_remove_leading_select_from_query(query)))
ddf = read_sql_query(sql=sa_query, con=con, index_col="index")

print(ddf)
print(ddf.head(3))
# Dask DataFrame Structure:
#                   id    name        x        y
# npartitions=1
# 0              int64  object  float64  float64
# 23               ...     ...      ...      ...
# Dask Name: from-delayed, 2 tasks
#          id    name         x         y
# index
# 0       998  Ingrid  0.760997 -0.381459
# 1      1056  Ingrid  0.506099  0.816477
# 2      1056   Laura  0.316556  0.046963
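
The helper above only matches an uppercase "SELECT " at the very start of the string. A slightly more tolerant variant, sketched here with a regular expression, also accepts leading whitespace, lowercase keywords, and a newline after the keyword. The name strip_leading_select is hypothetical, and the function is still a string hack rather than an SQL parser.

```python
import re

# Optional leading whitespace, then the SELECT keyword in any case,
# followed by at least one whitespace character.
_LEADING_SELECT = re.compile(r"^\s*select\s+", flags=re.IGNORECASE)


def strip_leading_select(query: str) -> str:
    """Return the query without its leading SELECT keyword, if present."""
    return _LEADING_SELECT.sub("", query, count=1)
```

Queries that do not start with SELECT, such as ones beginning with a WITH clause, are returned unchanged and would still need separate handling.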

Thanks for the great explanation! I guessed that the crux of the problem was with the `sql.select` statement. So, I am sure that only SELECT statements will be called with the function. I think your solution to replace the `SELECT` still fails though. Did you mean to also pass the `meta` attribute to `read_sql_query`? I did it as you explained here and I still get the same error. Note that the error on dask is not about the `SELECT SELECT` (though that's on pandas' end) but about the `LIMIT` parameter not being substituted — mkab, May 27 '22 at 07:37
@mkab: yes, please refer to the full dask snippet (LIMIT is removed in it since it's not necessary for dask). — SultanOrazbayev, May 27 '22 at 07:38
I mean, the query is modified in the last dask snippet (LIMIT is removed there, and it was used in pandas only for quick checks)... that leaves the question of what happens if user passes a query containing LIMIT, but... I do not think one can achieve a 100% conversion of any query. — SultanOrazbayev, May 27 '22 at 07:44
sorry, perhaps I'm slow here but I'm not sure I understood you correctly. Did you mean to pass `meta` and set `head_rows=0` because if you don't dask will automatically add the LIMIT and then the same problem arises. Check here: https://github.com/dask/dask/blob/main/dask/dataframe/io/sql.py#L116:L119 If you did mean to add `meta`, could you edit your answer to reflect it so that it can help others? — mkab, May 27 '22 at 08:00
In the last dask snippet, you removed SELECT not LIMIT. Yes, I do not think we can achieve 100% conversion of any query as it's very complicated — mkab, May 27 '22 at 08:02
