90

I have a pandas DataFrame called 'df'.

How can I do something like the below:

df.query("select * from df")

Thank you.

For those who know R: there is a library called sqldf that lets you execute SQL code against data frames in R. My question is basically: is there a library like sqldf in Python?

user1717828
Miguel Santos

9 Answers

135

This is not what pandas.query is designed for. You can look at the pandasql package (the same idea as sqldf in R):

import numpy as np
import pandas as pd
import pandasql as ps

df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan],
                   [1234, 'Customer A', np.nan, '333 Street'],
                   [1233, 'Customer B', '444 Street', '333 Street'],
                   [1233, 'Customer B', '444 Street', '666 Street']],
                  columns=['ID', 'Customer', 'Billing Address', 'Shipping Address'])

q1 = """SELECT ID FROM df"""

print(ps.sqldf(q1, locals()))

     ID
0  1234
1  1234
2  1233
3  1233

Update 2020-07-10

With an updated pandasql you no longer need to pass the environment explicitly:

ps.sqldf("select * from df")
AdamAL
BENY
  • I get this error due to numpy not being imported: Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 'np' is not defined – Jas Aug 15 '19 at 20:21
  • 1
    @Jas that was just the sample data; if you change np.nan to 1000, the error will go away – BENY Aug 15 '19 at 21:09
  • 1
    FYI, it doesn't look like this works anymore. I get the error `AttributeError: 'Connection' object has no attribute 'cursor'`. It might work on older versions of `pandas`; I'm using v1.3.4. – Matt Sosna Dec 01 '21 at 21:38
  • @MattSosna it still works for me ~ – BENY Dec 01 '21 at 21:45
  • Not sure if pandasql is maintained anymore. DuckDB might be a better option; performance benchmark: https://duckdb.org/2021/05/14/sql-on-pandas.html – Pramit Jun 30 '23 at 15:15
29

After using this for some time, I realised the easiest way is to just do:

from pandasql import sqldf

output = sqldf("select * from df")

It works like a charm, where df is a pandas DataFrame. You can install pandasql from https://pypi.org/project/pandasql/

Miguel Santos
  • 1
    Nice, I was still on the old version of pandasql; I have also updated my answer, thanks ~ – BENY Jul 10 '20 at 14:28
25

A much better solution is to use duckdb. It is much faster than pandasql because it does not have to load the entire dataset into SQLite and back into pandas.

# pip install duckdb
import pandas as pd
import duckdb

test_df = pd.DataFrame.from_dict({"i": [1, 2, 3, 4], "j": ["one", "two", "three", "four"]})

duckdb.query("SELECT * FROM test_df where i>2").df()  # returns the result as a DataFrame

Performance improvement over pandasql, tested on the NYC yellow cab trip data (~120 MB of CSV):

nyc = pd.read_csv('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv',low_memory=False)
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
pysqldf("SELECT * FROM nyc where trip_distance>10")
# wall time 16.1s
duckdb.query("SELECT * FROM nyc where trip_distance>10").df()
# wall time 183ms

That is a speed improvement of roughly 100x.

This article gives good details and claims 1000x improvement over pandasql: https://duckdb.org/2021/05/14/sql-on-pandas.html
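
If you prefer an explicit table name instead of relying on duckdb resolving the local variable, here is a minimal sketch using an in-memory connection (the view name nyc_trips and the small stand-in DataFrame are mine, just for illustration):

import duckdb
import pandas as pd

nyc = pd.DataFrame({"trip_distance": [2.5, 12.0, 30.1]})  # stand-in for the real CSV

con = duckdb.connect()          # in-memory DuckDB database
con.register("nyc_trips", nyc)  # expose the DataFrame as a view named nyc_trips
result = con.execute("SELECT * FROM nyc_trips WHERE trip_distance > 10").df()
print(result)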

Leo Liu
6

You can use DataFrame.query(condition) to return a subset of the data frame matching condition like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3,3), columns=list('ABC'))
df
   A  B  C
0  0  1  2
1  3  4  5
2  6  7  8

df.query('C < 6')
   A  B  C
0  0  1  2
1  3  4  5


df.query('2*B <= C')
   A  B  C
0  0  1  2


df.query('A % 2 == 0')
   A  B  C
0  0  1  2
2  6  7  8

This is basically the same effect as an SQL statement, except the SELECT * FROM df WHERE is implied.
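
One extra detail worth knowing (my addition, not part of the original answer): query strings can also reference ordinary Python variables with an @ prefix, which keeps comparison values out of the string itself. A small sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('ABC'))

threshold = 6                      # ordinary Python variable
print(df.query('C < @threshold'))  # @name pulls the variable into the expression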

user1717828
  • You can also add `df.eval`: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.eval.html – BENY Aug 24 '17 at 15:55
  • In the 3 examples given above, it's easy to understand the 3 different conditions. With normal subsetting, the same can be achieved with very similar code: df[df['C'] < 6], df[2*df['B'] <= df['C']] and df[df['A'] % 2 == 0]. I don't see why one would want to use SQL anyway. Pandas even has methods like 'groupby' that can be applied to a dataframe to achieve the same as what e.g. a GROUP BY query would return – Dobedani Jan 09 '23 at 13:04
  • @Dobedani -- yup, agree that's the preferred syntax. The reason I go with `df.query()` is in cases where I don't want to rewrite the dataframe name. This is common during exploratory data analysis when I might have lots of dataframes I want to run the same stuff on, and sticking to method chaining like `.query()` lets me simply swap the variable at the beginning of the chain. – user1717828 Jan 09 '23 at 15:43
5

There's actually a new package that I just released, called dataframe_sql. It gives you the ability to query pandas DataFrames using SQL, just as you want. You can find the package here.

Zach Brookler
  • I was not able to find a conda distribution for it, and our support team can only use conda tools to install packages, not pip. Do you have any plans to release one? – Sajal May 26 '21 at 18:50
  • @Sajal I can certainly look into it, would you mind raising this as an issue on GitHub? – Zach Brookler May 31 '21 at 16:55
2

I think a better solution than pandasql would be duckdb. The way it handles mapping a table name to a DataFrame object is a little cleaner, in my opinion; see the sketch below. I have not evaluated performance, though.
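
To illustrate that name mapping (this snippet is mine, not the answerer's): duckdb can resolve a DataFrame that is in scope simply by its Python variable name inside the SQL text:

import duckdb
import pandas as pd

orders = pd.DataFrame({"id": [1, 2, 3], "total": [9.99, 42.00, 3.50]})

# The table name in the SQL is just the local variable name "orders".
big_orders = duckdb.query("SELECT id, total FROM orders WHERE total > 10").df()
print(big_orders)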

1

Or, you can use the tools that do what they do best:

  1. Install PostgreSQL

  2. Connect to the database:

from sqlalchemy import create_engine
import urllib.parse

# dialect, host, port and dbname are your own connection settings;
# user_uenc and pw_uenc are the username and password, URL-encoded
# (e.g. with urllib.parse.quote_plus) so special characters are safe.
engconnect = "{0}://{1}:{2}@{3}:{4}/{5}".format(dialect, user_uenc, pw_uenc, host, port, dbname)
dbengine = create_engine(engconnect)
database = dbengine.connect()

  3. Dump the dataframe into Postgres:

df.to_sql('mytablename', database, if_exists='replace')

  4. Write your query with all the SQL nesting your brain can handle:

myquery = "select distinct * from mytablename"

  5. Create a dataframe by running the query:

newdf = pd.read_sql(myquery, database)
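
For readers who want to try this round trip without installing a Postgres server, here is a minimal self-contained sketch of the same to_sql / read_sql flow against an in-memory SQLite engine (the table name and data are just examples; swap in the Postgres engine from step 2 for the real thing):

import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame({"id": [1, 1, 2], "city": ["Oslo", "Oslo", "Lima"]})

engine = create_engine("sqlite://")  # in-memory SQLite stands in for Postgres here
df.to_sql("mytablename", engine, if_exists="replace", index=False)

newdf = pd.read_sql("select distinct * from mytablename", engine)
print(newdf)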

alphacrash
1

There is also FugueSQL:

# pip install fugue[sql]
import pandas as pd
from fugue_sql import fsql

comics_df = pd.DataFrame({'book': ['Secret Wars 8',
                                   'Tomb of Dracula 10',
                                   'Amazing Spider-Man 252',
                                   'New Mutants 98',
                                   'Eternals 1',
                                   'Amazing Spider-Man 300',
                                   'Department of Truth 1'],
                          'publisher': ['Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'Image'],
                          'grade': [9.6, 5.0, 7.5, 8.0, 9.2, 6.5, 9.8],
                          'value': [400, 2500, 300, 600, 400, 750, 175]})

# which of my books are graded above 8.0?
query = """
SELECT book, publisher, grade, value FROM comics_df
WHERE grade > 8.0
PRINT
"""

fsql(query).run()

Output

PandasDataFrame
book:str                                                      |publisher:str|grade:double|value:long
--------------------------------------------------------------+-------------+------------+----------
Secret Wars 8                                                 |Marvel       |9.6         |400       
Eternals 1                                                    |Marvel       |9.2         |400       
Department of Truth 1                                         |Image        |9.8         |175       
Total count: 3

References

https://fugue-tutorials.readthedocs.io/tutorials/beginner/beginner_sql.html

https://www.kdnuggets.com/2021/10/query-pandas-dataframes-sql.html

SchemeSonic
0

Another solution is RBQL, which provides an SQL-like query language that allows using Python expressions inside SELECT and WHERE statements. It also provides a convenient %rbql magic command for use in Jupyter/IPython:

# Get some test data:
!pip install vega_datasets
from vega_datasets import data
my_cars_df = data.cars()
# Install and use RBQL:
!pip install rbql
%load_ext rbql
%rbql SELECT * FROM my_cars_df WHERE a.Horsepower > 100 ORDER BY a.Weight_in_lbs DESC

In this example, my_cars_df is a pandas DataFrame.

You can try it in this demo Google Colab notebook.

mechatroner