
I'm currently switching from R to Python (Anaconda/Spyder, Python 3) for data analysis. In R I relied heavily on the sqldf package: since I'm comfortable with SQL queries, I didn't want to re-learn data.table syntax, and sqldf never gave me performance issues.

Now, in Python, I tried pandasql, and a simple SELECT * FROM table LIMIT 1 takes forever on 193k rows and 19 columns.

I also tried pysqldf, but I get an error saying that the table doesn't exist, even though it does.

# -*- coding: utf-8 -*-

import pandas as pd
import pandasql
import pysqldf

# Data loading
orders = pd.read_csv('data/orders.csv', sep=';')

###### PANDASQL ######
test = pandasql.sqldf("SELECT orders_id FROM orders LIMIT 1;", globals())
# Takes several minutes and uses a lot of RAM

test = pandasql.sqldf("SELECT orders_id FROM orders LIMIT 1;", locals())
# Takes several minutes and uses a lot of RAM

###### PYSQLDF ######
sqldf = pysqldf.SQLDF(globals())
test = sqldf.execute("SELECT * FROM orders LIMIT 1;")
# Raises OperationalError (traceback below)

Traceback (most recent call last):

  File "<ipython-input-12-30b645117dc4>", line 1, in <module>
    test = sqldf.execute("SELECT  * from orders LIMIT 1;")

  File "C:\Users\p.stepniewski\AppData\Local\Continuum\anaconda3\lib\site-packages\pysqldf\sqldf.py", line 76, in execute
    self._del_table(tables)

  File "C:\Users\p.stepniewski\AppData\Local\Continuum\anaconda3\lib\site-packages\pysqldf\sqldf.py", line 117, in _del_table
    self.conn.execute("drop table " + tablename)

OperationalError: no such table: orders

Am I missing something? I'd prefer a pandasql/pysqldf answer over a "learn pandas querying syntax" one.

sqldf in R handled complex queries on tables of up to 10 million rows, on an i7 laptop with 12 GB of RAM.

Thanks!

nick
  • I have tried the above with a reasonably big file (10^5 lines and 5 columns) and I did not see any particular performance problem. How big is the file you are trying to read in? – gented Jul 30 '18 at 09:53
  • Hi, thanks for your comment. The orders.csv file is 17 MB. – Philippe Stepniewski Jul 30 '18 at 10:11
  • It should still be fine: are you sure that the RAM consumption doesn't come from elsewhere? Just for a test, try to query the data frame using pandas standard syntax: does it make any difference in performance on your computer? – gented Jul 30 '18 at 10:16
  • Just tried launching from Cygwin (not in Spyder). No change. Tried test = orders.iloc[[1]] => worked perfectly. Maybe something's wrong with my Anaconda pandasql install? – Philippe Stepniewski Jul 30 '18 at 11:59
  • Having a look around in related questions it seems that a few users are seeing the same problem: I wouldn't really know what the root cause is, though. – gented Jul 30 '18 at 12:24
  • OK, just found the solution. Entirely dropped the Anaconda install. Cleaned related folders. Installed Python 3.6 from scratch with pip. Then pip installed pandas and pandasql. Launched my script. The whole script executed in less than a second. – Philippe Stepniewski Jul 30 '18 at 12:49
  • pandas.read_sql(query, con = db) works super nicely for me. – Ankur Sinha Jul 30 '18 at 12:54
  • pandas.read_sql() is for querying a remote database. It's not the same thing as sqldf, which emulates SQLite queries on a local pandas DataFrame. – Philippe Stepniewski Jul 30 '18 at 12:58
  • Regardless of your proficiency with SQL, an SQL query is (almost) a "homeomorphic" transcription of a set definition, so I find it very natural. I am here because I ran an SQL query computing counts with GROUP BY, and it took much longer than using the value_counts method (see the sketch below). Pity. – VictorZurkowski May 27 '22 at 02:05
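To illustrate that last comment, here is a minimal sketch of the two approaches; the DataFrame and column name are invented for the example, and the speed gap will depend on data size and machine:

import pandas as pd
import pandasql

# Toy data standing in for a large DataFrame
df = pd.DataFrame({'category': ['a', 'b', 'a', 'c', 'a']})

# SQL-style counts via pandasql (copies the DataFrame into an
# in-memory SQLite database before running the query)
counts_sql = pandasql.sqldf(
    "SELECT category, COUNT(*) AS n FROM df GROUP BY category;", locals())

# Native pandas equivalent, typically much faster on large frames
counts_pd = df['category'].value_counts()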

2 Answers


OK, just found the solution:

  • Entirely dropped the Anaconda install.
  • Cleaned related folders.
  • Installed Python 3.6 from scratch with pip.
  • Then pip installed pandas and pandasql.
  • Launched my script; the pandasql queries executed in less than a second (see the timing sketch below).
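For completeness, a quick way to check the fix is to time the query on the same orders DataFrame as in the question; a minimal sketch, assuming the same orders.csv, with the sub-second figure being what I observed rather than a guarantee:

import time
import pandas as pd
import pandasql

orders = pd.read_csv('data/orders.csv', sep=';')

start = time.time()
test = pandasql.sqldf("SELECT orders_id FROM orders LIMIT 1;", globals())
print(test)
print("Query took %.3f s" % (time.time() - start))  # sub-second after the clean pip install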

pandasql worked fine on a small table, but it was useless on a big one (much slower than querying the actual database, even though we had already done all the JOINs and WHEREs). After seeing this answer and googling a bit more, I found that duckdb can also run against a pandas DataFrame. From their blog:

# to install: pip install duckdb
import pandas as pd
import duckdb

mydf = pd.DataFrame({'a': [1, 2, 3]})
print(duckdb.query("SELECT SUM(a) FROM mydf").to_df())

And it's super-duper fast on my laptop and uses all cores. Still uses a lot of RAM but I think that's Pandas.

Link: https://duckdb.org/2021/05/14/sql-on-pandas.html
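To tie this back to the question, the same approach applied to the orders DataFrame would look like the sketch below (assuming the orders.csv from the question):

import pandas as pd
import duckdb

orders = pd.read_csv('data/orders.csv', sep=';')

# duckdb scans the local DataFrame by its variable name,
# without copying it into SQLite first
print(duckdb.query("SELECT orders_id FROM orders LIMIT 1").to_df())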

grofte