
The documentation on DataFrame.query() is very terse: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html . I was also unable to find examples of projections via web search.

So I tried simply providing the column names: that gave a syntax error. Likewise for typing select and then the column names. So, how do I do this?

wjandrea
WestCoastProjects

5 Answers

9

After playing around with this for a while and reading through the source code for DataFrame.query, I can't figure out a way to do it.

If it's not impossible, apparently it's at least strongly discouraged. When this question came up on GitHub, prolific Pandas dev/maintainer jreback suggested using df.eval() for selecting columns and df.query() for filtering on rows.


UPDATE:

javadba points out that the return value of eval is not a DataFrame. For example, to flesh out jreback's example a bit more:

df.eval('A')

returns a Pandas Series, but

df.eval(['A', 'B'])

does not return a DataFrame; it returns a list (of Pandas Series).

So it seems that, ultimately, the best way to keep the flexibility to filter on both rows and columns is to use iloc/loc, e.g.

df.loc[0:4, ['A', 'C']]

Output:

          A         C
0 -0.497163 -0.046484
1  1.331614  0.741711
2  1.046903 -2.511548
3  0.314644 -0.526187
4 -0.061883 -0.615978
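
If you want this as a single chained expression, one pattern (a sketch on a toy frame with columns A through D) is to query for rows and then take a column subset with a label list:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=list('ABCD'))

# filter rows with query, then project columns with a label list
subset = df.query('A > 0')[['A', 'C']]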
Max Power
  • but `eval` does not return a DataFrame: `ret : ndarray, scalar, or pandas object`. In any case, upvoted for the effort. – WestCoastProjects Jun 18 '17 at 01:52
  • hm good point. just tried `iris2 = iris.eval(['sepal_length', 'species'])` but the `iris2` I got back was a `list`...with each element a `Pandas Series`. Weird. – Max Power Jun 18 '17 at 01:56
  • Looks like we're back to `iloc/loc`. maybe play with that a bit and i can award here. – WestCoastProjects Jun 18 '17 at 01:58
  • updated my answer. don't consider this a particularly satisfying answer, but think it's the answer. – Max Power Jun 18 '17 at 02:06
  • Sorry, I'm going to renege on the "accept" deal: I found `pandasql`, which does what is requested: see my answer below. Hold on, I'm thinking through this a bit more. Yes, it *is* what I want, just not in the expected format. The inputs `meat` and `births` are `DataFrame`s. So instead of invoking methods *already* defined on the `df`, it is using a method to generate a new one using one or more `df`s as input. – WestCoastProjects Jun 18 '17 at 02:30
  • haha you're welcome to withdraw the 'accept' but use pandasql at your peril. lots of bugs. every few months I decide to try it again when I want to write sql, and it never goes well. – Max Power Jun 18 '17 at 02:33
  • i see. ok i'll put back the accept here: since that's useful info. But will keep my answer below for the curious. Does the `pandasql` handle basic projection, filtering and aggregation ok? i.e. does it mostly fail on `join`s? – WestCoastProjects Jun 18 '17 at 02:34
  • `pandasql` is *really* doing the job here. If/when I were to run into bugs i'm tempted to actually **fix** them since the tool is so useful. – WestCoastProjects Jun 18 '17 at 04:40
  • (Many years later): _pandasql_ is however ***very*** slow – WestCoastProjects Jul 16 '21 at 14:30
  • Hey! I remember this discussion well. What I've ended up doing for this use case, if I have a pandas project I want to occasionally run SQL queries in, is to create temporary in-memory SQLite databases, which are easy to load a DF into and export to a DF via pandas, and execute the SQL queries in SQLite (see the sketch after this thread). Would this work for you? I'll also note that in the three years since we had this conversation, there haven't been any commits to pandasql, which seems unmaintained at this point: https://github.com/yhat/pandasql/ – Max Power Jul 16 '21 at 23:10
  • Yea, it was essentially unmaintained since 2017. SQLite is a very limited dialect, but for basic things this can work – WestCoastProjects Jul 17 '21 at 02:28
  • do you have a preferred means of using sql queries on pandas dataframes these days? – Max Power Jul 17 '21 at 02:57
  • I use it when performance is not important and prefer to use the postgres backend since it supports analytics/windowing functions. Same as years back actually. The other way I do more often is using Spark SQL – WestCoastProjects Jul 17 '21 at 04:36
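
A minimal sketch of the in-memory SQLite round trip described in the comments above (only pandas and the standard library; the frame and query here are illustrative):

import sqlite3

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})

# load the frame into a throwaway in-memory SQLite database
con = sqlite3.connect(':memory:')
df.to_sql('df', con, index=False)

# run the SQL there, then pull the result back out as a DataFrame
out = pd.read_sql_query('SELECT A, B FROM df WHERE A > 1', con)
con.close()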
4

DataFrame.query is more like the WHERE clause of a SQL statement than the SELECT part.

import pandas as pd
import numpy as np
np.random.seed(123)
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

To select a column or columns you can use the following:

df['A'] or df.loc[:,'A']

or

df[['A','B']] or df.loc[:,['A','B']]

To use the .query method you do something like

df.query('A > B')

which would return all the rows where the value in column A is greater than the value in column B:

                   A         B         C         D
2000-01-03  1.265936 -0.866740 -0.678886 -0.094709
2000-01-04  1.491390 -0.638902 -0.443982 -0.434351
2000-01-05  2.205930  2.186786  1.004054  0.386186
2000-01-08 -0.140069 -0.861755 -0.255619 -2.798589

which is more readable, in my opinion, than boolean index selection with

df[df['A'] > df['B']]
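
Combining the two, a SQL-style "SELECT A, B ... WHERE A > B" can be sketched against the df built above:

# row filter via query, then column projection via a label list
df.query('A > B')[['A', 'B']]

# the same result with .loc: row mask and column list in one step
df.loc[df['A'] > df['B'], ['A', 'B']]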
Scott Boston
3

How about

df_new = df.query('col1 == 1 & col2 == "x"')[['col1', 'col3']]

This filters rows where col1 equals 1 and col2 equals "x", and returns only columns col1 and col3.

Note that you do need to filter on rows here; otherwise this approach doesn't work. For filtering columns only, it is better to use .loc or .iloc, as sketched below.
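
For example, a column-only projection (a sketch assuming the same col1/col3 column names) looks like:

# column-only projection: no row filter required
df[['col1', 'col3']]
df.loc[:, ['col1', 'col3']]
df.iloc[:, [0, 2]]  # by position, assuming col1 and col3 are the first and third columns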

Fabich
gonkan
2

pandasql

https://pypi.python.org/pypi/pandasql/0.1.0

Here is an example from the following blog: http://blog.yhat.com/posts/pandasql-sql-for-pandas-dataframes.html . The inputs are two DataFrames, meat and births, and this approach gives the projections, filtering, aggregation, and sorting expected from SQL.

@MaxPower did mention this package is buggy, so let's see. At least the code from the blog, shown below, works fine.

from pandasql import sqldf, load_meat, load_births

pysqldf = lambda q: sqldf(q, globals())

q  = """
SELECT
  m.date
  , m.beef
  , b.births
FROM
  meat m
LEFT JOIN
  births b
    ON m.date = b.date
WHERE
    m.date > '1974-12-31';
"""

meat = load_meat()
births = load_births()

df = pysqldf(q)

The output is a pandas DataFrame as desired.

It is working great for my particular use case (evaluating US crime data):

# scols (a comma-separated string of column names) and p (a print helper)
# are defined elsewhere in the script
odf = pysqldf("select %s from df where sweapons > 10 order by sweapons desc limit 10" % scols)
p('odf\n', odf)

odf
   SMURDER  SRAPE  SROBBERY  SAGASSLT  SOTHASLT  SVANDLSM  SWEAPONS
0        0      0         0         1         1        10        54
1        0      0         0         0         1         0        52
2        0      0         0         0         1         0        46
3        0      0         0         0         1         0        43
4        0      0         0         0         1         0        33
5        1      0         2        16        28         4        32
6        0      0         0         7        17         4        30
7        0      0         0         0         1         0        29
8        0      0         0         7        16         3        29
9        0      0         0         1         0         5        28

Update: I have done a bunch of stuff with pandasql now (calculated fields, limits, aliases, cascaded DataFrames). It is just so productive.
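
For instance, a calculated field with an alias and a LIMIT (a sketch reusing the pysqldf helper above; beef and veal are columns in the blog's meat dataset):

q2 = """
SELECT
  date
  , beef + veal AS red_meat
FROM meat
LIMIT 5;
"""

red = pysqldf(q2)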

Another update (3 years later): this works, but be warned that it is very slow (seconds vs. milliseconds).

WestCoastProjects
  • Glad this is working so well for your case. I have been frustrated by it a couple times, but maybe that's on me for not contributing to fix it. Here's the last bug I ran into - [select from multiple tables](https://github.com/yhat/pandasql/issues/46) doesn't work. Which is a shame because that's the kind of operation which reads so much nicer in SQL than in base pandas. I also worry that since this issue has been open for ~18 months, and has no person or even labels assigned to it, the library is probably not well maintained. – Max Power Jun 18 '17 at 13:18
  • @MaxPower I am now using `postgresql` dialect with `pandasql` now - as opposed to the default and limited `sqlite`. It is working better so far. – WestCoastProjects Jul 26 '18 at 15:23
1

Just a simpler example solution (using get):

My goal:

I want the lat and lon columns out of the result of the query.

My table details:

df_city.columns

Index(['name', 'city_id', 'lat', 'lon', 'CountryName', 'ContinentName'], dtype='object')

# All columns
city_continent = df_city.get(df_city['ContinentName']=='Oceania')

# Only lat and lon
city_continent[['lat', 'lon']]
              lat         lon
113883    -19.12753   -169.84623
113884    -19.11667   -169.90000
113885    -19.10000   -169.91667
113886    -46.33333   168.85000
113887    -46.36667   168.55000
...   ... ...
347956    -23.14083   113.77630
347957    -31.48023   131.84242
347958    -28.29967   153.30142
347959    -35.60358   138.10548
347960    -35.02852   117.83416
3712 rows × 2 columns
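
For comparison, the same projection can be written with query itself by chaining a column selection onto the filtered frame (a sketch using the df_city frame above):

# filter rows with query, then project the two columns of interest
df_city.query('ContinentName == "Oceania"')[['lat', 'lon']]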
Arindam Roychowdhury