9

Lets suppose I create a dataframe with columns and query i.e

pd.DataFrame([[1,2],[3,4],[5,6]],columns=['a','b']).query('a>1')

This will give me

   a  b
1  3  4
2  5  6

But when dataframe values are too large and I don't have column names, how can I query a column by its index?

I tried querying by passing a number, but it's not the way of doing it.

pd.DataFrame([[1,2],[3,4],[5,6]]).query('0>1') # This is what I tried. 

How to denote 0 is the column name in query?

Expected Output:

   0  1
1  3  4
2  5  6
cs95
  • 379,657
  • 97
  • 704
  • 746
Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
  • `df.query("index > 1")`? Or are you referring to the columns? Sorry, what's your expected output? – cs95 Jan 04 '18 at 06:27
  • @cᴏʟᴅsᴘᴇᴇᴅ no. I want to select the data from dataframe where column 1 has values > 1. – Bharath M Shetty Jan 04 '18 at 06:29
  • Okay. Makes sense to me. `query` is a lot like using the dot notation for dataframe columns. You could do `df.A`, but not `df.1`. Similarly, you can use query for such valid column names, but I don't think numbers fly. – cs95 Jan 04 '18 at 06:30
  • Oh `query` can only be used only for column names which can be expressed in dot notation? Anyway I can do the operation on the fly without storing the dataframe in a variable? – Bharath M Shetty Jan 04 '18 at 06:32
  • This is my belief. Also, I don't think doing something like `a = 1; df.query("@a > 1")` works, because eval will resolve a to `1 > 1` which still wouldn't work. – cs95 Jan 04 '18 at 06:35
  • For the index there is a way to represent the levels like `ilevel_0`, wish there was a similar name for repesenting columns. – Bharath M Shetty Jan 04 '18 at 06:39
  • @ayhan IIRC they added support for `level_*` (multi-index queries) in 0.22! – cs95 Jan 04 '18 at 06:41
  • 2
    @cᴏʟᴅsᴘᴇᴇᴅ - No, it is supported at least in `0.13.version+` - check [this](http://pandas.pydata.org/pandas-docs/version/0.13.0/indexing.html#multiindex-query-syntax) - `df.query('ilevel_0 == "red"')` – jezrael Jan 04 '18 at 06:44
  • I think somebody will update the query code, let the question be open. Lets put the answer once the issue is resolved. Thankyou @ayhan for bringing it up. – Bharath M Shetty Jan 04 '18 at 06:44

3 Answers3

7

Since the query is under development one possible solution is creating a monkey patch for pd.DataFrame to evaluate self i.e :

def query_cols(self,expr):
    if 'self' in expr:
        return self[eval(expr)]
    else:
        return self.query(expr)

pd.DataFrame.query_cols = query_cols

pd.DataFrame([[1,2],[3,4],[5,6]]).query_cols('self[1] > 3')

   0  1
1  3  4
2  5  6

pd.DataFrame([[1,2],[3,4],[5,6]]).query_cols('self[1] == 4')

   0  1
1  3  4

pd.DataFrame([[1,2],[3,4],[5,6]],columns=['a','b']).query_cols('a > 3')

   a  b
2  5  6

This is a simple trick and doesn't suit all the cases, answer will be updated when the issue with query is resolved.

Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
3

Solution

An option without any monkey patching is to use @ to define a variable and do this as follows.

# If you are fond of one-liners
df = pd.DataFrame([[1,2],[3,4],[5,6]]); df.query('@df[0] > 1')

# Otherwise this is the same as
df = pd.DataFrame([[1,2],[3,4],[5,6]])
df.query('@df[0] > 1') # @df refers to the variable df

Output:

   0  1
1  3  4
2  5  6

References

You can find more ways of dealing with this here.

CypherX
  • 7,019
  • 3
  • 25
  • 37
  • @jezrael See this method using `@` to use query on pandas dataframes with numeric column names. – CypherX Jul 09 '21 at 03:54
  • @Bharath See this method using `@` to use query on pandas dataframes with numeric column names. – CypherX Jul 09 '21 at 03:55
  • If I can store the data frame in the variable why would I even want to use query? I would simply do `df[df[0]>1]`. The question was about using `query` during the initialization. – Bharath M Shetty Jul 09 '21 at 04:36
  • @Bharath I see. But since `@` option was not mentioned, I thought of adding that here. Perhaps it will be relevant to someone else, who's probably looking for `pandas + query + numeric column name` as the search query. – CypherX Jul 09 '21 at 05:23
0

You can create an intermediate column with assign + a lambda function:

pd.DataFrame([[1, 2], [3, 4], [5, 6]]).assign(col=lambda x: x[0]).query("col>1")

00schneider
  • 698
  • 9
  • 21