2

How do I refer to an unnamed DataFrame column in a query string when using pandas.DataFrame.query? I know I can to column names that are not valid Python variable names by surrounding them in backticks. However, that does not address unnamed columns.

For example, I would like to query for all rows in a DataFrame where an unnamed column contains a value greater than 0.5.

My code starts like:

import pandas as pd
import numpy as np
array=np.random.rand(10,3)
df=pd.DataFrame(array)

so far so good but then when I try to use pandas.DataFrame.query what query string should I use to find rows where the value in the second column (which happens to unnamed) are greater than 0.5 ?

The closest thing I can think of is

df.query('columns[1]>0.5')

which is flat out wrong because columns columns[1] returns the column number, 1, and does not reference the unnamed column.

I have looked at the Pandas documentation including

Any ideas?

CypherX
  • 7,019
  • 3
  • 25
  • 37
rossmpersonal
  • 49
  • 1
  • 5
  • So pandas doesn't have "unnamed" columns when constructed this way. They will be initialized to defaults -> `[0, 1, 2]`. Then it becomes similar to [How to query a numerical column name in pandas?](https://stackoverflow.com/q/48089786/15497888) – Henry Ecker Jul 09 '21 at 02:45

2 Answers2

0

Solution

Summary: Best options are given below. See further down for all other options.

  • df.query('@df[1] > 0.5')
  • df[df[1] > 0.5]

Unnammed columns in pandas are automatically named as 0, 1, 2, ..., where these are numbers and not strings.

The following shows you mainly three ways to achieve what you are looking for.

  • Option-1: Avoid renaming columns.

    • Option-1.1: using df.query('@df[1] > 0.5'). Here we use @df to specify that df is a variable.
    • Option-1.2: Here we use the other option df[df[1] > 0.5].
  • Option-2.x: Rename columns of the dataframe df by providing a dict: {0: 'A', 1: 'B', 2: 'C'}.

    • You can use df.query() in this case.
  • Option-3: Rename the columns of df using a dict-comprehension as C#, where # stands for the column number.

    • You can use df.query() in this case.
## Option-1: without renaming
#  Option-1.1: with query
df.query('@df[1] > 0.5')
#  Option-1.2: without using query
df[df[1] > 0.5]

## Option-2: rename columns (using a mapping provided manually)
#            columns = {0: 'A', 1: 'B', 2: 'C'}
df = pd.DataFrame(arr).rename(columns={0: 'A', 1: 'B', 2: 'C'})

#  Option-2.1
df[df['B'] > 0.5]

#  Option-2.2
df[df.B > 0.5]

#  Option-2.2
df.query('B > 0.5')

## Option-3: rename dynamically
df = pd.DataFrame(arr)
df = df.rename(columns=dict((x, 'C'+str(x)) for x in df.columns))
df.query('C1 > 0.5')

Output:

          0         1         2
3  0.413839  0.889178  0.564845
5  0.802746  0.941901  0.564068
6  0.904837  0.716764  0.151075
8  0.788026  0.749503  0.960260

Dummy Data

import pandas as pd
import numpy as np

arr = np.random.rand(10, 3)
df = pd.DataFrame(arr)

References

CypherX
  • 7,019
  • 3
  • 25
  • 37
  • @rossmpersonal I hope this answers your question. – CypherX Jul 09 '21 at 03:44
  • 1
    @henry-ecker This could be done using `@` for using variables inside `.query()`. – CypherX Jul 09 '21 at 04:10
  • Right, Using [df.query](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html), You can refer to variables in the environment by prefixing them with an ‘@’ character. – Karn Kumar Jul 09 '21 at 04:35
0

There are ways to achieve what you are looking for:

Dummy DataFrame:

>>> df
          0         1         2
0  0.210862  0.894414  0.713472
1  0.804793  0.656390  0.842293
2  0.617104  0.763162  0.697050
3  0.158506  0.190683  0.740970
4  0.380092  0.984326  0.138277
5  0.665413  0.445192  0.525754
6  0.274770  0.870642  0.987045
7  0.619918  0.196403  0.221361
8  0.642992  0.572529  0.893655
9  0.101074  0.871377  0.130874

Solution:

In other way around if you work with unnamed columns where you can look for all the rows across DataFrame column as follows, but Keep in mind as it will keep non matching values as NaN while displaying all the matching ones.

>>> df[ df.iloc[:,df.columns]> 0.5 ]
          0         1         2
0       NaN  0.894414  0.713472
1  0.804793  0.656390  0.842293
2  0.617104  0.763162  0.697050
3       NaN       NaN  0.740970
4       NaN  0.984326       NaN
5  0.665413       NaN  0.525754
6       NaN  0.870642  0.987045
7  0.619918       NaN       NaN
8  0.642992  0.572529  0.893655
9       NaN  0.871377       NaN
Karn Kumar
  • 8,518
  • 3
  • 27
  • 53