
I want to grab all of the non-NaN IDs from the SWISS-PROT-ID field, restricted to rows where the idx_filter field is True. There are a million other ways I could do this, but I keep running into this problem when writing more complicated queries against my dataframes.

Is there a way to query dataframes that have these types of names as columns?

import numpy as np
import pandas as pd

data = {'BLATTNER-ID': {'G0-16600': 'b4714', 'G6866': 'b1615', 'G0-10751': 'b4712', 'G0-10752': 'b4713', 'G6335': 'b0608', 'G6177': 'b0307', 'G0-8892': 'b4599', 'G0-10596': 'b4605', 'EG12861': 'b1915', 'EG12303': 'b1100'}, 'NAME': {'G0-16600': 'ralA', 'G6866': 'uidC', 'G0-10751': 'agrA', 'G0-10752': 'agrB', 'G6335': 'ybdR', 'G6177': 'ykgF', 'G0-8892': 'yneM', 'G0-10596': 'ypaB', 'EG12861': 'yecF', 'EG12303': 'ycfH'}, 'SWISS-PROT-ID': {'G0-16600': np.nan, 'G6866': 'Q47706', 'G0-10751': np.nan, 'G0-10752': np.nan, 'G6335': 'P77316', 'G6177': 'P77536', 'G0-8892': 'A5A616', 'G0-10596': np.nan, 'EG12861': 'P0AD07', 'EG12303': 'P0AFQ7'}, 'idx_filter': {'G0-16600': False, 'G6866': True, 'G0-10751': False, 'G0-10752': False, 'G6335': True, 'G6177': False, 'G0-8892': False, 'G0-10596': True, 'EG12861': False, 'EG12303': False}}
df = pd.DataFrame(data)

# 1st tried this
df.query("SWISS-PROT-ID and idx_filter").index
# UndefinedVariableError: name 'SWISS' is not defined
# 2nd tried escape characters
df.query("SWISS\-PROT\-ID and idx_filter").index
# UndefinedVariableError: name 'SWISS' is not defined

# Expecting
# ["G6335","G6866"]

Versions:

pandas: 0.22.0 # I don't want to upgrade because there is a serious bug in 0.23.0 that breaks one of my programs
python: 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 12:04:33) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
O.rka
  • `df.columns = [i.replace('-', '_') for i in df.columns]` would this be acceptable? I don't believe there is really a workaround here. However, there is the additional issue that your query won't work. – user3483203 Jun 07 '18 at 17:15
  • Yea that would definitely be one way but I was wondering if there was some weird syntactical magic in query strings that could handle this? – O.rka Jun 07 '18 at 17:16

2 Answers


From pandas 0.25 onward you can escape column names with backticks, so you can do

df.query("`SWISS-PROT-ID` and idx_filter").index

blatantly stolen from: Pandas query function not working with spaces in column names
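As a runnable sketch (requires pandas ≥ 0.25; `engine='python'` is passed so the `.notna()` method call is allowed inside the query string, and `.notna()` is added here to drop the NaN rows explicitly rather than relying on the truthiness of a string column):

```python
import numpy as np
import pandas as pd

# Small subset of the question's data, enough to show the behavior
df = pd.DataFrame({
    'SWISS-PROT-ID': {'G6866': 'Q47706', 'G6335': 'P77316',
                      'G0-10596': np.nan, 'G0-16600': np.nan},
    'idx_filter': {'G6866': True, 'G6335': True,
                   'G0-10596': True, 'G0-16600': False},
})

# Backticks let the hyphenated name through the parser
hits = df.query("`SWISS-PROT-ID`.notna() and idx_filter", engine='python').index
print(sorted(hits))  # ['G6335', 'G6866']
```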

Roodseuk
  • This solved my particular issue, which has a column named `"1_1"`, interpreted as `11` without the backticks. – bli Oct 28 '22 at 12:37

Unfortunately, query uses pandas.eval, which supports arithmetic operations, so the hyphen is parsed as a minus operator and can't appear in a column name. There isn't a workaround for your desired outcome in pandas 0.22.

You have a couple options. First, you can simply replace the dashes with something else, like an underscore:

df.columns = [i.replace('-', '_') for i in df.columns]
df.query('SWISS_PROT_ID.notnull() & idx_filter')

Or you could index directly on your condition with a boolean mask:

df.loc[df['SWISS-PROT-ID'].notnull() & df.idx_filter]

Both produce (although one has renamed columns):

      BLATTNER-ID  NAME SWISS-PROT-ID  idx_filter
G6335       b0608  ybdR        P77316        True
G6866       b1615  uidC        Q47706        True
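If only the index is wanted (as in the question's expected output), the same boolean mask can be applied directly to `df.index` — a sketch using a small subset of the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'SWISS-PROT-ID': {'G6866': 'Q47706', 'G6335': 'P77316',
                      'G0-10596': np.nan, 'EG12861': 'P0AD07'},
    'idx_filter': {'G6866': True, 'G6335': True,
                   'G0-10596': True, 'EG12861': False},
})

# notnull() keeps rows with an ID; & combines it with the boolean filter column
mask = df['SWISS-PROT-ID'].notnull() & df['idx_filter']
ids = df.index[mask].tolist()
print(sorted(ids))  # ['G6335', 'G6866']
```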
user3483203
  • Thanks, I like the 2nd option! Also, I have been looking for something analogous to `.notnull()` for a while. – O.rka Jun 07 '18 at 17:47