4

I have a pandas df that has one of the columns as a tuple. I would like to use query to subset the df using the first entry of the tuple. What's the best way to do this? I'm on pandas .23.3, Python 3.6.6

MWE:

import pandas as pd
df = pd.DataFrame({"val": list(zip(range(9), range(9)[::-1]))})
df.query("val[0] > 3") #this line does not work!

I know that I can split the column up and then subset but I don't want to split it up.

update: for anyone who decides to go the route of unpacking the tuple and having two separate columns, here is a simple way to do this:

df["a"], df["b"] = list(zip(*df.val.tolist()))
Alex
  • 1,281
  • 1
  • 13
  • 26

2 Answers2

4

I assume your queries are more complicated than "val > 3". This is one easy way to get the first item from your column—with the .str accessor:

df.val.str[0].to_frame().query('val > 3')

   val
4    4
5    5
6    6
7    7
8    8

The reason this works is because .str will work with any object column (this includes columns of lists and tuples), not just strings (strings are considered objects, one of many possible types).
If query is not a necessity, this will be good enough:

v = df.val.str[0]
v[v > 3]

   val
4    4
5    5
6    6
7    7
8    8

There's also

pd.DataFrame({'val' : [v[0] for v in df['val']}).query('val > 3')

   val
4    4
5    5
6    6
7    7
8    8

Which uses a list comprehension to build a new single columned DataFrame from scratch. This should be the fastest, but I would prefer one of the approaches above.

cs95
  • 379,657
  • 97
  • 704
  • 746
  • 1
    they are more complicated but this is interesting: how come `df.val.str[0]` returns int type rather than string? – Alex Aug 04 '18 at 22:12
  • @Alex `df.str` works on any object column (including lists and tuples), not just string (strings are considered objects too). – cs95 Aug 04 '18 at 22:12
  • oh i see so `.str` will simply return a series of objects in the `val` column and then you can you any method available on that object? (ie `__getitem__` in this case) – Alex Aug 04 '18 at 22:14
  • @Alex it returns a series, so all series methods are applicable here. For performance, you may want to do an astype(int) conversion because the output of .str[] is an object column. – cs95 Aug 04 '18 at 22:16
  • ok cool, this is good to know. my df obviously contains other columns but i just wanted to filter by the column that contains tuples, hence the reason for doing query. i suppose then i'll just boolean select the df using your method. thanks for the help! – Alex Aug 04 '18 at 22:17
2

What's the best way to do this?

In my opinion, don't work with a series of tuples to begin with. This negates one of the main benefits of Pandas: vectorised computations with NumPy arrays.

Instead, you can split your series of tuples into two series of integers. Then use pd.DataFrame.query as usual:

df = pd.DataFrame(df['val'].values.tolist()).add_prefix('val')

print(df.query('val0 > 3'))

   val0  val1
4     4     4
5     5     3
6     6     2
7     7     1
8     8     0
jpp
  • 159,742
  • 34
  • 281
  • 339
  • 1
    yes, i think you may be right: my tuple contains (year, month) values but i think it may just be best to have them in separate columns – Alex Aug 04 '18 at 22:24
  • just remembered why i never split up year and month: if i want to select all elements less than a particular year and month i can't do "year < my_year & month < my_month", becomes very ugly when cols are split up – Alex Aug 06 '18 at 22:12
  • Well, you can `df.query('year < 2010 & month < 5')` or `df[(df['year'] < 2010) & (df['month'] < 5)]`. – jpp Aug 06 '18 at 22:24
  • no that won't work: say the year is 2010 and month 5, and i want all sample points before. while year < 2010 will be true, the sample point in 2009-12 will fail since 12 > 5 – Alex Aug 06 '18 at 22:30
  • in R i represent year month points as integers of the form YYYYMM. i suppose i can do the same in pandas – Alex Aug 06 '18 at 22:31
  • @Alex, I see what you mean now. There is a solution, which is to use `datetime` [probably give up on `query` at this point], then use either the first or last day of each month. As long as you are consistent in your convention, you can then query like `df.loc[df['datetime'] < '2010-05-01']`, etc. [Or you can try this](https://stackoverflow.com/a/43896119/9209546). – jpp Aug 06 '18 at 22:31