
I have a Series like this after calling groupby('name') and taking mean() of another column:

name
383      3.000000
663      1.000000
726      1.000000
737      9.000000
833      8.166667

Could anyone please show me how to filter out the rows whose mean value is 1.000000? Thank you, I greatly appreciate your help.

– Kiem Nguyen

7 Answers

In [5]:

import pandas as pd

test = {
    383: 3.000000,
    663: 1.000000,
    726: 1.000000,
    737: 9.000000,
    833: 8.166667,
}

s = pd.Series(test)
s = s[s != 1]  # boolean mask: keep only the values not equal to 1
s

Out[5]:
383    3.000000
737    9.000000
833    8.166667
dtype: float64
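
If you are starting from the original DataFrame, the same boolean mask can be applied straight to the groupby result. A minimal sketch, where df, 'name' and 'value' are hypothetical names standing in for the question's data:

import pandas as pd

# hypothetical data shaped like the question's
df = pd.DataFrame({
    'name':  [383, 383, 663, 726, 737, 833],
    'value': [2.0, 4.0, 1.0, 1.0, 9.0, 8.166667],
})

means = df.groupby('name')['value'].mean()
means[means != 1]  # drop groups whose mean is exactly 1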
– Kamil Sindi, Andrew
  • I prefer the answers below because they can be chained (i.e. no need to define `s` and then use it twice in the expression). Only works from pandas 0.18 though. – IanS May 09 '17 at 07:56
  • Also see timing comparisons in piRSquared's [answer](http://stackoverflow.com/a/43401268/5276797). – IanS May 09 '17 at 07:58

From pandas 0.18 onwards, filtering a Series can also be done as below:

import pandas as pd

test = {
    383: 3.000000,
    663: 1.000000,
    726: 1.000000,
    737: 9.000000,
    833: 8.166667,
}

pd.Series(test).where(lambda x: x != 1).dropna()
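
Note that where keeps the original shape and replaces non-matching values with NaN, which is why the trailing dropna is needed; the chained result is the same filtered Series:

383    3.000000
737    9.000000
833    8.166667
dtype: float64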

Check out: http://pandas.pydata.org/pandas-docs/version/0.18.1/whatsnew.html#method-chaininng-improvements

– Leo Ufimtsev, DACW
  • So much nicer with method chaining (and reminds me of Spark). – Dylan Hogg Aug 08 '17 at 09:14
  • True, but Spark does something more intuitive in this case: it simply gets rid of rows that don't match the predicate, meaning there is no need for the `.dropna()` part, which seemed clearly superfluous to me until I read the doc. Got bitten by that :D – Florent Moiny Mar 17 '20 at 13:53

As DACW pointed out, there are method-chaining improvements in pandas 0.18.1 that do what you are looking for very nicely.

Rather than using .where, you can pass your function to either the .loc indexer or the Series indexer [] and avoid the call to .dropna:

import pandas as pd

test = pd.Series({
    383: 3.000000,
    663: 1.000000,
    726: 1.000000,
    737: 9.000000,
    833: 8.166667,
})

test.loc[lambda x: x != 1]

test[lambda x: x != 1]

Similar behavior is supported on the DataFrame and NDFrame classes.
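
For instance, the same callable style carries over to DataFrame indexing through .loc; a minimal sketch, putting the data above in a hypothetical one-column frame named df:

df = test.to_frame(name='x')
df.loc[lambda d: d['x'] != 1]  # callable receives the DataFrame, returns a boolean mask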

– Nate Anderson, Gordon Bean
  • This is my favorite answer, and it also seems to be the fastest without going down to numpy (see timing comparisons). – IanS May 09 '17 at 08:04

A fast way of doing this is to reconstruct using numpy to slice the underlying arrays. See timings below.

mask = s.values != 1  # boolean mask over the underlying numpy array
pd.Series(s.values[mask], s.index[mask])

383    3.000000
737    9.000000
833    8.166667
dtype: float64

naive timing (benchmark plot from the original answer not reproduced)
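
As a rough sketch, a similar comparison can be rerun with timeit; the series length and repeat count here are arbitrary choices:

import timeit
import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(100_000))

def rebuild_with_numpy():
    mask = s.values != 1
    return pd.Series(s.values[mask], s.index[mask])

print(timeit.timeit(rebuild_with_numpy, number=100))  # numpy reconstruction
print(timeit.timeit(lambda: s[s != 1], number=100))   # plain boolean indexing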

– piRSquared

Another way is to first convert to a DataFrame and use the query method (assuming you have numexpr installed):

import pandas as pd

test = {
    383: 3.000000,
    663: 1.000000,
    726: 1.000000,
    737: 9.000000,
    833: 8.166667,
}

s = pd.Series(test)
s.to_frame(name='x').query("x != 1")
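
Note that query returns a DataFrame, so if you want a Series back at the end of the chain you can select the column again:

s.to_frame(name='x').query("x != 1")['x']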
– Kamil Sindi

If you like chained operations, you can also use the compress function:

import pandas as pd

test = pd.Series({
    383: 3.000000,
    663: 1.000000,
    726: 1.000000,
    737: 9.000000,
    833: 8.166667,
})

test.compress(lambda x: x != 1)

# 383    3.000000
# 737    9.000000
# 833    8.166667
# dtype: float64
– Psidom
  • Please note `pandas.Series.compress` is deprecated since version 0.24.0 of `pandas`. – ipj Jan 18 '21 at 17:56
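
On versions where compress has since been removed, Series.pipe gives a similarly chainable spelling of the plain boolean mask; a sketch using the same data:

test.pipe(lambda s: s[s != 1])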

In my case I had a pandas Series where the values are tuples of characters:

Out[67]
0    (H, H, H, H)
1    (H, H, H, T)
2    (H, H, T, H)
3    (H, H, T, T)
4    (H, T, H, H)

Therefore I could use indexing to filter the Series, but to create the Boolean index I needed apply. My condition is "find all tuples which have exactly one 'H'".

series_of_tuples[series_of_tuples.apply(lambda x: x.count('H')==1)]

I admit it is not "chainable" (i.e. notice I repeat series_of_tuples twice; you must store any temporary Series in a variable so you can call apply(...) on it).

There may also be other methods (besides .apply(...)) which can operate elementwise to produce a Boolean index.
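
For example, Series.map applies its callable to each element, so it builds the same Boolean index as the apply version above:

series_of_tuples[series_of_tuples.map(lambda x: x.count('H') == 1)]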

Many other answers (including the accepted answer) use chainable functions like:

  • .compress()
  • .where()
  • .loc[]
  • []

These accept callables (lambdas) which are applied to the Series, not to the individual values in those series!

Therefore my Series of tuples behaved strangely when I tried to use the above condition/callable/lambda with any of the chainable functions, like .loc[]:

series_of_tuples.loc[lambda x: x.count('H')==1]

Produces the error:

KeyError: 'Level H must be same as name (None)'

I was very confused, but it seems the lambda ends up invoking the Series.count method (series_of_tuples.count(...)), which treats 'H' as an index level name rather than a value to count; that is not what I wanted.

I admit that an alternative data structure may be better:

  • A Category datatype?
  • A DataFrame (each element of the tuple becomes a column; see the sketch at the end of this answer)
  • A Series of strings (just concatenate the tuples together):

This creates a Series of strings (i.e. by concatenating each tuple; joining its characters into a single string):

series_of_tuples.apply(''.join)

So I can then use the chainable Series.str.count:

series_of_tuples.apply(''.join).str.count('H')==1
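
For completeness, a sketch of the DataFrame alternative mentioned in the list above, assuming the four-character tuples shown earlier:

import pandas as pd

# each tuple element becomes a column (0..3)
df = pd.DataFrame(series_of_tuples.tolist(), index=series_of_tuples.index)
mask = (df == 'H').sum(axis=1) == 1  # rows whose tuple contains exactly one 'H'
series_of_tuples[mask]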
– MarianD, Nate Anderson