3

I have a Pandas dataframe as below. What I am trying to do is check if a station has variable yyy and any other variable on the same day (as in the case of station1). If this is true I need to delete the whole row containing yyy.

Currently I am doing this using iterrows() and looping to search the days in which this variable appears, changing the variable to something like "delete me", building a new dataframe from this (because pandas doesn't support replacing in place) and filtering the new dataframe to get rid of the unwanted rows. This works now because my dataframes are small, but is not likely to scale.

Question: This seems like a very "non-Pandas" way to do this, is there some other method of deleting out the unwanted variables?

                dateuse         station         variable1
0   2012-08-12 00:00:00        station1               xxx
1   2012-08-12 00:00:00        station1               yyy
2   2012-08-23 00:00:00        station2               aaa
3   2012-08-23 00:00:00        station3               bbb
4   2012-08-25 00:00:00        station4               ccc
5   2012-08-25 00:00:00        station4               ccc
6   2012-08-25 00:00:00        station4               ccc
Community
  • 1
  • 1
RedRaven
  • 725
  • 2
  • 18
  • 33

1 Answers1

4

I might index using a boolean array. We want to delete rows (if I understand what you're after, anyway!) which have yyy and more than one dateuse/station combination.

We can use transform to broadcast the size of each dateuse/station combination up to the length of the dataframe, and then select the rows in groups which have length > 1. Then we can & this with where the yyys are.

>>> multiple = df.groupby(["dateuse", "station"])["variable1"].transform(len) > 1
>>> must_be_isolated = df["variable1"] == "yyy"
>>> df[~(multiple & must_be_isolated)]
               dateuse   station variable1
0  2012-08-12 00:00:00  station1       xxx
2  2012-08-23 00:00:00  station2       aaa
3  2012-08-23 00:00:00  station3       bbb
4  2012-08-25 00:00:00  station4       ccc
5  2012-08-25 00:00:00  station4       ccc
6  2012-08-25 00:00:00  station4       ccc
DSM
  • 342,061
  • 65
  • 592
  • 494
  • This works, but I'm exactly sure why... Especially what does `df[~(multiple & must_be_isolated)]` do? – RedRaven Nov 07 '13 at 00:22
  • 1
    `print multiple` will show that it's a boolean array which is `True` whenever the row is a member of a `(dateuse, station`) group with length `> 1`. `must_be_isolated` is simply an array which is True whenever `variable1 == 'yyy'`. `multiple & must_be_isolated` is a booolean array which is `True` whenever there's something we want to be alone (here, `yyy`) which is in a group with more than one member. Those are the rows we want to remove. `~(mul.. & must_..)` is the negation of that, the rows we want to keep. `df[some_boolean_array_here]` selects those rows. Does that make sense? – DSM Nov 07 '13 at 01:57