What is the most idiomatic way to index an object with a boolean array in pandas?

Question

I am asking specifically about pandas 0.11, as I am busy replacing my uses of .ix with either .loc or .iloc. I like that choosing between .loc and .iloc communicates whether I intend to index by label or by integer position. I see that either one will also accept a boolean array, but I would like to keep their usage pure so that my intent stays clear.

Andy Hayden · Accepted Answer · 2013-05-19T18:15:33.253

In 0.11.0 all three methods work, and the way suggested in the docs is simply to use df[mask]. However, this is not done by position but purely by label, so in my opinion loc best describes what's actually going on.

Update: I asked on GitHub about this, the conclusion being that df.iloc[mask] will raise a NotImplementedError (if the mask is integer-indexed) or a ValueError (if it is not integer-indexed) in pandas 0.11.1.

In [1]: df = pd.DataFrame(range(5), list('ABCDE'), columns=['a'])

In [2]: mask = (df.a%2 == 0)

In [3]: mask
Out[3]:
A     True
B    False
C     True
D    False
E     True
Name: a, dtype: bool

In [4]: df[mask]
Out[4]:
   a
A  0
C  2
E  4

In [5]: df.loc[mask]
Out[5]:
   a
A  0
C  2
E  4

In [6]: df.iloc[mask]  # Due to this question, this will give a ValueError (in 0.11.1)
Out[6]:
   a
A  0
C  2
E  4

Perhaps worth noting that if you gave mask an integer index, it would throw an error:

mask.index = range(5)
df.iloc[mask]  # or any of the others
IndexingError: Unalignable boolean Series key provided

This demonstrates that iloc isn't actually implemented for a boolean Series: it aligns on labels, which is why 0.11.1 will raise a NotImplementedError when we try this.
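To make the label-versus-position distinction concrete, here is a minimal sketch. It assumes a reasonably recent pandas (one that provides Series.to_numpy), not 0.11, and the data and the names reversed_mask, by_label and by_position are made up for illustration. A boolean Series passed to .loc is aligned on its index labels, while the same values passed to .iloc as a plain NumPy array are applied by position.

```python
import pandas as pd

# Hypothetical example data, chosen only for illustration.
df = pd.DataFrame({"a": range(5)}, index=list("ABCDE"))

# An asymmetric mask, so that label alignment and positional use differ.
mask = df["a"] < 2  # True for A and B only

# The same mask with its rows in reverse order (E, D, C, B, A).
reversed_mask = mask.iloc[::-1]

# .loc aligns the boolean Series on its index labels before filtering,
# so the row order of the mask does not matter.
by_label = df.loc[reversed_mask]

# Converting to a NumPy array drops the index, so the values are applied
# purely by position; .iloc accepts such an array.
by_position = df.iloc[reversed_mask.to_numpy()]

print(by_label.index.tolist())     # ['A', 'B']
print(by_position.index.tolist())  # ['D', 'E']
```

So if positional semantics are really what you want, stripping the index from the mask first states that intent explicitly; otherwise .loc matches what pandas does with a boolean Series.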

Thanks, I hadn't considered the .iloc behaviour with an integer index. To be honest I had actually forgotten that mask has an .index attribute in this case, I was thinking of it as purely a boolean numpy array. I agree that given that the .index attribute is actually used to do an alignment first, .loc is probably the best choice. — snth, May 17 '13 at 13:11

score 1 · Answer 2 · edited May 17 '13 at 07:38

I am currently using [], i.e. __getitem__(), e.g.

df = pd.DataFrame(dict(a=range(5)))
df[df.a%2==0]

answered May 17 '13 at 07:32 by snth · edited May 17 '13 at 07:38 by jamylak
Don't use the `dict` constructor; use `{'a': range(5)}`, it looks much nicer. – jamylak May 17 '13 at 07:38
Dict literals also happen to be a lot faster. – Aryeh Leib Taurog May 28 '14 at 08:53
Good to know about the speed difference. I prefer the look of the `dict` constructor, and I also like that I don't have to put quotes around the keyword arguments, which makes typing easier. However, if the speed difference is big, then perhaps I will switch to dict literals. – snth May 28 '14 at 11:28
