67

Problem

Given data in a Pandas DataFrame like the following:

Name     Amount
---------------
Alice       100
Bob          50
Charlie     200
Alice        30
Charlie      10

I want to select all rows where the Name is one of several values in a collection {Alice, Bob}

Name     Amount
---------------
Alice       100
Bob          50
Alice        30

Question

What is an efficient way to do this in Pandas?

Options as I see them

  1. Loop through rows, handling the logic with Python
  2. Select and merge many statements like the following

    merge(df[df.name = specific_name] for specific_name in names) # something like this
    
  3. Perform some sort of join

What are the performance trade-offs here? When is one solution better than the others? What solutions am I missing?

While the example above uses strings my actual job uses matches on 10-100 integers over millions of rows and so fast NumPy operations may be relevant.

MRocklin
  • 55,641
  • 23
  • 163
  • 235
  • 1
    "my actual job is numeric": what your *actual* problem is matters a lot. If you need to have a tolerance for floats, for example, you won't be able to use the same solution as you would for strings. – DSM Mar 18 '14 at 16:32
  • @DSM edited to specify that I'm matching on integers. Also added scale of problem. – MRocklin Mar 18 '14 at 16:35

2 Answers2

107

You can use the isin Series method:

In [11]: df['Name'].isin(['Alice', 'Bob'])
Out[11]: 
0     True
1     True
2    False
3     True
4    False
Name: Name, dtype: bool

In [12]: df[df.Name.isin(['Alice', 'Bob'])]
Out[12]: 
    Name  Amount
0  Alice     100
1    Bob      50
3  Alice      30
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • 1
    This is indeed the interface I'm looking for. How does it work under the hood? – MRocklin Mar 18 '14 at 16:42
  • Looks like it's using a Cython for loop over the data and `set.__contains__` to test membership – MRocklin Mar 18 '14 at 16:54
  • Yup, checkout pandas.lib.ismember, it's as you say! Interesting that it uses set rather than khash. – Andy Hayden Mar 18 '14 at 17:03
  • Is it possible to specify a regex here for the value that is checked. Suppose the value in a row for a particular column in the table is 'hello world foo bar' and I need to return this row if the string 'foo' is present in the column. – Aditya Nov 29 '16 at 07:10
  • 1
    @aditya yes, use [`df['Name'].str.contains('foo')`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html). – Andy Hayden Nov 29 '16 at 08:05
  • Thanks a lot @AndyHayden – Aditya Nov 29 '16 at 08:07
  • Thanks a lot. This is the cleanest solution. – Roy Feb 10 '20 at 22:53
8

Since, in your actual use case, the values in df['Name'] are ints, you might be able to generate the boolean mask faster using NumPy indexing instead of Series.isin.

idx = np.zeros(N, dtype='bool')
idx[names] = True
df[idx[df['Name'].values]]

For example, given this setup:

import pandas as pd
import numpy as np

N = 100000
df = pd.DataFrame(np.random.randint(N, size=(10**6, 2)), columns=['Name', 'Amount'])
names = np.random.choice(np.arange(N), size=100, replace=False)

In [81]: %timeit idx = np.zeros(N, dtype='bool'); idx[names] = True; df[idx[df['Name'].values]]
100 loops, best of 3: 9.88 ms per loop

In [82]: %timeit df[df.Name.isin(names)]
10 loops, best of 3: 107 ms per loop

In [83]: 107/9.88
Out[83]: 10.82995951417004

N is (essentially) the maximum value that df['Names'] can attain. If N is smaller, the speed benefit is not as large. With N = 200,

In [93]: %timeit idx = np.zeros(N, dtype='bool'); idx[names] = True; df[idx[df['Name'].values]]
10 loops, best of 3: 62.6 ms per loop

In [94]: %timeit df[df.Name.isin(names)]
10 loops, best of 3: 178 ms per loop

In [95]: 178/62.6
Out[95]: 2.8434504792332267

Caution: As shown above, there seems to be a speed benefit, particularly as N gets large. However, if N is too large, then forming idx = np.zeros(N, dtype='bool') may not be feasible.


Sanity check:

expected = df[df.Name.isin(names)]
idx = np.zeros(N, dtype='bool')
idx[names] = True
result = df[idx[df['Name'].values]]
assert expected.equals(result)
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • I wonder if perf of isin could be improved using khash rather than set in lib.ismember for large sets (esp. for int/non-object use case). – Andy Hayden Mar 18 '14 at 18:41
  • @AndyHayden: Unfortunately, I know almost nothing about khash. Where can I learn about it? – unutbu Mar 18 '14 at 19:09
  • Well, I've just been checking out the source in pandas's hashtables. Looks like could use e.g `h = pd.hashtable.Int64HashTable()`. – Andy Hayden Mar 18 '14 at 19:18
  • Your way is still **significantly** faster (though certainly HashTable is significantly better than set). – Andy Hayden Mar 18 '14 at 20:17
  • Actually looking back at your timeits something weird was going on earlier, will have to check this out again. (I was seeing your method a thousand times faster, which can't be right and isn't what you see) – Andy Hayden Mar 18 '14 at 20:44
  • df['Name'] are not 'ints'. – RukTech May 01 '15 at 21:33
  • @RukTech: The OP's question states (at the end), "While the example above uses strings my actual job uses matches on 10-100 integers over millions of rows and so fast NumPy operations may be relevant." – unutbu May 01 '15 at 21:38
  • Even then the purpose is to find specific strings or 'ints'. Using a randomly generated index (names = np.random.choice(np.arange(N), size=100, replace=False)) to access certain rows of the dataframe will be fast. The purpose to find targeted rows of the dataframe. – RukTech May 01 '15 at 21:42
  • This is very interesting. I compared use of `isin` against indexing with Numpy, and for my case, where N varied between 14 and 6000 per data frame, `isin` was always faster. (My use case is similar to the OP, but for mine, I am matching integer columns in 2 data frames, against each other). – horcle_buzz Feb 01 '16 at 19:10
  • Small note: instead of `idx = np.zeros(N, dtype='bool')` followed by `idx[names] = True` one can use `idx = np.ones(N, dtype='bool')` – soegaard Aug 05 '22 at 09:34