Suppose I have a pandas Series that I'd like to function as a multimap (multiple values for each index key):
import numpy as np
import pandas as pd

# intval -> data1
a = pd.Series(data=-np.arange(100000),
              index=np.random.randint(0, 50000, 100000))
I'd like to select (as quickly as possible) all the values from `a` where `a`'s index matches another index `b`. (Like an inner join, or a merge, but for Series.) `a` may have duplicates in its index. `b` may not have duplicates and is not necessarily a subset of `a`'s index. To give pandas the best possible chance, let's assume `b` can also be provided as a sorted index object:
b = pd.Index(np.unique(np.random.randint(30000, 100000, 100000))).sort_values()
So, we would have something like:
                     target
a          b         result
3  0       3         3  0
3  1       7         8  3
4  2       8         ...
8  3       ...
9  4
...
I'm also only interested in getting the values of the result (the index [3,8,...] is not needed).
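As a concrete sketch of the intended semantics (toy data mirroring the table above; the variable names are mine), the duplicate-preserving inner join can be expressed with a boolean mask:

```python
import pandas as pd

# Toy multimap: key 3 appears twice in a's index.
a = pd.Series([0, 1, 2, 3, 4], index=[3, 3, 4, 8, 9])
b = pd.Index([3, 7, 8])

# Inner join on the index: keep every row of `a` whose label appears
# in `b`, preserving duplicates.
result = a[a.index.isin(b)]
print(result.values)  # [0 1 3]
```

Note that both rows with key 3 survive, which is exactly the multimap behaviour described above.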
If a
did not have duplicates, we would simply do:
a.reindex(b) # Cannot reindex a duplicate axis
Because `&` maintains the duplicates of `a`, we can't do:

d = a[a.index & b]
d = a.loc[a.index & b]  # same
d = a.get(a.index & b)  # same
print(d.shape)
So I think we need to do something like:

common = (a.index & b).unique()
a.loc[common]
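Spelled out end-to-end, the workaround looks like the following sketch (I've used `Index.intersection`, the method form of `&`, since the operator spelling of Index set operations is deprecated in newer pandas):

```python
import numpy as np
import pandas as pd

a = pd.Series(data=-np.arange(100000),
              index=np.random.randint(0, 50000, 100000))
b = pd.Index(np.unique(np.random.randint(30000, 100000, 100000)))

# Labels present in both indexes; .unique() guards against duplicates
# leaking in from a's side.
common = a.index.intersection(b).unique()

# Duplicates in a's index fan back out here -- this is the slow step.
d = a.loc[common]
```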
... which is cumbersome, but also surprisingly slow. It's not building the list of items to select that's slow:
%timeit (a.index & b).unique()
# 100 loops, best of 3: 3.39 ms per loop
%timeit (a.index & b).unique().sort_values()
# 100 loops, best of 3: 4.19 ms per loop
... so it looks like it's really retrieving the values that's slow:
common = ((a.index & b).unique()).sort_values()
%timeit a.loc[common]
#10 loops, best of 3: 43.3 ms per loop
%timeit a.get(common)
#10 loops, best of 3: 42.1 ms per loop
... That's around 20 operations per second. Not exactly zippy! Why so slow?
Surely there must be a fast way to look up a set of values from a pandas Series? I don't want to get an indexed object out -- really all I'm asking for is a merge over sorted indexes, or (slower) hashed int lookups. Either way, this should be an extremely fast operation -- not a 20-per-second operation on my 3 GHz machine.
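(One candidate for such a fast path, offered here as an assumption rather than an answer: build a boolean mask with `Index.isin`, which does a hash-based membership test plus a single take, and avoids the non-unique label machinery of `.loc` entirely.)

```python
import numpy as np
import pandas as pd

a = pd.Series(data=-np.arange(100000),
              index=np.random.randint(0, 50000, 100000))
b = pd.Index(np.unique(np.random.randint(30000, 100000, 100000)))

# Hash-based membership test on the labels, then one boolean take;
# duplicates in a's index are preserved.
d = a[a.index.isin(b)]
```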
Also: profiling `a.loc[common]` gives:
ncalls  tottime   percall   cumtime   percall    filename:lineno(function)
# All the time spent here.
    40  1.01      0.02525   1.018     0.02546    ~:0(<method 'get_indexer_non_unique' ...>)
   ...                                           indexing.py:1443(_has_valid_type)
# seems to be called a lot.
  1500  0.000582  3.88e-07  0.000832  5.547e-07  ~:0(<isinstance>)
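The profile above can be reproduced along these lines (a sketch; the exact rows and numbers depend on the pandas version):

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

a = pd.Series(-np.arange(100000), index=np.random.randint(0, 50000, 100000))
b = pd.Index(np.unique(np.random.randint(30000, 100000, 100000)))
common = a.index.intersection(b).unique()

# Profile just the slow lookup.
prof = cProfile.Profile()
prof.enable()
d = a.loc[common]
prof.disable()

# Print the ten most expensive calls by total time.
out = io.StringIO()
pstats.Stats(prof, stream=out).sort_stats('tottime').print_stats(10)
print(out.getvalue())
```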
PS. I posted a similar question previously, about why Series.map is so slow: Why is pandas.series.map so shockingly slow? The reason there was lazy under-the-hood indexing. That doesn't seem to be happening here.
Update:

For similarly sized `a` and `common`, where `a`'s index is unique:

%timeit a.loc[common]
1000 loops, best of 3: 760 µs per loop

... as @jpp points out, Multiindex is likely to blame.
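To isolate the unique vs non-unique difference, one can time the same lookup against a unique-index Series of comparable size (my own construction, assuming roughly the sizes above):

```python
import numpy as np
import pandas as pd

common = pd.Index(np.arange(30000, 50000))  # unique, sorted lookup labels

# Non-unique index: .loc must go through get_indexer_non_unique.
a_dup = pd.Series(-np.arange(100000),
                  index=np.random.randint(0, 50000, 100000))

# Unique index over the same key range: .loc takes the fast unique-engine path.
a_uni = pd.Series(-np.arange(50000), index=np.arange(50000))

r_dup = a_dup.loc[a_dup.index.intersection(common).unique()]
r_uni = a_uni.loc[common]
```

Wrapping each `.loc` call in `%timeit` should show the same gap as above.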