
Suppose I have a pandas Series that I'd like to use as a multimap (multiple values for each index key):

import numpy as np
import pandas as pd

# intval -> data1
a = pd.Series(data=-np.arange(100000),
              index=np.random.randint(0, 50000, 100000))

I'd like to select (as quickly as possible) all the values from a whose index matches another index b. (Like an inner join, or a merge, but for Series.)

  • a may have duplicates in its index.
  • b has no duplicates and is not necessarily a subset of a's index. To give pandas the best possible chance, let's assume b can also be provided as a sorted Index object:
     b = pd.Index(np.unique(np.random.randint(30000, 100000, 100000))).sort_values()

So, we would have something like:

   a            b        target result
3  0            3        3  0
3  1            7        8  3
4  2            8        ...
8  3          ...
9  4
...

I'm also only interested in getting the values of the result (the index [3, 8, ...] is not needed).
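For reference, the semantics I'm after can be written as a merge on the reset index. A rough sketch (not claiming this is the fast path; the column names 'index' and 'data1' just come from reset_index/rename here):

# Hedged sketch: express the multimap lookup as an inner join on the labels.
left = a.rename('data1').reset_index()        # columns: 'index', 'data1'
right = pd.DataFrame(index=b).reset_index()   # single column: 'index'
result = left.merge(right, on='index', how='inner')['data1'].values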

If a did not have duplicates, we would simply do:

a.reindex(b)  # Cannot reindex a duplicate axis

Because & keeps the duplicates from a's index, we can't do:

d = a[a.index & b]
d = a.loc[a.index & b]  # same
d = a.get(a.index & b)  # same
print(d.shape)

So I think we need to do something like:

common = (a.index & b).unique()
a.loc[common]

... which is cumbersome, but is also surprisingly slow. It's not building the list of items to select that's slow:

%timeit (a.index & b).unique()
# 100 loops, best of 3: 3.39 ms per loop
%timeit (a.index & b).unique().sort_values()
# 100 loops, best of 3: 4.19 ms per loop

... so it looks like it's really retrieving the values that's slow:

common = ((a.index & b).unique()).sort_values()

%timeit a.loc[common]
#10 loops, best of 3: 43.3 ms per loop

%timeit a.get(common)
#10 loops, best of 3: 42.1 ms per loop

... That's around 20 operations per second. Not exactly zippy! Why so slow?

Surely there must be a fast way to look up a set of values from a pandas Series? I don't want to get an indexed object out -- really all I'm asking for is a merge over sorted indexes, or (slower) hashed int lookups. Either way, this should be an extremely fast operation -- not a 20-per-second operation on my 3 GHz machine.
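For comparison, here is a sketch of the kind of operation I mean, done as a boolean-mask selection that bypasses .loc entirely (it returns the matching values in a's order rather than in common's order, which is fine since I only want the values):

# Hedged sketch: vectorised membership test instead of label-based lookup.
mask = a.index.isin(b)    # boolean array, one entry per row of a
values = a.values[mask]   # matching values only, duplicates of a preserved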


Also:

Profiling a.loc[common] gives:

ncalls  tottime   percall   cumtime   percall    filename:lineno(function)
# All the time is spent here:
40      1.01      0.02525   1.018     0.02546    ~:0(<method 'get_indexer_non_unique'>)
...                                              indexing.py:1443(_has_valid_type)
...
# seems to be called a lot:
1500    0.000582  3.88e-07  0.000832  5.547e-07  ~:0(<isinstance>)
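The numbers above come from profiling the lookup; something along these lines reproduces it, though the exact rows depend on the pandas version:

# Hedged sketch: profile the slow lookup with the standard-library profiler.
import cProfile
cProfile.run('a.loc[common]', sort='tottime')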

PS. I previously posted a similar question about why Series.map is so slow: Why is pandas.series.map so shockingly slow?. The reason there was lazy under-the-hood indexing. That doesn't seem to be what's happening here.


Update:

For similarly sized a and common, where a's index is unique:

%timeit a.loc[common]
1000 loops, best of 3: 760 µs per loop

... as @jpp points out, the non-unique index is likely to blame.

user48956
    I've given half an answer, i.e. "why is it slow". As to "how can it be made faster [with non-unique indices]", I think this is non-trivial with Pandas. Do leave this question open. – jpp Feb 02 '19 at 02:44
  • @coldspeed, I thought there was a second part to this question (i.e. how to speed it up with non-unique indices), which is why I advised OP not to accept my answer. But maybe that makes it too broad... not sure, but it's a valid and concrete concern. – jpp Feb 02 '19 at 08:30
  • @jpp I would recommend adding a cumcounted index to form a unique multiindex. Should hopefully make it constant time (since the indices are unique). I'll wait a bit for OP to clarify (their second part isn't clear) and will reopen. – cs95 Feb 02 '19 at 09:22
  • Re: duplicate question. My question comes without that knowledge of what is causing the slow behavior. It seems related, but not a duplicate of the referenced question (which pointed out, within the question itself, that non-unique indexes are slow). – user48956 Feb 02 '19 at 23:53
  • Am also looking for how to resolve this issue (the referenced question does not). Re: the cumcounted index -- sounds like the right fix, though it's not obvious how to use such an index without looping in Python (a sketch of one interpretation follows below). Will think on it though. – user48956 Feb 03 '19 at 00:18
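A rough sketch of what the cumcounted-index suggestion might look like (my interpretation only, not code from the comments; groupby(level=0).cumcount() numbers the repeats of each label, so the resulting (label, occurrence) MultiIndex is unique):

# Hedged sketch: make a's index unique by pairing each label with its
# occurrence number, yielding a unique (label, occurrence) MultiIndex.
occurrence = a.groupby(level=0).cumcount()
a_unique = pd.Series(a.values,
                     index=pd.MultiIndex.from_arrays([a.index, occurrence]))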

1 Answer


Repeated indices are guaranteed to slow down your dataframe indexing operations. You can amend your inputs to prove this to yourself:

# non-unique index, as in the question
a = pd.Series(data=-np.arange(100000), index=np.random.randint(0, 50000, 100000))
%timeit a.loc[common]  # 34.1 ms

# unique index of the same size
a = pd.Series(data=-np.arange(100000), index=np.arange(100000))
%timeit a.loc[common]  # 6.86 ms

As mentioned in this related question:

When the index is unique, pandas uses a hashtable to map key to value, O(1). When the index is non-unique but sorted, pandas uses binary search, O(logN); when the index is randomly ordered, pandas needs to check all the keys in the index, O(N).
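A rough way to see the three regimes for yourself (not part of the original answer; timings are indicative and depend on the pandas version):

import numpy as np
import pandas as pd

n = 100000
labels = np.random.randint(0, 50000, n)     # non-unique labels
keys = pd.Index(np.unique(labels))          # every key is guaranteed present

unique_idx  = pd.Series(-np.arange(n), index=np.arange(n))     # unique index
sorted_dups = pd.Series(-np.arange(n), index=np.sort(labels))  # non-unique, sorted
random_dups = pd.Series(-np.arange(n), index=labels)           # non-unique, unsorted

# In IPython:
# %timeit unique_idx.loc[keys]   # hashtable lookup
# %timeit sorted_dups.loc[keys]  # binary search
# %timeit random_dups.loc[keys]  # scan of all keys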

jpp
  • Hmmm... pretty sure these could all be O(1). – user48956 Feb 02 '19 at 03:58
  • @user48956 Can you elaborate on that? – MrR Mar 05 '21 at 16:56
  • @MrR Are you asking why I think hash lookups can be O(1)? That's simply a famous result in computer science. It's possible to achieve O(1) lookup costs no matter whether the index is sorted or not. Pandas chose not to. (sad trombone) – user48956 Mar 09 '21 at 00:03
  • No, I wasn't asking about that. I was asking about elaborating on why you think that pandas choosing binary search on non-unique indexes as a default is unreasonable. What are the tradeoffs? Range-based operations spring to mind. Also having to rebuild the entire index when your index size exceeds a hash threshold size. Granted these same considerations might also apply to unique indexes, but perhaps there is some assumption here that the use cases for non-unique indexes will be slightly different. Or, are we saying the pandas developers got this completely wrong for no apparent reason? – MrR Mar 10 '21 at 10:50