
In Python generally, membership of a hashable collection is best tested via set, because hashing gives us O(1) lookup complexity versus O(n) for list or np.ndarray.
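
As a minimal illustration of that complexity gap (a self-contained sketch; the probed element and the collection size are arbitrary choices for this example):

import timeit

setup = "x = list(range(100000)); s = set(x)"
print(timeit.timeit("99999 in s", setup=setup, number=100))  # hashed lookup: O(1) per test
print(timeit.timeit("99999 in x", setup=setup, number=100))  # linear scan: O(n) per test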

In Pandas, I often have to check for membership in very large collections. I presumed that the same would apply, i.e. that checking each item of a series for membership in a set would be more efficient than using a list or np.ndarray. However, this doesn't seem to be the case:

import numpy as np
import pandas as pd

np.random.seed(0)

x_set = {i for i in range(100000)}
x_arr = np.array(list(x_set))
x_list = list(x_set)

arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)
lst = arr.tolist()

%timeit ser.isin(x_set)                   # 8.9 ms
%timeit ser.isin(x_arr)                   # 2.17 ms
%timeit ser.isin(x_list)                  # 7.79 ms
%timeit np.in1d(arr, x_arr)               # 5.02 ms
%timeit [i in x_set for i in lst]         # 1.1 ms
%timeit [i in x_set for i in ser.values]  # 4.61 ms

Versions used for testing:

np.__version__  # '1.14.3'
pd.__version__  # '0.23.0'
sys.version     # '3.6.5'

The source code for pd.Series.isin, I believe, utilises numpy.in1d, which presumably means a large overhead for converting a set to an np.ndarray.
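
(As the comments and the answer below clarify, the dispatch inside pd.Series.isin for these versions looks roughly like the following paraphrase. This is a conceptual sketch, not the actual pandas source: the 1000000-element threshold is real, but the hash-table branch is modelled with a plain Python set standing in for pandas' khash-based hashtable.ismember_int64.)

import numpy as np

def isin_sketch(comps, values):
    values = np.asarray(list(values))    # any set/list is first converted to an ndarray
    if len(comps) > 1000000 and comps.dtype != object:
        return np.in1d(comps, values)    # sort-based path, only for very long Series
    table = set(values.tolist())         # stand-in: pandas builds a khash table here, O(n)
    return np.array([c in table for c in comps.tolist()])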

Setting aside the cost of constructing the inputs, the implications for Pandas:

  • If you know the elements of x_list or x_arr are unique, don't bother converting to x_set. This will be costly (both the conversion and the membership tests) for use with Pandas.
  • Using a list comprehension is the only way to benefit from O(1) set lookup.

My questions are:

  1. Is my analysis above correct? This seems like an obvious, yet undocumented, result of how pd.Series.isin has been implemented.
  2. Is there a workaround, without using a list comprehension or pd.Series.apply, which does utilise O(1) set lookup? Or is this an unavoidable design choice and/or corollary of having NumPy as the backbone of Pandas?

Update: On an older setup (older Pandas / NumPy versions) I see x_set outperform x_arr with pd.Series.isin. So an additional question: has anything fundamentally changed between the old and new versions to make performance with set worse?

%timeit ser.isin(x_set)                   # 10.5 ms
%timeit ser.isin(x_arr)                   # 15.2 ms
%timeit ser.isin(x_list)                  # 9.61 ms
%timeit np.in1d(arr, x_arr)               # 4.15 ms
%timeit [i in x_set for i in lst]         # 1.15 ms
%timeit [i in x_set for i in ser.values]  # 2.8 ms

pd.__version__  # '0.19.2'
np.__version__  # '1.11.3'
sys.version     # '3.6.0'
jpp
  • Yes. Peruse the source code yourself: [`np.in1d`](https://github.com/numpy/numpy/blob/master/numpy/lib/arraysetops.py#L434-L543)... by the way, what is `x` in your question? – cs95 Jun 10 '18 at 00:38
  • Notice that `in1d` is only used for Series of size greater than 1000000. – miradulo Jun 10 '18 at 00:39
  • There is a call to `np.unique` in there, so calling `set` yourself does not make a difference. – cs95 Jun 10 '18 at 00:39
  • @coldspeed, `x` is `x_set`, have updated. Yeh, it seems strange that there *isn't* a way to implement O(1) lookup in Pandas. I thought there might be some obscure way to do this. – jpp Jun 10 '18 at 00:40
  • Your assumption hinges on the fact that there is something along the lines of `[x in myvalues for x in Series]` but that doesn't scale at all, so they use something else where it becomes irrelevant what myvalues is (set or list). – cs95 Jun 10 '18 at 00:41
  • @coldspeed, So it's a design choice. Have one solution for `set` or `list` and sacrifice O(1) lookup? If that's the case, I accept it. Just seems a strange sacrifice when it theoretically should be trivial to avoid `np.unique` via an optional parameter. – jpp Jun 10 '18 at 00:45
  • When `myvalues` is so small that it doesn't matter, then the O(1) lookup is irrelevant. When `myvalues` is big enough that the O(1) lookup still isn't enough... well that's where the unique + merge sort kicks in. It's ingenious imo. – cs95 Jun 10 '18 at 00:46
  • Your randint doesn't go all the way up to the set's max, you might see different numbers if you use, say, 200000. Using numpy.in1d seems like an odd choice. But in general... when does this come up? (When do you test contains on a large set?) Another workaround is to use `ser.apply(x_set.__contains__)`. It's strange, as I thought there was a pandas internal dict-like datastructure that could be used in cases like this (khash?). – Andy Hayden Jun 10 '18 at 01:52
  • I thought I could do better with `x_idx = pd.RangeIndex(100000); %timeit ser.isin(x_idx)` but maddeningly it is slower than all your methods. It seems intuition doesn't work here. – John Zwinck Jun 10 '18 at 02:35
  • `set` & `frozenset` will also consume much more memory than numpy arrays. For really big datasets, processing on a single machine may not even be possible with Python's native `set` & `frozenset`. I recently faced the memory issue trying to remove sets that are subsets of others in a list of 1,000,000 sets. – Haleemur Ali Jun 12 '18 at 18:54

1 Answer


This might not be obvious, but pd.Series.isin uses O(1) lookup per element.

After an analysis which proves the above statement, we will use its insights to create a Cython prototype that can easily beat the fastest out-of-the-box solution.


Let's assume that the "set" has n elements and the "series" has m elements. The running time is then:

 T(n, m) = T_preprocess(n) + m * T_lookup(n)

For the pure-python version, that means:

  • T_preprocess(n)=0 - no preprocessing needed
  • T_lookup(n)=O(1) - well known behavior of python's set
  • results in T(n,m)=O(m)

What happens for pd.Series.isin(x_arr)? Obviously, if we skip the preprocessing and search in linear time we will get O(n*m), which is not acceptable.

It is easy to see, with the help of a debugger or a profiler (I used valgrind-callgrind + kcachegrind), what is going on: the workhorse is the function __pyx_pw_6pandas_5_libs_9hashtable_23ismember_int64. Its definition can be found here:

  • In a preprocessing step, a hash-map (pandas uses khash from klib) is created out of n elements from x_arr, i.e. in running time O(n).
  • m look-ups happen in O(1) each or O(m) in total in the constructed hash-map.
  • results in T(n,m)=O(m)+O(n)
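
To make these two phases concrete, here is a pure-Python model of what ismember_int64 does (a conceptual sketch only; a Python set stands in for the khash table):

import numpy as np

def ismember_model(arr, values_arr):
    table = set(values_arr.tolist())     # phase 1: build the hash table - O(n) preprocessing
    return np.fromiter((v in table for v in arr.tolist()),
                       dtype=bool, count=len(arr))  # phase 2: m lookups, O(1) each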

We must remember: the elements of the numpy array are raw C integers and not the Python objects in the original set - so we cannot use the set as it is.

An alternative to converting the set of Python objects to a set of C ints would be to convert the single C ints to Python objects and thus be able to use the original set. That is what happens in the [i in x_set for i in ser.values] variant:

  • No preprocessing.
  • m look-ups happen in O(1) time each or O(m) in total, but the look-up is slower due to the necessary creation of a Python object.
  • results in T(n,m)=O(m)

Clearly, you could speed up this version a little by using Cython, for example as sketched below.
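
A minimal sketch of such a cythonized variant (hypothetical, mirroring the pattern of the khash version further down; note that the set lookup still has to box each C integer into a Python int):

%%cython
import numpy as np
cimport numpy as np

def isin_pyset(np.ndarray[np.int64_t, ndim=1] a, set b):
    cdef np.ndarray[np.uint8_t, ndim=1, cast=True] res = np.empty(a.shape[0], dtype=np.bool)
    cdef Py_ssize_t i
    for i in range(a.shape[0]):
        res[i] = a[i] in b   # each a[i] is boxed to a Python int for the set lookup
    return res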

But enough theory: let's take a look at the running times for different values of n with fixed m:

[Plot: running time vs. n for the numpy, pure python and numpy->python versions]

We can see: the linear-time preprocessing dominates the numpy version for big n. The version with conversion from numpy to pure python (numpy->python) has the same constant behavior as the pure-python version, but is slower because of the necessary conversion - all in accordance with our analysis.

What cannot be seen well in the diagram: if n < m, the numpy version becomes faster - in this case the faster lookup of the khash lib plays the most important role, not the preprocessing part.

My take-aways from this analysis:

  • n < m: pd.Series.isin should be taken because O(n)-preprocessing isn't that costly.

  • n > m: (probably a cythonized version of) [i in x_set for i in ser.values] should be taken and thus the O(n) preprocessing avoided.

  • clearly there is a gray zone where n and m are approximately equal and it is hard to tell which solution is best without testing.

  • If you have it under your control: the best thing would be to build the set directly as a C-integer set (khash (already wrapped in pandas) or maybe even some C++ implementations), thus eliminating the need for preprocessing. I don't know whether there is something in pandas you could reuse, but it is probably not a big deal to write the function in Cython.


The problem is that the last suggestion doesn't work out of the box, as neither pandas nor numpy has a notion of a set in its interfaces (at least to my limited knowledge). But having raw C-set interfaces would be the best of both worlds:

  • no preprocessing needed because values are already passed as a set
  • no conversion needed because the passed set consists of raw-C-values

I've coded a quick-and-dirty Cython wrapper for khash (inspired by the wrapper in pandas), which can be installed via pip install https://github.com/realead/cykhash/zipball/master and then used with Cython for a faster isin version:

%%cython
import numpy as np
cimport numpy as np

from cykhash.khashsets cimport Int64Set

def isin_khash(np.ndarray[np.int64_t, ndim=1] a, Int64Set b):
    cdef np.ndarray[np.uint8_t,ndim=1, cast=True] res=np.empty(a.shape[0],dtype=np.bool)
    cdef int i
    for i in range(a.size):
        res[i]=b.contains(a[i])
    return res
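
For example (usage mirrors listing D below; Int64Set and its add method come from the installed cykhash package):

from cykhash import Int64Set

khash_set = Int64Set()
for v in x_set:
    khash_set.add(v)

mask = isin_khash(ser.values, khash_set)   # boolean numpy array; no per-call preprocessing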

As a further possibility, C++'s unordered_set can be wrapped (see listing C), which has the disadvantage of needing C++ libraries and (as we will see) of being slightly slower.

Comparing the approaches (see listing D for the creation of the timings):

[Plot: running time vs. n for the khash, cpp, pure python and numpy->python versions]

khash is about a factor of 20 faster than numpy->python, about a factor of 6 faster than pure python (but pure python is not what we want anyway), and even about a factor of 3 faster than the cpp version.


Listings

A: profiling with valgrind:

#isin.py
import numpy as np
import pandas as pd

np.random.seed(0)

x_set = {i for i in range(2*10**6)}
x_arr = np.array(list(x_set))


arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)


for _ in range(10):
   ser.isin(x_arr)

and now:

>>> valgrind --tool=callgrind python isin.py
>>> kcachegrind

leads to the following call graph:

[Call graph from kcachegrind, with ismember_int64 as the hot spot]

B: ipython code for producing the running times:

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

np.random.seed(0)

x_set = {i for i in range(10**2)}
x_arr = np.array(list(x_set))
x_list = list(x_set)

arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)
lst = arr.tolist()

n=10**3
result=[]
while n<3*10**6:
    x_set = {i for i in range(n)}
    x_arr = np.array(list(x_set))
    x_list = list(x_set)

    t1=%timeit -o  ser.isin(x_arr) 
    t2=%timeit -o  [i in x_set for i in lst]
    t3=%timeit -o  [i in x_set for i in ser.values]

    result.append([n, t1.average, t2.average, t3.average])
    n*=2

#plotting result:
for_plot=np.array(result)
plt.plot(for_plot[:,0], for_plot[:,1], label='numpy')
plt.plot(for_plot[:,0], for_plot[:,2], label='python')
plt.plot(for_plot[:,0], for_plot[:,3], label='numpy->python')
plt.xlabel('n')
plt.ylabel('running time')
plt.legend()
plt.show()

C: cpp-wrapper:

%%cython --cplus -c=-std=c++11 -a

from libcpp.unordered_set cimport unordered_set

cdef class HashSet:
    cdef unordered_set[long long int] s
    cpdef add(self, long long int z):
        self.s.insert(z)
    cpdef bint contains(self, long long int z):
        return self.s.count(z)>0

import numpy as np
cimport numpy as np

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def isin_cpp(np.ndarray[np.int64_t, ndim=1] a, HashSet b):
    cdef np.ndarray[np.uint8_t,ndim=1, cast=True] res=np.empty(a.shape[0],dtype=np.bool)
    cdef int i
    for i in range(a.size):
        res[i]=b.contains(a[i])
    return res

D: plotting results with different set-wrappers:

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from cykhash import Int64Set

np.random.seed(0)

x_set = {i for i in range(10**2)}
x_arr = np.array(list(x_set))
x_list = list(x_set)


arr = np.random.randint(0, 20000, 10000)
ser = pd.Series(arr)
lst = arr.tolist()

n=10**3
result=[]
while n<3*10**6:
    x_set = {i for i in range(n)}
    x_arr = np.array(list(x_set))
    cpp_set=HashSet()
    khash_set=Int64Set()

    for i in x_set:
        cpp_set.add(i)
        khash_set.add(i)


    assert((ser.isin(x_arr).values==isin_cpp(ser.values, cpp_set)).all())
    assert((ser.isin(x_arr).values==isin_khash(ser.values, khash_set)).all())


    t1=%timeit -o  isin_khash(ser.values, khash_set)
    t2=%timeit -o  isin_cpp(ser.values, cpp_set) 
    t3=%timeit -o  [i in x_set for i in lst]
    t4=%timeit -o  [i in x_set for i in ser.values]

    result.append([n, t1.average, t2.average, t3.average, t4.average])
    n*=2

#plotting result:
for_plot=np.array(result)
plt.plot(for_plot[:,0], for_plot[:,1], label='khash')
plt.plot(for_plot[:,0], for_plot[:,2], label='cpp')
plt.plot(for_plot[:,0], for_plot[:,3], label='pure python')
plt.plot(for_plot[:,0], for_plot[:,4], label='numpy->python')
plt.xlabel('n')
plt.ylabel('running time')
ymin, ymax = plt.ylim()
plt.ylim(0,ymax)
plt.legend()
plt.show()
ead
  • This is brilliant. I made the rookie error of assuming complexity from timings. I like the way you explained the trade-off between hash-map creation time & performance for `n < m` versus performance for `n > m`. – jpp Jun 15 '18 at 22:46
  • Boy, this answer is longer than my phd thesis. – anishtain4 Jun 20 '18 at 17:35
  • i could give you a lifetime beer if you want.. just really really good explanation damn! – nickanor Oct 26 '22 at 02:47