92

I have two series s1 and s2 in pandas and want to compute the intersection, i.e. the values that are common to both series.

How would I use the concat function to do this? I have been trying to work it out but have been unable to (I don't want to compute the intersection on the indices of s1 and s2, but on the values).

Georgy
  • 12,464
  • 7
  • 65
  • 73
user7289
  • 32,560
  • 28
  • 71
  • 88
  • An example would be helpful to clarify what you're looking for - e.g. are you doing element-wise sets for a group of columns, or sets of all unique values along a column? I guess folks think the latter, using e.g. the example in the answer by eldad-a. – nealmcb Jan 15 '22 at 01:07

6 Answers

108

Place both series in Python's set container then use the set intersection method:

s1.intersection(s2)

and then transform back to list if needed.

Just noticed the pandas tag. This translates back to pandas as:

pd.Series(list(set(s1).intersection(set(s2))))

From comments I have changed this to a more Pythonic expression, which is shorter and easier to read:

pd.Series(list(set(s1) & set(s2)))

should do the trick, except if the index data is also important to you.

The list(...) is needed to convert the set before passing it to pd.Series, as pandas does not accept a set as direct input for a Series.
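A minimal end-to-end sketch of the approach above, using two small example series (note that a set does not preserve order, so the result's ordering is not guaranteed):

```python
import pandas as pd

s1 = pd.Series([4, 5, 6, 20, 42])
s2 = pd.Series([1, 2, 3, 5, 42])

# Set intersection keeps only values present in both series;
# the original indices are discarded in the process.
common = pd.Series(list(set(s1) & set(s2)))
print(sorted(common.tolist()))  # [5, 42]
```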

Scott Martin
  • 1,260
  • 2
  • 17
  • 27
Joop
  • 7,840
  • 9
  • 43
  • 58
48

Setup:

s1 = pd.Series([4,5,6,20,42])
s2 = pd.Series([1,2,3,5,42])

Timings:

%%timeit
pd.Series(list(set(s1).intersection(set(s2))))
10000 loops, best of 3: 57.7 µs per loop

%%timeit
pd.Series(np.intersect1d(s1,s2))
1000 loops, best of 3: 659 µs per loop

%%timeit
pd.Series(np.intersect1d(s1.values,s2.values))
10000 loops, best of 3: 64.7 µs per loop

So the numpy solution can be comparable to the set solution even for small series, if one uses the values explicitly.

Asclepius
  • 57,944
  • 17
  • 167
  • 143
eldad-a
  • 3,051
  • 3
  • 22
  • 25
  • 10
    Redid the test with the newest numpy (1.8.1) and pandas (0.14.1); your second example is now comparable in timing to the others. With larger data your last method is a clear winner, 3 times faster than the others – Joop Aug 13 '14 at 08:50
  • 1
    It's because the second one is 1000 loops and the rest are 10000 loops – asiehh Nov 02 '20 at 15:54
22

If you are using pandas, I assume you are also using NumPy. NumPy has a function intersect1d that will work with a pandas Series.

Example:

pd.Series(np.intersect1d(pd.Series([1,2,3,5,42]), pd.Series([4,5,6,20,42])))

will return a Series with the values 5 and 42.

jbn
  • 651
  • 3
  • 7
  • 2
    FYI This is orders of magnitude slower than set. :( – Andy Hayden Aug 06 '13 at 12:56
  • For shame. @AndyHayden Is there a reason we can't add set ops to `Series` objects? – Phillip Cloud Aug 06 '13 at 14:53
  • Thanks, @AndyHayden. I had just naively assumed numpy would have faster ops on arrays. A quick `%timeit` test shows you to be _mostly_ correct. My method had an average of 775 us per loop on two Series of 100 randomly generated elements whereas @joop's method had 120 us per loop. However, for larger data sets, this relationship is reversed. On two sets of 100000 elements, my method showed 1.32 ms per loop and @joop's method showed 14.9 ms per loop. – jbn Aug 06 '13 at 15:53
  • very interesting, fyi @cpcloud opened an issue here https://github.com/pydata/pandas/issues/4480 – Andy Hayden Aug 06 '13 at 15:55
  • @jbn see my answer for how to get the numpy solution with comparable timing for short series as well. – eldad-a Jan 16 '14 at 23:34
16

Python

s1 = pd.Series([4,5,6,20,42])
s2 = pd.Series([1,2,3,5,42])

s1[s1.isin(s2)]

R

s1  <- c(4,5,6,20,42)
s2 <- c(1,2,3,5,42)

s1[s1 %in% s2]

Edit: Doesn't handle dupes.
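One way to work around the duplicates issue (an addition of mine, not part of the original answer) is to drop them explicitly after filtering:

```python
import pandas as pd

s1 = pd.Series([4, 5, 5, 42, 42])
s2 = pd.Series([1, 5, 42])

# isin keeps every matching row, including repeats;
# drop_duplicates reduces the result to distinct values.
result = s1[s1.isin(s2)].drop_duplicates()
print(result.tolist())  # [5, 42]
```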

Glen Thompson
  • 9,071
  • 4
  • 54
  • 50
  • 2
    It won't handle duplicates correctly, at least the R code, don't know about python. In R there is `intersect` function, and for data.frame/data.table use `fintersect`. – jangorecki Mar 16 '16 at 18:40
6

pd.merge can be used:

pd.merge(series1, series2, how='inner').drop_duplicates()

Note that the result is a dataframe.
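One caveat worth noting: pd.merge only accepts Series if they have a name attribute (supported since pandas 0.24); unnamed Series raise an error. A sketch with named example series:

```python
import pandas as pd

# Both series must carry the same name so merge can join on that column.
s1 = pd.Series([4, 5, 6, 20, 42], name="vals")
s2 = pd.Series([1, 2, 3, 5, 42], name="vals")

# An inner merge keeps only values present in both series;
# drop_duplicates removes any repeats introduced by the join.
common = pd.merge(s1, s2, how='inner').drop_duplicates()
print(common["vals"].tolist())  # [5, 42]
```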

Asclepius
  • 57,944
  • 17
  • 167
  • 143
kvb
  • 625
  • 3
  • 8
  • 12
0

Here's another solution that checks both left and right inclusion:

import pandas as pd

def intersect(left, right):
    left, right = pd.Series(pd.unique(left)), pd.Series(pd.unique(right))
    right = right.loc[right.isin(left)]
    left  =  left.loc[left.isin(right)]
    return pd.Series(pd.unique(left))

left = pd.Series([1,2,pd.NA, pd.NA, pd.NA], index=[*"abcde"], dtype="Int32")
right = pd.Series([pd.NA, pd.NA, 1, 3], index=[*"efgh"], dtype="Int32")
intersect(left, right)

This has 2 major advantages:

  • It works with pandas Int32 and other nullable data types. If your columns contain pd.NA then np.intersect1d throws an error!

  • It keeps the pandas dtype intact

Hyperplane
  • 1,422
  • 1
  • 14
  • 28