
I have a very large data frame df that looks like:

ID       Value1    Value2
1345      3.2      332
1355      2.2      32
2346      1.0      11
3456      8.9      322

And I have a list ID_list that contains a subset of IDs. I need a subset of df for the IDs contained in ID_list.

Currently, I am using df_sub = df[df.ID.isin(ID_list)] to do it, but it takes a lot of time. The IDs contained in ID_list don't follow any pattern, so they are not within a certain range. (And I need to apply the same operation to many similar dataframes.) I was wondering if there is any faster way to do this. Would it help a lot to make ID the index?

Thanks!

user3576212
  • Consider to use Binary Search. http://stackoverflow.com/questions/212358/binary-search-in-python – hoozecn May 30 '14 at 01:17
  • `isin` is implemented in Cython, so it's already pretty fast. You may have luck with `df.query('ID in @ID_list')` for larger dataframes. – TomAugspurger May 30 '14 at 02:17
  • show df.info(), and the len of ID_list. You might be swapping memory if your frame is really large. – Jeff May 30 '14 at 02:32
  • `isin` uses `set`; because of this, pandas needs to convert every integer in the ID column to an integer object. What is the value range of `ID_list`? If it's not very large, you can use `bincount(ID_list)` to create a lookup table. – HYRY May 30 '14 at 07:58
  • Confirmed through testing just now: isin gives same performance with Numba, Cython, and on its own with neither. – horcle_buzz Jan 31 '16 at 01:26
  • `df.query` and `pd.eval` seem like good fits for this use case. For information on the `pd.eval()` family of functions, their features and use cases, please visit [Dynamic Expression Evaluation in pandas using pd.eval()](https://stackoverflow.com/questions/53779986/dynamic-expression-evaluation-in-pandas-using-pd-eval). – cs95 Dec 16 '18 at 04:56
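HYRY's `bincount` lookup-table idea from the comments can be sketched like this. This is a minimal sketch with made-up toy data; it only pays off when the maximum ID value is modest, because the table needs one slot per possible ID:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1345, 1355, 2346, 3456],
                   'Value1': [3.2, 2.2, 1.0, 8.9]})
ID_list = [1355, 3456]

# Build a boolean lookup table indexed by ID value: lookup[i] is True
# iff i appears in ID_list. The table has max(ID)+1 slots.
size = max(df['ID'].max(), max(ID_list)) + 1
lookup = np.bincount(ID_list, minlength=size).astype(bool)

# Index the table with the ID column to get a boolean mask.
df_sub = df[lookup[df['ID'].to_numpy()]]
```

This replaces the hash lookups of `isin` with a single vectorized array indexing operation.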

2 Answers


EDIT 2: Here's a link to a more recent look into the performance of various pandas operations, though it doesn't seem to include merge and join to date.

https://github.com/mm-mansour/Fast-Pandas

EDIT 1: These benchmarks were run on a quite old version of pandas and are likely no longer relevant. See Mike's comment below on merge.

It depends on the size of your data, but for large datasets DataFrame.join seems to be the way to go. This requires your DataFrame index to be your 'ID' and the Series or DataFrame you're joining against to have an index that is your 'ID_list'. The Series must also have a name to be used with join, which is pulled in as a new column under that name. You also need to specify an inner join to get behavior like isin, because join defaults to a left join. query's 'in' syntax seems to have the same speed characteristics as isin for large datasets.
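As a minimal, self-contained sketch of that setup (the toy data here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1345, 1355, 2346, 3456],
                   'Value1': [3.2, 2.2, 1.0, 8.9]}).set_index('ID')
ID_list = [1355, 3456]

# The right side must be indexed by the IDs, and the Series needs a
# name, which join pulls in as a new column.
l_series = pd.Series(ID_list, index=ID_list, name='ID_list')

# how='inner' keeps only rows whose index appears in both sides,
# mimicking isin; the default left join would keep all of df.
df_sub = df.join(l_series, how='inner')
```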

If you're working with small datasets, you get different behaviors and it actually becomes faster to use a list comprehension or apply against a dictionary than using isin.

Otherwise, you can try to get more speed with Cython.

import pandas as pd

# I'm ignoring that the index is defaulting to a sequential number. You
# would need to explicitly assign your IDs to the index here, e.g.:
# >>> l_series.index = ID_list
l = range(1000000)
l_series = pd.Series(l)

df = pd.DataFrame(l_series, columns=['ID'])


In [247]: %timeit df[df.index.isin(l)]
1 loops, best of 3: 1.12 s per loop

In [248]: %timeit df[df.index.isin(l_series)]
1 loops, best of 3: 549 ms per loop

# index vs column doesn't make a difference here
In [304]: %timeit df[df.ID.isin(l_series)]
1 loops, best of 3: 541 ms per loop

In [305]: %timeit df[df.index.isin(l_series)]
1 loops, best of 3: 529 ms per loop

# query 'in' syntax has the same performance as 'isin'
In [249]: %timeit df.query('index in @l')
1 loops, best of 3: 1.14 s per loop

In [250]: %timeit df.query('index in @l_series')
1 loops, best of 3: 564 ms per loop

# ID must be the index for DataFrame.join and l_series must have a name.
# join defaults to a left join so we need to specify inner for existence.
In [251]: %timeit df.join(l_series, how='inner')
10 loops, best of 3: 93.3 ms per loop

# Smaller datasets.
df = pd.DataFrame([1,2,3,4], columns=['ID'])
l = range(10000)
l_dict = dict(zip(l, l))
l_series = pd.Series(l)
l_series.name = 'ID_list'


In [363]: %timeit df.join(l_series, how='inner')
1000 loops, best of 3: 733 µs per loop

In [291]: %timeit df[df.ID.isin(l_dict)]
1000 loops, best of 3: 742 µs per loop

In [292]: %timeit df[df.ID.isin(l)]
1000 loops, best of 3: 771 µs per loop

In [294]: %timeit df[df.ID.isin(l_series)]
100 loops, best of 3: 2 ms per loop

# It's actually faster to use apply or a list comprehension for these small cases.
In [296]: %timeit df[[x in l_dict for x in df.ID]]
1000 loops, best of 3: 203 µs per loop

In [299]: %timeit df[df.ID.apply(lambda x: x in l_dict)]
1000 loops, best of 3: 297 µs per loop
chancyk
  • This is a nice answer! I think this would make a nice new section here: http://pandas-docs.github.io/pandas-docs-travis/enhancingperf.html on performance of indexing operations (the indexing section is getting a bit long); want to do a pull request for this? – Jeff May 30 '14 at 03:38
  • Thank you all for the help. I've tried out these different methods and will get back with results. But how can I apply Cython to pandas? It's hard to declare the types. – user3576212 May 30 '14 at 05:29
  • @user3576212, the link I provided gives a pretty good intro into using Cython with pandas: http://pandas.pydata.org/pandas-docs/stable/enhancingperf.html. I would probably try to use join or hash tables before you get into that though. You may not get much gain anyway if you're just looking to do an isin type of operation. It's better if you have a somewhat unique algorithm you need to optimize. What kind of quantities are you working with? – chancyk May 30 '14 at 06:29
  • @Jeff, I can do that. Were you thinking to just include some benchmarks? Are there any other avenues you can think of that we should test for this purpose? – chancyk May 30 '14 at 15:16
  • Your answer is pretty good (you need to leave the benchmarks in code-blocks FYI); let's start with that (maybe a subsection on indexing performance or something) – Jeff May 30 '14 at 15:20
  • @Jeff, I've submitted a [pull request](https://github.com/pydata/pandas/pull/7398) but it may still need a few tweaks yet. – chancyk Jun 09 '14 at 03:26
  • I just wanted to add that (at least in 18.1) I found that merge actually beats join, and you don't have to worry about messing with the index. Try something like `df.merge(pandas.DataFrame(value, columns=[dimension]), on=[dimension])`. At least for me it was able to accomplish the same task in 60% of the time. – Mike Sep 21 '16 at 20:32
  • @Mike What if we want to negate `isin`? – Rafael Almeida Apr 18 '18 at 09:47
  • @Mike, could you develop your comment into a detailed answer? I'd really appreciate it! – phasselmann Jul 14 '19 at 10:26
  • @phasselmann sure, using the same syntax as the answer, you could convert the series to a frame like `l_df = l_series.to_frame()`, then you could filter the df with `filtered_df = df.merge(l_df, left_index=True, right_index=True)`. Whereas the join solution takes 10.2 ms for me, the merge solution takes 6.7 ms. – Mike Jul 15 '19 at 18:45
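Mike's merge-based alternative from the comments can be sketched like this (a minimal sketch with made-up toy data; no index manipulation required):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1345, 1355, 2346, 3456],
                   'Value1': [3.2, 2.2, 1.0, 8.9]})
ID_list = [1355, 3456]

# An inner merge on ID keeps only rows whose ID appears in ID_list,
# without touching the index. merge defaults to how='inner'.
df_sub = df.merge(pd.DataFrame({'ID': ID_list}), on='ID')
```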

Yes, isin is quite slow.

Instead, it's faster to make ID the index and then use loc, like:

df.set_index('ID', inplace=True)
df.loc[list_of_indices]

Actually, what brought me to this page was that I needed to create a label in one df based on the index of another df: "if df_1's index matches df_2's index, label it 1, otherwise NaN", which I accomplished like this:

df_2['label'] = 1  # Create a label column
df_1.join(df_2['label'])

Which is also very fast.
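A runnable version of that labeling trick, with made-up toy data:

```python
import pandas as pd

df_1 = pd.DataFrame({'Value1': [3.2, 2.2, 1.0]}, index=[1345, 1355, 2346])
df_2 = pd.DataFrame(index=[1355, 2346])

df_2['label'] = 1                # every row of df_2 gets the label
df_1 = df_1.join(df_2['label'])  # left join: non-matching indices get NaN
```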

Nic Scozzaro
  • Very nice idea, although this assumes that the field used for indexing is unique (a primary key), which may not be true. – Neil Feb 22 '22 at 00:36