0

One way to test if a value occurs in a column in a Pandas data frame is to count the occurrences of each value, so for example if we want to check for the occurrence of 'dog' in the column 'pet' we could do this

at_least_one_dog = df['pet'].value_counts()['dog'] > 0

or we could check that it is in the list of all the values in the column like this

at_least_one_dog = 'dog' in df['pet'].unique()

But both of these methods process the entire column before producing an answer, which can be time consuming for a very large data frame. Is there a way to test for the occurrence of a value in a column that returns True immediately the first occurrence is encountered without necessarily going through the whole column?

dumbledad
  • 16,305
  • 23
  • 120
  • 273
  • Not sure if 100% dupe, but close, what need is [this](https://stackoverflow.com/questions/53020764/efficiently-return-the-index-of-the-first-value-satisfying-condition-in-array) – jezrael Dec 10 '20 at 10:09
  • but unfortunately not working with strings – jezrael Dec 10 '20 at 10:20

1 Answers1

0

Python's in operator (aka contains) stops at first occurrence, see CPython implementation. So you could use at_least_one_dog = 'dog' in df['pet'].to_list().
The bad thing, however, is that the conversion to_list() itself takes longer than looping through this list. That's why you don't see any difference in the performance plots between the cases where 'dog' is in the first or in the last row for larger dataframes. But anyway in or isin is about 10 to 100 times faster than evaluating all elements with value_counts or unique.

import pandas as pd
from  pandas.util.testing import makeStringIndex
import perfplot

def setup(n, where):
    df = pd.DataFrame({'pet': makeStringIndex(n)})
    df.iloc[where,0] = 'dog'
    return df

perfplot.show(
    setup=lambda n: setup(n, 0),  # 0 = start, -1 = end
    kernels=[
        lambda df: df['pet'].value_counts()['dog'] > 0,
        lambda df: 'dog' in df['pet'].unique(),
        lambda df: 'dog' in df['pet'].to_list(),
        lambda df: 'dog' in df['pet'].to_numpy(),
        lambda df: df['pet'].isin(['dog']).any(),
    ],
    labels=['value_counts > 0', 'in unique', 'in list', 'in array', 'isin' ],
    n_range=[10**k for k in range(6,8)],
    xlabel='len(df)',
    title="'dog' at start position"
)

enter image description here

Stef
  • 28,728
  • 2
  • 24
  • 52