19

I have a dataframe df with a very long column of random positive integers:

import numpy as np
import pandas as pd

df = pd.DataFrame({'n': np.random.randint(1, 10, size=10000)})

I want to determine the index of the first even number in the column. One way to do this is:

df[df.n % 2 == 0].iloc[0]

but this involves a lot of operations (generate the boolean mask df.n % 2 == 0, index df with it, and finally take the first item) and is very slow. A loop like this is much quicker:

for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break

also because the first match will probably be within the first few rows. Is there any pandas method for doing this with similar performance? Thank you.

NOTE: This condition (being an even number) is just an example. I'm looking for a solution that works for any kind of condition on the values, i.e., for a fast one-line alternative to:

df[ conditions on df.n ].iloc[0]
peter
  • Why don't you just use that loop? – R Nar Dec 01 '17 at 20:39
  • Is the column sorted? If so, you can try `np.searchsorted` (see the sketch after these comments). If not, other than pre-sorting I don't think there is any vectorized solution for this. – ayhan Dec 01 '17 at 20:39
  • @RNar: I'm learning pandas and I'd like to know how to do this within pandas. @ayhan: Yes, the column is sorted. But how can I specify complex conditions using `np.searchsorted`? For example, how to find the first even number? – peter Dec 01 '17 at 20:42
  • Yeah that wouldn't work on finding even numbers. I mean you can compute the modulo for the array but that's something you want to avoid. Here are some relevant discussions: [1](https://stackoverflow.com/questions/41320568/what-is-the-most-efficient-way-to-find-the-position-of-the-first-np-nan-value), [2](https://stackoverflow.com/questions/45771554/why-isnt-numpy-any-lazy-short-circuiting) – ayhan Dec 01 '17 at 20:48
  • If the condition is usually satisfied in the first few rows as you say, then you could do `df.iloc[:x][df.n.iloc[:x] > 3.5].iloc[0]` to only search the first x rows. If that misses, search the next x rows, etc. Depending on your data and choice of x that ought to be fast. Otherwise I would probably try the numba function in one of the answers linked to by ayhan – JohnE Dec 01 '17 at 21:17
  • I have edited the question to make it more clear, now some comments make no sense (sorry). @JohnE Yes that would be fast, but I'd like to avoid any loop in my code, if possible... – peter Dec 01 '17 at 21:21
  • At the end of the day `conditions on df.n` is a _very_ broad ask and there are different operations based on exactly what the condition is. Regardless, it's going to be tough to get away from an elementwise comparison over the Series/column. The `.iloc[0]` or whatever else you're tacking on the end isn't the expensive part. – Brad Solomon Dec 01 '17 at 22:04
  • I think you must do either: 1) use vector logic that computes entire columns (in which case JohnE's comment is useful), or 2) loop through the first rows until your `condition(row)` is met. – Thomas Fauskanger Dec 01 '17 at 22:35
  • Something to keep in mind when looping: when accessing a **single** value of the dataframe it is better to use `at` and `iat` instead of `loc` and `iloc`. Source: [Different ways to iterate over rows in a Pandas Dataframe — performance comparison](https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe). – rocarvaj Feb 22 '19 at 18:52
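
For the sorted case ayhan mentions, a minimal np.searchsorted sketch (my illustration, not from the thread; it handles threshold-style conditions on a sorted column via binary search, not conditions like evenness):

import numpy as np
import pandas as pd

df_sorted = pd.DataFrame({'n': np.arange(10000)})  # 'n' sorted ascending

# O(log n) binary search for the first position with n >= 5000:
pos = df_sorted.n.to_numpy().searchsorted(5000, side='left')
first_row = df_sorted.iloc[pos]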

5 Answers

12

I decided for fun to play with a few possibilities. I take a dataframe:

MAX = 10**7
df = pd.DataFrame({'n': range(MAX)})

(not random this time.) I want to find the first row for which n >= N for some value of N. I have timed the following four versions:

def getfirst_pandas(condition, df):
    # Boolean mask + indexing: evaluates the condition on every row.
    return df[condition(df)].iloc[0]

def getfirst_iterrows_loop(condition, df):
    # Python-level loop over rows; returns at the first match.
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None

def getfirst_for_loop(condition, df):
    # Python-level loop over positions; breaks at the first match.
    for j in range(len(df)):
        if condition(df.iloc[j]):
            break
    return j

def getfirst_numpy_argmax(condition, df):
    # Vectorized: evaluates the condition on the whole array; argmax
    # gives the first True (or 0 if no element satisfies the condition).
    array = df.to_numpy()  # .as_matrix() was removed in pandas 1.0
    imax  = np.argmax(condition(array))
    return df.index[imax]

with N equal to powers of ten. Of course the NumPy (optimized C) code is expected to be faster than for loops in Python, but I wanted to see for which values of N Python loops are still okay.

I timed the lines:

getfirst_pandas(lambda x: x.n >= N, df)
getfirst_iterrows_loop(lambda x: x.n >= N, df)
getfirst_for_loop(lambda x: x.n >= N, df)
getfirst_numpy_argmax(lambda x: x >= N, df.n)

for N = 1, 10, 100, 1000, .... This is the log-log graph of the performance:

[log-log plot: running time of the four functions vs. N; the loop-based versions grow with N, while the pandas and argmax versions stay flat]

The simple for loop is fine as long as the first True position is expected to be near the beginning, but it degrades badly as the first match moves later. np.argmax is the safest solution.

As you can see from the graph, the times for pandas and argmax remain (almost) constant, because they always scan the whole array. It would be perfect to have a NumPy or pandas method which doesn't.
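
One way to approximate that short-circuiting is to evaluate the condition in chunks and stop at the first chunk containing a match, in the spirit of JohnE's comment above. This is my own sketch (the function name and chunk size are illustrative), not a built-in pandas/NumPy method:

def getfirst_chunked(condition, s, chunk=1024):
    # Evaluate the vectorized condition chunk by chunk so the scan
    # can stop early instead of touching the whole array.
    values = s.to_numpy()
    for start in range(0, len(values), chunk):
        hits = np.flatnonzero(condition(values[start:start + chunk]))
        if hits.size:
            return s.index[start + hits[0]]
    return None  # no row satisfies the condition

# e.g. getfirst_chunked(lambda a: a >= N, df.n)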

peter
  • At least someone mentioned that the complexity of the for loop depends on where the expected result lies... and yes, the OP's solution is not the quickest in general, as claimed by the most upvoted answer ... – user59271 Apr 24 '20 at 06:16
6

I did some timings, and yes, using a generator will normally give you quicker results:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 10000)})

%timeit df[df.n % 2 == 0].iloc[0]
%timeit df.iloc[next(k for k,v in df.iterrows() if v.n % 2 == 0)]
%timeit df.iloc[next(t[0] for t in df.itertuples() if t.n % 2 == 0)]

I get (the last timing is for your original loop, shown at the end of this answer):

1000 loops, best of 3: 1.09 ms per loop
1000 loops, best of 3: 619 µs per loop # <-- iterrows generator
1000 loops, best of 3: 1.1 ms per loop
10000 loops, best of 3: 25 µs per loop # <--- your solution

However when you size it up:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 1000000)})

The differences disappear:

10 loops, best of 3: 40.5 ms per loop 
10 loops, best of 3: 40.7 ms per loop # <--- iterrows
10 loops, best of 3: 56.9 ms per loop

Your solution is quickest, so why not use it?

for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break
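
If you do adopt the loop, a small reusable wrapper (my sketch, not part of the original answer) keeps the early exit while hiding the loop behind a function:

def first_position(values, pred):
    # Linear scan with early exit: returns the first position whose
    # value satisfies pred, or None if nothing matches.
    for i, v in enumerate(values):
        if pred(v):
            return i
    return None

# e.g. first_position(df.n.to_numpy(), lambda v: v % 2 == 0)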
Anton vBR
  • I agree. I would expect that breaking out of a loop upon hitting the target row, and thus skipping the rows below, would cut off more time than finding the fastest way to iterate _all_ rows. (Especially on large dataframes) – Thomas Fauskanger Dec 01 '17 at 22:59
  • Thank you Anton, I think I will eventually accept to write a loop in my code, which you show to be the quickest option – peter Dec 02 '17 at 10:21
  • I think your comparison is not fair, because with your oneliners, you are accessing the row of the dataframe for which `n % 2 == 0`, whereas for the for loop, you are not doing that. For a fair comparison, you could either add `df.iloc[j]` to the three lines of code or remove the `df.iloc` around the `next` statements. – EdG Aug 07 '19 at 05:08
1

An option that lets you iterate over rows and stop as soon as you're satisfied is DataFrame.iterrows, pandas' row iterator.

In this case you could implement it something like this:

def get_first_row_with(condition, df):
    for index, row in df.iterrows():
        if condition(row):
            return index, row
    return None # Condition not met on any row in entire DataFrame

Then, given a DataFrame, e.g.:

df = pd.DataFrame({
                    'cats': [1,2,3,4], 
                    'dogs': [2,4,6,8]
                  }, 
                  index=['Alice', 'Bob', 'Charlie', 'Eve'])

which you can use as follows:

def some_condition(row):
    return row.cats + row.dogs >= 7

index, row = get_first_row_with(some_condition, df)

# Use results however you like, e.g.:
print('{} is the first person to have at least 7 pets.'.format(index))
print('They have {} cats and {} dogs!'.format(row.cats, row.dogs))

Which would output:

Charlie is the first person to have at least 7 pets.
They have 3 cats and 6 dogs!
Thomas Fauskanger
  • Thank you Thomas, I like this solution from the style point of view. If I find no alternatives to a for loop I'll accept your answer soon – peter Dec 02 '17 at 10:27
  • I have tested this for loop against the original pandas version, it seems to have a similar performance if the condition is met at the beginning of the array, and then it becomes less efficient (graph in my answer) – peter Dec 04 '17 at 11:38
1

Zip the index and the column together, then loop over the zipped pairs for faster loop speed. zip provides the fastest looping performance here, faster than iterrows() or itertuples().

for idx, value in zip(df.index, df.n):
    if value % 2 == 0:
        index_position = idx  # index label of the first even value
        break
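
If you want to verify that claim on your own data, here is a quick timeit sketch (my addition; absolute numbers will vary with machine and data):

import timeit

setup = ("import numpy as np, pandas as pd; "
         "df = pd.DataFrame({'n': np.random.randint(1, 10, size=10000)})")

# zip over index and column values
print(timeit.timeit("next(i for i, v in zip(df.index, df.n) if v % 2 == 0)",
                    setup, number=1000))
# itertuples for comparison
print(timeit.timeit("next(t.Index for t in df.itertuples() if t.n % 2 == 0)",
                    setup, number=1000))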
ajsp
0

TLDR: You can use next(j for j in range(len(df)) if df.at[j, "n"] % 2 == 0)


I think it is perfectly possible to write your code as a one-liner. Let's define a DataFrame to demonstrate:

df = pd.DataFrame({'n': np.random.randint(1, 10, size = 100000)})

First, your code gives:

for j in range(len(df)):
    if df.n.iloc[j] % 2 == 0:
        break
% 22.1 µs ± 1.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Converting that to a one-liner gives:

next(j for j in range(len(df)) if df["n"].iloc[j] % 2 == 0)
% 20.6 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

To further speed up the calculation, we can use `at` instead of `iloc`, as it is faster when accessing single values:

next(j for j in range(len(df)) if df.at[j, "n"] % 2 == 0)
% 8.88 µs ± 617 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
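
Note that df.at[j, "n"] relies on the default RangeIndex, where label j equals position j. If the DataFrame has a different index, a positional variant with iat (my sketch) avoids that assumption:

col = df.columns.get_loc("n")  # positional index of column 'n'
next(j for j in range(len(df)) if df.iat[j, col] % 2 == 0)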
EdG