
I'm trying to get more comfortable with various ways of using Pandas, and I'm struggling to understand why Map, Apply, and Vectorization are relatively interchangeable with functions that return non-booleans, but Apply and Vectorization sometimes fail when the function being applied returns a boolean. This question will focus on Apply.

Specifically, I wrote this very simple piece of code to illustrate the problem:

import numpy as np
import pandas as pd

# make dataframe
x = range(1000)
df = pd.DataFrame(data = x, columns = ['Number']) 

# simple function to test if a number is a prime number
def is_prime(num):
    if num < 2:
        return False
    elif num == 2: 
        return True
    else: 
        for i in range(2,num):
            if num % i == 0:
                return False
    return True

# test if every number in the dataframe is prime using Map
df['map prime'] = list(map(is_prime, df['Number']))
df.head()

This gives the output I'd expect.

So here's where I no longer understand what's going on: when I try to use apply, I get a ValueError.

in: df['apply prime'] = df.apply(func = is_prime, args = df['Number'], axis=1)
out: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

What am I missing?

Thank you!

p.s. I know there are more efficient ways to test for primes. I purposefully wrote an inefficient function so I could test how much faster apply and vectorization really were than map, but then I ran into this challenge. Thank you.

BLimitless

1 Answer


So here's where I no longer understand what's going on: when I try to use apply, I get a ValueError.

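With df.apply(..., axis=1), each row is passed to the function as a pd.Series, so is_prime receives a Series rather than a single number, and a comparison such as num < 2 cannot be reduced to one True or False. Note also that args is meant for extra positional arguments to the function, not for the data to iterate over.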

Applying the function to the column instead, i.e. df['apply prime'] = df['Number'].apply(is_prime), should work.
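As a minimal sketch of that difference, the snippet below records what the function receives under each call style and checks that Series.apply agrees with list(map(...)). The names `received` and `record_type` are illustrative helpers, not part of the question, and the DataFrame is shortened to 20 rows to keep it quick.

```python
import pandas as pd


def is_prime(num):
    """Trial-division primality test, equivalent to the one in the question."""
    if num < 2:
        return False
    for i in range(2, num):
        if num % i == 0:
            return False
    return True


df = pd.DataFrame({'Number': range(20)})

# Hypothetical helper: remember the type of whatever apply hands to the function.
received = []


def record_type(value):
    received.append(type(value).__name__)
    return True


# Row-wise: with axis=1 every call receives a whole row as a pd.Series.
df.head(3).apply(record_type, axis=1)
row_types = set(received)

# Element-wise: Series.apply hands over one value at a time.
received.clear()
df['Number'].head(3).apply(record_type)
value_types = set(received)

# Because is_prime now sees scalars, Series.apply matches list(map(...)).
via_apply = df['Number'].apply(is_prime).tolist()
via_map = list(map(is_prime, df['Number']))

print(row_types, value_types, via_apply == via_map)
```

If the function really has to be used with df.apply(..., axis=1), it needs to pull the scalar out of the row itself, for example with a lambda such as lambda row: is_prime(row['Number']).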

Given that apply is ostensibly faster than map, and vectorization faster still.

In addition, pd.DataFrame.apply(...) does not use any kind of vectorization; it simply loops over the rows (or columns) and calls the function once per item, possibly from Cython, so I believe map(...) should be at least as fast.
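One way to check this on your own machine is a small timeit comparison. This is only a rough sketch: the `candidates` dictionary and the choice of 300 rows and 5 repetitions are arbitrary assumptions, and the numbers will vary with hardware and pandas version, so no expected timings are shown.

```python
import timeit

import pandas as pd


def is_prime(num):
    """Deliberately slow trial-division test, as in the question."""
    if num < 2:
        return False
    for i in range(2, num):
        if num % i == 0:
            return False
    return True


df = pd.DataFrame({'Number': range(300)})

# Three ways of calling the same Python function once per value.
candidates = {
    'list(map)': lambda: list(map(is_prime, df['Number'])),
    'Series.apply': lambda: df['Number'].apply(is_prime).tolist(),
    'DataFrame.apply(axis=1)': lambda: df.apply(
        lambda row: is_prime(row['Number']), axis=1
    ).tolist(),
}

# Compute each result once so the approaches can be compared for correctness.
results = {name: fn() for name, fn in candidates.items()}

# Time 5 repetitions of each approach.
timings = {name: timeit.timeit(fn, number=5) for name, fn in candidates.items()}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f'{name:<25} {seconds:.4f} s for 5 runs')
```

All three approaches still call is_prime in a Python-level loop, so none of them is vectorized in the NumPy sense; the differences come from per-call overhead.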


Update

Note that .apply(...) passes the values along the given axis to the function and returns whatever the function returns, which can be any data type. For a pd.DataFrame, each call receives a pd.Series with multiple keys.

Suppose df.shape == (1000, 4). If we move along axis=1 (i.e. across the df.shape[1] columns), the function is called 1000 times, and each call receives a pd.Series of shape (4,). You can use the column keys inside the function itself, or pass them in as extra arguments via pd.DataFrame.apply(..., args=(...)).


import numpy as np
import pandas as pd

x = np.random.randn(1000, 4)
df = pd.DataFrame(data=x, columns=['a', 'b', 'c', 'd'])

print(df.shape)

df.head()

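def func(x, key1, key2):
    # x is one row (a pd.Series); key1 and key2 are column labels passed via args
    # print(x.shape)
    return bool(x[key1] > x[key2])

# call func once per row, comparing column 'a' with column 'b'
df.apply(func, axis=1, args=('a', 'b'))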
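For comparison, here is a hedged sketch of three equivalent ways to run a two-column test: the lambda form suggested in the comments, list(map(...)) over both columns, and a plain column-wise comparison. The function name `greater` and the seeded random generator are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (an assumption, not in the answer).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 4)), columns=['a', 'b', 'c', 'd'])


def greater(left, right):
    """Generic two-argument test that knows nothing about pandas."""
    return left > right


# Row-wise apply: the lambda unpacks the two scalars from each row.
row_wise = df.apply(lambda row: greater(row['a'], row['b']), axis=1)

# map over both columns in parallel, one pair of scalars per call.
mapped = pd.Series(list(map(greater, df['a'], df['b'])), index=df.index)

# Vectorized: the comparison runs on whole columns at once.
vectorized = df['a'] > df['b']

print((row_wise == vectorized).all(), (mapped == vectorized).all())
```

When the test can be written with operators that pandas already supports, as here, the column-wise form avoids calling a Python function per row altogether.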
4.Pi.n
  • That did work, thank you! But now I have a second question: what if I want to apply a function that takes two variables as inputs and returns a boolean (e.g. if x > y return true)? How do I pass both? I tried calling df['Num1', 'Num2'].apply(...), but that threw a key error even though the keys were correct. Map worked when I passed the keys in as a tuple. Thoughts on how to use apply with a function that takes multiple inputs? – BLimitless Feb 09 '21 at 05:34
  • @BLimitless, you can refer to [this answer](https://stackoverflow.com/questions/13331698/how-to-apply-a-function-to-two-columns-of-pandas-dataframe/52854800#52854800) of the post [How to apply a function to two columns of Pandas dataframe](https://stackoverflow.com/questions/13331698/how-to-apply-a-function-to-two-columns-of-pandas-dataframe) – SeaBean Feb 09 '21 at 06:22
  • @BLimitless, note also that your use of list(map(...)) is actually faster than apply(...axis=1). You can refer to the timing comparison in [this answer](https://stackoverflow.com/a/46923192/15070697) of the same post I suggested above. – SeaBean Feb 09 '21 at 06:25
  • @4.Pi.n, I think your sample code can be simplified as df.apply(lambda x: func(x['a'], x['b']), axis=1) and define `def func(key1, key2)` and `if key1 > key2` so that this function can be more generic and be used in scope other than pandas. – SeaBean Feb 09 '21 at 06:41
  • @BLimitless, see also [my answer](https://stackoverflow.com/a/66034661/15070697) in a previous post with a comparison of list(map(...)) and apply(..., axis=1), and [this one](https://stackoverflow.com/a/66062197/15070697) as well. Hence, I suggest sticking with list(map(...)) instead of apply(..., axis=1). – SeaBean Feb 09 '21 at 07:10