In Python/Pandas, what is the most efficient way, to apply a custom function, to a column of a dataframe, where the input includes strings?

Question

I have a very large Dataframe, where one column contains numbers and another contains text. I want to create a 3rd column, based on the number column and the text column and a complex custom function, in the most efficient way.

According to this source, the most efficient way is using NumPy vectorization.

(Below is simplified example code to clarify what I tried and where I am stuck. The actual custom function is quite complex, but does indeed take as input numerical columns and text columns. With this simplified code below I want to understand how to apply functions that take strings as input on entire columns)

This works flawlessly, so far so good:

def fun_test1(no1, no2):
    res = no1 + no2
    return res

Test1 = pd.DataFrame({'no1':[1, 2, 3],
                     'no2':[1, 2, 3]})

Test1['result'] = fun_test1(Test1['no1'].values, Test1['no2'].values)

    no1 no2 result
0   1   1   2
1   2   2   4
2   3   3   6

This however does not work and this is where I am stuck:

def fun_test2(no1, text):
    if text == 'one':
        no2 = 1
    elif text == 'two':
        no2 = 2
    elif text == 'three':
        no2 = 3
    res = no1 + no2
    return res

Test2 = pd.DataFrame({'no1':[1, 2, 3],
                      'text':['one', 'two', 'three']})

Test2['result'] = fun_test2(Test2['no1'].values, Test2['text'].values)

ValueError                                Traceback (most recent call last)
<ipython-input-30-a8f100d7d4bd> in <module>()
----> 1 Test2['result'] = fun_test2(Test2['no1'].values, Test2['text'].values)

<ipython-input-27-8347aa91d765> in fun_test2(no1, text)
      1 def fun_test2(no1, text):
----> 2     if text == 'one':
      3         no2 = 1
      4     elif text == 'two':
      5         no2 = 2

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I have tried more variations but ultimately I cannot get NumPy vectorization to work with string inputs.

What am I doing wrong?

If NumPy vectorization does not work with strings, what would be the next most efficient method?

Does this answer your question? [Pandas conditional creation of a series/dataframe column](https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column) — Henry Yik, Oct 26 '20 at 14:33
You'd "translate" text values(based on dictionary) to their numeral value, then apply summing columns. — 404pio, Oct 26 '20 at 14:37
as @HenryYik said, there are easier ways to do what you are trying to do. However, if you still want to use customer function you have to use `apply` & do it one row at a time. The reason for error is that the condition is check for the whole column & some rows may be`True` & some may be 'False` for the condition you have given, as the error says, you will have to say `any()` or 'all()` to clear this up. Just change the 2nd line of your function from `if text == 'one':` to `if (text == 'one').any():`. It will run fine but the result may not be what you want. — moys, Oct 26 '20 at 14:39
@HenryYik: it does not, because the answer is using standard functions. I need to use a rather complex custom function. The code I provided is just so simple for demonstration. — Usal, Oct 26 '20 at 14:47
The code is just simplified example code to clarify what I tried and where I am stuck. The actual custom function is quite complex, but does indeed take as input numerical columns and text columns. With this simplified code I want to understand how to apply functions that take strings as input on entire columns. It really is just about that: how to efficiently apply custom functions to entire columns of dataframes, if one or more inputs are strings. — Usal, Oct 26 '20 at 14:52

Aaj Kaal · Answer 1 · 2020-10-26T15:53:34.270

def fun_test2(no1, text, idx):
    if text[idx] == 'one':
        no2 = 1
    elif text[idx] == 'two':
        no2 = 2
    elif text[idx] == 'three':
        no2 = 3
    res = no1[idx] + no2
    return res

Test2 = pd.DataFrame({'no1':[1, 2, 3],
                      'text':['one', 'two', 'three']})

Test2['result'] = [fun_test2(Test2['no1'].values, Test2['text'].values, i) for i in range(Test2.shape[0])]

Ouput:

>>> Test2
   no1   text  result
0    1    one       2
1    2    two       4
2    3  three       6

OR back to traditional way with the same Output:

def fun_test2(no1, text):
    if text == 'one':
        no2 = 1
    elif text == 'two':
        no2 = 2
    elif text == 'three':
        no2 = 3
    res = no1 + no2
    return res

Test2 = pd.DataFrame({'no1':[1, 2, 3],
                      'text':['one', 'two', 'three']})

Test2['result'] = [fun_test2(Test2['no1'].values[i], Test2['text'].values[i]) for i in range(Test2.shape[0])]

Thx, but is this not very inefficient because it is a loop? – Usal Oct 26 '20 at 15:51 — Usal, Oct 26 '20 at 15:51

In Python/Pandas, what is the most efficient way, to apply a custom function, to a column of a dataframe, where the input includes strings?

1 Answers1