I have a very large DataFrame in which one column contains numbers and another contains text. I want to create a third column from the number column and the text column using a complex custom function, in the most efficient way.
According to this source, the most efficient way is using NumPy vectorization.
(Below is a simplified example to clarify what I tried and where I am stuck. The actual custom function is quite complex, but it does take numerical and text columns as input. With this simplified code I want to understand how to apply functions that take strings as input to entire columns.)
This works flawlessly, so far so good:
import pandas as pd

def fun_test1(no1, no2):
    res = no1 + no2
    return res

Test1 = pd.DataFrame({'no1': [1, 2, 3],
                      'no2': [1, 2, 3]})
Test1['result'] = fun_test1(Test1['no1'].values, Test1['no2'].values)
   no1  no2  result
0    1    1       2
1    2    2       4
2    3    3       6
This however does not work and this is where I am stuck:
def fun_test2(no1, text):
    if text == 'one':
        no2 = 1
    elif text == 'two':
        no2 = 2
    elif text == 'three':
        no2 = 3
    res = no1 + no2
    return res

Test2 = pd.DataFrame({'no1': [1, 2, 3],
                      'text': ['one', 'two', 'three']})
Test2['result'] = fun_test2(Test2['no1'].values, Test2['text'].values)
ValueError                                Traceback (most recent call last)
<ipython-input-30-a8f100d7d4bd> in <module>()
----> 1 Test2['result'] = fun_test2(Test2['no1'].values, Test2['text'].values)

<ipython-input-27-8347aa91d765> in fun_test2(no1, text)
      1 def fun_test2(no1, text):
----> 2     if text == 'one':
      3         no2 = 1
      4     elif text == 'two':
      5         no2 = 2

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
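As far as I can tell, the error comes from the `if` statement rather than from the strings themselves: comparing an array to `'one'` is evaluated elementwise and yields one boolean per row, and `if` cannot reduce that to a single True/False. A minimal sketch that reproduces this, where `text` is a hypothetical stand-in for `Test2['text'].values`:

```python
import numpy as np

# Hypothetical stand-in for Test2['text'].values (object-dtype array of strings)
text = np.array(['one', 'two', 'three'], dtype=object)

# Elementwise comparison: one boolean per row, not a single True/False
mask = text == 'one'
print(mask)

error_message = None
try:
    if mask:  # `if` asks for the truth value of the whole array
        pass
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

So the comparison with strings seems to work on the array; it is the scalar-style `if`/`elif` branching that does not.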
I have tried more variations but ultimately I cannot get NumPy vectorization to work with string inputs.
What am I doing wrong?
If NumPy vectorization does not work with strings, what would be the next most efficient method?
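One direction that might work is replacing the `if`/`elif` chain with `np.select`, which takes a list of boolean masks and a list of values. This is only a sketch under assumptions: `fun_test2_vec` is a hypothetical name, the `default=0` for rows matching no condition is my own choice, and I have not checked how well this scales to the real, more complex function.

```python
import numpy as np
import pandas as pd

Test2 = pd.DataFrame({'no1': [1, 2, 3],
                      'text': ['one', 'two', 'three']})

def fun_test2_vec(no1, text):
    # One boolean mask per branch of the original if/elif chain
    conditions = [text == 'one', text == 'two', text == 'three']
    choices = [1, 2, 3]
    # Per row, take the choice of the first condition that is True;
    # rows matching no condition get the default (0 is an assumption)
    no2 = np.select(conditions, choices, default=0)
    return no1 + no2

Test2['result'] = fun_test2_vec(Test2['no1'].values, Test2['text'].values)
print(Test2)
```

For a pure lookup like this simplified case, mapping the text column through a dictionary with `Test2['text'].map({'one': 1, 'two': 2, 'three': 3})` would presumably give the same `no2` values.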