With your dataframe and function:
In [70]: df
Out[70]:
col_name
1 [16, 4, 30]
2 [5, 1, 2]
3 [4, 5, 52, 888]
4 [1, 2, 4]
5 [5, 99, 4, 75, 1, 2]
In [71]: df.values # values is an object array
Out[71]:
array([[list([16, 4, 30])],
[list([5, 1, 2])],
[list([4, 5, 52, 888])],
[list([1, 2, 4])],
[list([5, 99, 4, 75, 1, 2])]], dtype=object)
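For reference, here is a minimal setup that reproduces this session. The question's `foo` isn't shown; its definition below is an assumption, chosen because the outputs later in this answer are consistent with filtering out values greater than 5:

```python
import pandas as pd

# Assumed definition of foo: keep only elements <= 5
# (this matches the Out[] results shown below, but is a guess).
foo = lambda lst: [x for x in lst if x <= 5]

# The original dictionary d, reconstructed from the displayed frame.
d = {'col_name': {1: [16, 4, 30],
                  2: [5, 1, 2],
                  3: [4, 5, 52, 888],
                  4: [1, 2, 4],
                  5: [5, 99, 4, 75, 1, 2]}}
df = pd.DataFrame(d)
```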
Using `apply`, but returning a Series rather than modifying `df`:
In [73]: df.col_name.apply(foo)
Out[73]:
1 [4]
2 [5, 1, 2]
3 [4, 5]
4 [1, 2, 4]
5 [5, 4, 1, 2]
Name: col_name, dtype: object
In [74]: timeit df.col_name.apply(foo)
214 µs ± 912 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For comparison, apply `foo` to the original dictionary, `d`:
In [76]: {i:foo(d['col_name'][i]) for i in range(1,6)}
Out[76]: {1: [4], 2: [5, 1, 2], 3: [4, 5], 4: [1, 2, 4], 5: [5, 4, 1, 2]}
In [77]: timeit {i:foo(d['col_name'][i]) for i in range(1,6)}
18.3 µs ± 39.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Note that this is faster than even just extracting the lists from the dataframe:
In [84]: timeit df.col_name.tolist()
25.3 µs ± 92 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
`foo` applied to the list, as opposed to the dictionary, is about the same:
In [85]: dlist=df.col_name.tolist()
In [86]: timeit [foo(x) for x in dlist]
16.6 µs ± 27.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
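If the filtered lists need to go back into a Series (or a new column), the list comprehension can be wrapped directly, keeping the fast plain-Python loop while preserving the index. A sketch, assuming a `foo` that keeps elements less than or equal to 5:

```python
import pandas as pd

foo = lambda lst: [x for x in lst if x <= 5]   # assumed definition of foo
df = pd.DataFrame({'col_name': {1: [16, 4, 30], 2: [5, 1, 2]}})

# Loop in plain Python, then rebuild a Series with the original index,
# avoiding apply's per-row overhead.
out = pd.Series([foo(x) for x in df.col_name.tolist()], index=df.index)
```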
Defining a `vectorize` function with object output:
In [87]: f = np.vectorize(foo, otypes=[object])
In [88]: f(dlist)
Out[88]:
array([list([4]), list([5, 1, 2]), list([4, 5]), list([1, 2, 4]),
list([5, 4, 1, 2])], dtype=object)
In [89]: timeit f(dlist)
36.7 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This is slower than the direct iteration. Preconverting the list to an object array (`darr = np.array(dlist)`) only saves a µs or two.
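One caveat about building that `darr`: recent NumPy versions refuse to create an array from ragged nested lists unless the object dtype is explicit, so the conversion looks like this (a small sample of the data):

```python
import numpy as np

dlist = [[16, 4, 30], [5, 1, 2], [4, 5, 52, 888]]   # sample of the data

# Ragged sublists: modern NumPy requires dtype=object here,
# giving a 1-d object array whose elements are the lists.
darr = np.array(dlist, dtype=object)
```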
Since we are returning an object array, we might as well use `np.frompyfunc` (which `vectorize` uses under the hood):
In [94]: ff = np.frompyfunc(foo, 1,1)
In [95]: ff(darr)
Out[95]:
array([list([4]), list([5, 1, 2]), list([4, 5]), list([1, 2, 4]),
list([5, 4, 1, 2])], dtype=object)
In [96]: timeit ff(darr)
18 µs ± 6.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I've tested cases where `frompyfunc` is up to 2x faster than the direct iteration. That might be the case here with a much bigger test array.
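A quick way to check that claim on a larger input (the sizes here are arbitrary, and `foo` is an assumed reconstruction) is to time `frompyfunc` against the list comprehension directly:

```python
import timeit
import numpy as np

foo = lambda lst: [x for x in lst if x <= 5]        # assumed definition of foo

# Arbitrary larger test data; ragged lengths keep it a 1-d object array.
big = [[16, 4, 30], [5, 1, 2], [4, 5, 52, 888]] * 3000
bigarr = np.array(big, dtype=object)

ff = np.frompyfunc(foo, 1, 1)

t_loop = timeit.timeit(lambda: [foo(x) for x in big], number=10)
t_ff = timeit.timeit(lambda: ff(bigarr), number=10)
print(f'list comp: {t_loop:.4f}s, frompyfunc: {t_ff:.4f}s')
```

The two approaches produce identical results; which is faster will depend on the array size and the cost of `foo` itself.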
Among `numpy` users, `np.vectorize` has a reputation for being slow, and often tricky to use (especially if `otypes` is omitted). Its apparent speed here is relative to pandas `apply`, which carries a lot of overhead compared to direct array operations. Given pandas' propensity to work with object-dtype arrays, `frompyfunc` may be a better tool than `np.vectorize`.
As to why the plain `vectorize` raises the error, I suspect it has to do with how it chooses the implied `otypes`.
In [106]: f1 = np.vectorize(foo)
In [107]: f(darr[[0,0,0]])
Out[107]: array([list([4]), list([4]), list([4])], dtype=object)
In [108]: f1(darr[[0,0,0]])
...
ValueError: setting an array element with a sequence.
We'd have to dig into the `vectorize` code, but I suspect it deduces from the first `[4]` result that the return type should be an integer. The actual calls, however, return lists, and even a one-element list won't fit in an integer slot.
Testing the `vectorize` method that it uses to determine `otypes`:
In [126]: f1._get_ufunc_and_otypes(foo,[darr])
Out[126]: (<ufunc '? (vectorized)'>, 'l')
`_get_ufunc_and_otypes` calculates `outputs` from the first element of the input array(s), and then does:
    if isinstance(outputs, tuple):
        nout = len(outputs)
    else:
        nout = 1
        outputs = (outputs,)
    otypes = ''.join([asarray(outputs[_k]).dtype.char
                      for _k in range(nout)])
In your case `outputs` is `[4]`, a list rather than a tuple, so it sets `nout` to 1 and deduces `otypes` from that first result. The same thing happens if `[5, 1, 2]` comes first.
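The deduction can be checked directly: `asarray` of the first result has an integer dtype (the exact dtype character is platform-dependent, but the kind is always integer), and a multi-element list genuinely cannot be stored in an integer slot:

```python
import numpy as np

# otypes deduction: asarray of the first result, [4], is an integer array.
print(np.asarray([4]).dtype.kind)    # 'i'

# And a list really can't be stored in an integer slot:
a = np.zeros(3, dtype=int)
try:
    a[0] = [5, 1, 2]
except ValueError as e:
    print(e)    # setting an array element with a sequence
```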
This automatic `otypes` deduction most often bites users when they want a float result but the first call returns an integer such as 0; then they get unexpected truncation.
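A minimal illustration of that truncation, with a made-up function whose first result is the integer 0, so `vectorize` commits to an integer output and the later float results get truncated; specifying `otypes` avoids it:

```python
import numpy as np

# First element maps to the int 0, so the deduced otype is integer.
g = np.vectorize(lambda x: x ** 0.5 if x > 1 else 0)
out = g(np.array([0, 2, 9]))
print(out, out.dtype)    # [0 1 3] -- the square roots are truncated

# Explicit otypes gives the intended float results.
g2 = np.vectorize(lambda x: x ** 0.5 if x > 1 else 0, otypes=[float])
print(g2(np.array([0, 2, 9])))
```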
That method has a special case for when `outputs` is a tuple. Let's test that with a version of `foo` that returns a tuple instead of a list:
In [162]: foot = lambda x: tuple(foo(x))
In [163]: [foot(x) for x in darr]
Out[163]: [(4,), (5, 1, 2), (4, 5), (1, 2, 4), (5, 4, 1, 2)]
In [164]: ft = np.vectorize(foot)
Same error when applied to the whole `darr`:
In [165]: ft(darr)
...
ValueError: setting an array element with a sequence.
but when applied to a subset of `darr` whose elements all return 3 items, I get a tuple of arrays (the tuple is treated as multiple outputs):
In [167]: ft(darr[[1,3,1,3]])
Out[167]: (array([5, 1, 5, 1]), array([1, 2, 1, 2]), array([2, 4, 2, 4]))
This doesn't help with the original problem, but it does illustrate the power, and the complications, of using `np.vectorize`.
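For completeness, specifying `otypes` explicitly sidesteps the deduction for the tuple-returning version as well, so mixed-length tuples come back in a single object array. A sketch, with `foo` again an assumed reconstruction:

```python
import numpy as np

foo = lambda lst: [x for x in lst if x <= 5]   # assumed definition of foo
foot = lambda x: tuple(foo(x))
darr = np.array([[16, 4, 30], [5, 1, 2], [4, 5, 52, 888]], dtype=object)

# With explicit otypes, vectorize no longer inspects the first result
# to guess the output count or dtype.
ft2 = np.vectorize(foot, otypes=[object])
print(ft2(darr))    # one object array holding the tuples
```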