
Here's the question I'm trying to answer.

Suppose I have a Pandas DataFrame:

      col_name
1    [16, 4, 30]   
2    [5, 1, 2]   
3    [4, 5, 52, 888]
4    [1, 2, 4]
5    [5, 99, 4, 75, 1, 2]

I would like to remove, across the whole column, all the elements that appear fewer than x times; for example, let's take x = 3.

That means I would like the result to look like:

      col_name
1    [4]   
2    [5, 1, 2]   
3    [4, 5]
4    [1, 2, 4]
5    [5, 4, 1, 2]

For the sake of convenience, here's the data.

import pandas as pd
import numpy as np

d = {'col_name': {1: [16, 4, 30],
      2: [5, 1, 2],
      3: [4, 5, 52, 888],
      4: [1, 2, 4],
      5: [5, 99, 4, 75, 1, 2]}}

df = pd.DataFrame(d)

Current approach:

from collections import Counter
c = Counter(pd.Series(np.concatenate(df.col_name.tolist())))

def foo(array):
    # keep only values that occur at least 3 times across the whole column
    return [x for x in array if c[x] >= 3]

df.col_name = df.col_name.apply(foo)
df

       col_name
1           [4]
2     [5, 1, 2]
3        [4, 5]
4     [1, 2, 4]
5  [5, 4, 1, 2]

This works, but it is slow, so I thought I would use np.vectorize to speed it up:

v = np.vectorize(foo)
df.col_name = v(df.col_name)   # <---- error thrown here

And I get this error:

/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/numpy/lib/function_base.py in _vectorize_call(self, func, args)
   2811 
   2812             if ufunc.nout == 1:
-> 2813                 res = array(outputs, copy=False, subok=True, dtype=otypes[0])
   2814             else:
   2815                 res = tuple([array(x, copy=False, subok=True, dtype=t)

ValueError: setting an array element with a sequence.

It seems I have a misunderstanding of how np.vectorize works. What am I doing wrong, and how can I get this solution to work with np.vectorize, if at all?

To clarify, I'm not looking for a workaround, just a little help understanding why I'm getting this error.

cs95
  • `np.vectorize` never speeds things up. Internally it is just a `for` loop. – Nils Werner Sep 13 '17 at 09:13
  • @NilsWerner https://stackoverflow.com/a/46163829/4909087 – cs95 Sep 13 '17 at 09:14
  • In my opinion the problem is the different lengths of the arrays... numpy likes same lengths and same types - then a vectorized function works perfectly. Maybe some numpy guy will correct me... – jezrael Sep 13 '17 at 09:19
  • Downvoter, do you have a problem with this question? – cs95 Sep 13 '17 at 09:19
  • @jezrael I tried this on lists of 1 element and it gives the same error. – cs95 Sep 13 '17 at 09:20
  • Then I'm not sure, but I think numpy is slow and problematic to vectorize when the arrays don't all have the same length. Maybe the problem is that the outputs have different lengths... If they had the same length, it should work. – jezrael Sep 13 '17 at 09:25
  • Usually `vectorize` is compared with a numpy iteration, not a pandas `apply`. – hpaulj Sep 13 '17 at 14:49

2 Answers


You need to specify the output data type with otypes=[list/object/np.ndarray/etc.] in np.vectorize:

In [2767]: def foo(array):
      ...:     return [x  for x in array if c[x] >= 3]

In [2768]: v = np.vectorize(foo, otypes=[list])

In [2769]: v(df.col_name)
Out[2769]: array([[4], [5, 1, 2], [4, 5], [1, 2, 4], [5, 4, 1, 2]], dtype=object)

In [2770]: df.assign(new_wack=v(df.col_name))
Out[2770]:
               col_name      new_wack
1           [16, 4, 30]           [4]
2             [5, 1, 2]     [5, 1, 2]
3       [4, 5, 52, 888]        [4, 5]
4             [1, 2, 4]     [1, 2, 4]
5  [5, 99, 4, 75, 1, 2]  [5, 4, 1, 2]

From the docs,

If otypes is not specified, then a call to the function with the first argument will be used to determine the number of outputs. The results of this call will be cached if cache is True to prevent calling the function twice.
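
To make the quoted behaviour concrete, here is a small self-contained sketch (a toy function of my own, not the question's foo): with list outputs of varying length, letting np.vectorize infer the dtype from the first call fails, while declaring otypes=[object] works.

import numpy as np

# toy function returning lists of varying length (stand-in for foo)
def keep_even(lst):
    return [x for x in lst if x % 2 == 0]

data = np.empty(3, dtype=object)
data[0], data[1], data[2] = [16, 4, 30], [5, 1, 2], [4, 5, 52, 888]

# no otypes: the dtype is inferred from the first result, [16, 4, 30],
# which looks like plain integers, so storing the list results later fails:
# np.vectorize(keep_even)(data)  -> ValueError: setting an array element with a sequence.

# otypes declared: every result is stored as a Python object
result = np.vectorize(keep_even, otypes=[object])(data)
# -> array([list([16, 4, 30]), list([2]), list([4, 52, 888])], dtype=object)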

Zero
  • Brilliant! The fix was so simple. – cs95 Sep 13 '17 at 09:50
  • According to the documentation otypes is optional and is determined in the first function call if not specified. Can you tell us why np.vectorize threw the error even when @coldspeed tried it with lists of one element? – P.Tillmann Sep 13 '17 at 09:53
  • @P.Tillmann, the first call returns a single element list which is probably seen as integer dtype. Note that returned dtype is object, not list. – hpaulj Sep 13 '17 at 14:44
  • John, I kinda feel like I need to accept hpaulj's answer instead. I generally dislike changing answers but this time... I really need to... sorry! – cs95 Sep 13 '17 at 18:38

With your dataframe and function:

In [70]: df
Out[70]: 
               col_name
1           [16, 4, 30]
2             [5, 1, 2]
3       [4, 5, 52, 888]
4             [1, 2, 4]
5  [5, 99, 4, 75, 1, 2]

In [71]: df.values     # values is an object array
Out[71]: 
array([[list([16, 4, 30])],
       [list([5, 1, 2])],
       [list([4, 5, 52, 888])],
       [list([1, 2, 4])],
       [list([5, 99, 4, 75, 1, 2])]], dtype=object)

Using apply, but returning a series, rather than modifying df:

In [73]: df.col_name.apply(foo)
Out[73]: 
1             [4]
2       [5, 1, 2]
3          [4, 5]
4       [1, 2, 4]
5    [5, 4, 1, 2]
Name: col_name, dtype: object
In [74]: timeit df.col_name.apply(foo)
214 µs ± 912 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For comparison, apply foo to the original dictionary, d:

In [76]: {i:foo(d['col_name'][i]) for i in range(1,6)}
Out[76]: {1: [4], 2: [5, 1, 2], 3: [4, 5], 4: [1, 2, 4], 5: [5, 4, 1, 2]}
In [77]: timeit {i:foo(d['col_name'][i]) for i in range(1,6)}
18.3 µs ± 39.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Note that this is faster than just extracting the list from the dataframe.

In [84]: timeit df.col_name.tolist()
25.3 µs ± 92 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

foo applied to the list, as opposed to the dictionary, takes about the same time:

In [85]: dlist=df.col_name.tolist()
In [86]: timeit [foo(x) for x in dlist]
16.6 µs ± 27.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Defining an object-dtype vectorize function:

In [87]: f = np.vectorize(foo, otypes=[object])
In [88]: f(dlist)
Out[88]: 
array([list([4]), list([5, 1, 2]), list([4, 5]), list([1, 2, 4]),
       list([5, 4, 1, 2])], dtype=object)
In [89]: timeit f(dlist)
36.7 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This is slower than the direct iteration. Preconverting the list to an object array (darr=np.array(dlist)) just saves a µs or two.
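
(The darr used below is just that preconversion, made explicit here; note that newer NumPy versions require dtype=object for ragged lists like these.)

darr = np.array(dlist, dtype=object)   # 1-d object array holding the 5 lists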

Since we are returning an object array, we might as well use frompyfunc (which vectorize uses):

In [94]: ff = np.frompyfunc(foo, 1,1)
In [95]: ff(darr)
Out[95]: 
array([list([4]), list([5, 1, 2]), list([4, 5]), list([1, 2, 4]),
       list([5, 4, 1, 2])], dtype=object)
In [96]: timeit ff(darr)
18 µs ± 6.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

I've tested cases where frompyfunc is up to 2x faster than the direct iteration. That might be the case here with a much bigger test array.

Among numpy users np.vectorize has a reputation for being slow, and often tricky to use (esp. if otypes is omitted). Its apparent speed here is relative to pandas apply, which appears to have a lot of overhead compared to array applications.

Given pandas' propensity to work with object-dtype arrays, frompyfunc might be a better tool than np.vectorize.
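
As a rough, self-contained sketch (my own wiring, not part of the original answer) of how the frompyfunc route could be applied back to the question's DataFrame:

import numpy as np
import pandas as pd
from collections import Counter

d = {'col_name': {1: [16, 4, 30], 2: [5, 1, 2], 3: [4, 5, 52, 888],
                  4: [1, 2, 4], 5: [5, 99, 4, 75, 1, 2]}}
df = pd.DataFrame(d)

# count every value across the whole column, then keep only the frequent ones
c = Counter(np.concatenate(df.col_name.tolist()))
def foo(array):
    return [x for x in array if c[x] >= 3]

ff = np.frompyfunc(foo, 1, 1)               # 1 input, 1 output, object dtype
df['col_name'] = ff(df.col_name.values)     # element-wise over the object array
print(df)
#        col_name
# 1           [4]
# 2     [5, 1, 2]
# 3        [4, 5]
# 4     [1, 2, 4]
# 5  [5, 4, 1, 2]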


As to why the plain vectorize raises the error, I suspect it has to do with how it chooses the implied otypes.

In [106]: f1 = np.vectorize(foo)
In [107]: f(darr[[0,0,0]])
Out[107]: array([list([4]), list([4]), list([4])], dtype=object)
In [108]: f1(darr[[0,0,0]])
...
ValueError: setting an array element with a sequence.

We'd have to dig into the vectorize code, but I suspect it deduces from the first [4] result that the return type should be an integer. But the actual calls return a list. Even a 1 element list won't fit in an integer slot.
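
A minimal sketch of that failing step (my reading of the traceback above, not a verbatim excerpt of the vectorize source):

import numpy as np

# the per-element results come back as a 1-d object array of lists ...
outputs = np.empty(3, dtype=object)
outputs[0], outputs[1], outputs[2] = [4], [4], [4]

# ... and line 2813 of the traceback then casts them to the deduced otype
# ('l', i.e. integer); a list, even a 1-element one, cannot be stored in an
# integer slot
np.array(outputs, copy=False, subok=True, dtype=int)
# ValueError: setting an array element with a sequence.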

Testing the method vectorize uses to determine otypes:

In [126]: f1._get_ufunc_and_otypes(foo,[darr])
Out[126]: (<ufunc '? (vectorized)'>, 'l')

_get_ufunc_and_otypes calculates outputs from the first element of the input array(s), and then does

        if isinstance(outputs, tuple):
            nout = len(outputs)
        else:
            nout = 1
            outputs = (outputs,)

        otypes = ''.join([asarray(outputs[_k]).dtype.char
                          for _k in range(nout)])

In your case outputs is [4], a list, so it sets nout to 1 and deduces otypes from that first result. The same thing happens if [5, 1, 2] comes first.

This automatic otypes inference most often bites users who want a float result when the first call returns an integer such as 0; they then get unexpected truncation.
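
A quick illustration (a toy function of my own, not from the original answer):

import numpy as np

# the first call returns the integer 0, so the inferred otype is integer,
# and every later float result is silently truncated
f = np.vectorize(lambda x: x - 1 if x >= 1 else 0)
print(f(np.array([0.5, 1.75, 2.5])))   # [0 0 1] rather than [0.0, 0.75, 1.5]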

That method has a test for tuple outputs. Let's test that:

First, a version of foo that returns a tuple instead of a list:

In [162]: foot = lambda x: tuple(foo(x))
In [163]: [foot(x) for x in darr]
Out[163]: [(4,), (5, 1, 2), (4, 5), (1, 2, 4), (5, 4, 1, 2)]
In [164]: ft = np.vectorize(foot)

Same error when applied to the whole darr:

In [165]: ft(darr)
...
ValueError: setting an array element with a sequence.

but when applied to a subset of darr whose elements all return 3 values, I get a tuple of arrays:

In [167]: ft(darr[[1,3,1,3]])
Out[167]: (array([5, 1, 5, 1]), array([1, 2, 1, 2]), array([2, 4, 2, 4]))

This doesn't help with the original problem, but does illustrate the power, or complications, of using np.vectorize.

hpaulj