
The following is reproducible and returns the desired results.

import pandas as pd, numpy as np
np.random.seed(3124)

x = 10 + np.random.rand(10)
y = np.split(10 + np.random.rand(100), 10)

x >= y
# array([[False,  True,  True, False, False, False, False,  True, False, True],
#        ...
#        [False,  True,  True,  True, False,  True, False,  True, False, False]])

np.apply_along_axis(np.greater_equal, 0, x, y)
# same results as x >= y.

However, if x and y from above were pulled out of a pandas DataFrame, I have to convert the pandas series of arrays to a list of arrays. This is very computationally expensive for a large series.

How would I complete this in a more efficient way?

df = pd.DataFrame({'x':x,'y':y})

df['x'].values >= df['y'].tolist()
# same results as above.

df['x'] >= df['y']
# ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

df['x'].values >= df['y'].values
# ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
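(For context, an aside not in the original post: `df['y'].values` is a 1D object array holding ten separate arrays rather than a 2D block, so each elementwise comparison yields a whole boolean array, which NumPy cannot reduce to a single True/False - hence the ValueError.)

df['y'].values.dtype, df['y'].values.shape
# (dtype('O'), (10,))

df['x'].values[0] >= df['y'].values[0]
# already a full 10-element boolean array for a single element pair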

Edit

@Divakar gave the correct answer to the question above. However, in my actual use case the arrays in y will all be different lengths.

Using y from above, I create y2, which is closer to my actual data. The following is reproducible.

y2 = [np.resize(a, r) for a,r in zip(y,np.random.randint(2, 10, 10))]
# yields something like:
# [array([10.1269906 , 10.34269353, 10.39461373, 10.022271  , 10.69316165, 10.83981557, 10.03328485, 10.56850597]), 
# array([10.99159117, 10.21215159, 10.65208435, 10.22483111, 10.13748229, 10.72621328]), 
# ...
# array([10.61071355, 10.62141997]), 
# array([10.3899659 , 10.66207985, 10.85937807]), 
# array([10.38374303, 10.93140162, 10.88535643, 10.51529231, 10.60723795, 10.60504599, 10.6773523 ]), 
# array([10.02775067, 10.91382588, 10.31222259, 10.44732757, 10.16980452, 10.88914854, 10.22677905])]

The following returns the results I want, but is not feasible for the size of my actual data frame. I would rather do it in a vectorized form with numpy.

[x[i] >= y2[i] for i in range(len(y2))]
# returns 
# [array([False, False, False, False, False, False, False, False]),
#  array([False,  True, False,  True,  True, False]),
#  ...
#  array([ True,  True]), 
#  array([ True, False, False]),
#  array([False, False, False, False, False, False, False]),
#  array([ True,  True,  True,  True,  True,  True,  True])]
  • Yeah, I think with the ragged sizes, you have to resort to some loopy solution. Can't see vectorization helping here. – Divakar Aug 01 '18 at 18:58
  • Is there a vectorized function that could pad all of the arrays with `np.nan` so they were all the same length, then use your solution? – Clay Aug 01 '18 at 19:38
  • 1
    Think you are after this - https://stackoverflow.com/questions/40569220/efficiently-convert-uneven-list-of-lists-to-minimal-containing-array-padded-with – Divakar Aug 01 '18 at 19:39
  • @Divakar the answer to the question you linked to in the comment above worked perfectly. Thank you. – Clay Aug 04 '18 at 21:00
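Following up on the comments: a minimal sketch of the NaN-padding idea, using the mask-based filling technique from the linked answer (the helper name `pad_to_2d` is illustrative, not from the thread):

import numpy as np

def pad_to_2d(arrs, fill=np.nan):
    # Pad a list of 1D arrays with `fill` so they form one rectangular 2D array.
    lens = np.array([len(a) for a in arrs])
    mask = lens[:, None] > np.arange(lens.max())  # True where real data lives
    out = np.full(mask.shape, fill)
    out[mask] = np.concatenate(arrs)              # row-major fill matches concatenation order
    return out, mask

Y2, mask = pad_to_2d(y2)
out = x[:, None] >= Y2                            # NaN pad slots compare as False
result = [row[m] for row, m in zip(out, mask)]    # recover the ragged per-row arrays

`result` should match the list comprehension `[x[i] >= y2[i] for i in range(len(y2))]` from the edit above.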

1 Answer


Get the underlying array data so that we have y as a 2D array (call it Y) and x as a 1D array (call it X). Then perform the comparison to leverage broadcasting, like so -

Y = np.concatenate(df.y.values).reshape(-1, len(df.y[0]))  # flatten the object array, reshape to (n, row_len)
X = df.x.values
out = X >= Y

Note that this would compare each entry in df.y against x.

If you meant to compare each entry in x against its corresponding entry (row) in df.y, extend X to 2D and then compare: out = X[:, None] >= Y.
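A quick sanity check (illustrative, not from the answer) confirming the two broadcasting patterns against explicit loops:

# X >= Y compares x, as a row, against every row of Y (like x >= y in the question)
assert np.array_equal(X >= Y, np.array([x >= yi for yi in Y]))

# X[:, None] >= Y compares each scalar x[i] against its own row y[i]
assert np.array_equal(X[:, None] >= Y, np.array([xi >= yi for xi, yi in zip(X, Y)]))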

  • Thanks @Divakar. You answered my question as I phrased it, so I accepted it as the answer. Though for my actual use case, each array in `y` is a different size. – Clay Aug 01 '18 at 14:32
  • something like `y2 = [np.resize(a, r) for a,r in zip(y,np.random.randint(2, 10, 10))]` – Clay Aug 01 '18 at 14:51
  • @Clay Can you add sample for such a case and the expected output in the question? – Divakar Aug 01 '18 at 16:18
  • @Divakar - more detail added. Thanks. – Clay Aug 01 '18 at 18:17