I've been trying to filter my Python data with Pandas and NumPy. There seems to be a difference between 'where' in Pandas (pd) and 'where' in NumPy (np). NumPy provides the np.where function, and pd DataFrames have a 'where' method. The np 'where' makes sense, but the pd 'where' doesn't (to me).

#[In]#
import numpy as np
from pandas import DataFrame
np.random.seed(1000) ; rv = DataFrame(np.random.randn(1000,2))
rv[:5]
#[Out]#
#           0         1
# 0 -0.804458  0.320932
# 1 -0.025483  0.644324
# 2 -0.300797  0.389475
# 3 -0.107437 -0.479983
# 4  0.595036 -0.464668

But when I try to assign rv2 from the result of a pd 'where' clause, I get:

rv2 = rv.where(rv>=0,1,-1)
type(rv2)
# NoneType
rv[:5]
#           0         1
# 0  1.000000  0.320932
# 1  1.000000  0.644324
# 2  1.000000  0.389475
# 3  1.000000  1.000000
# 4  0.595036  1.000000

So rv2 is NoneType and rv itself has changed values. It's even unclear to me how rv ends up with its new values, as they don't conform to the 'where' clause, as far as I can see.

However, if I use the np where clause instead of the dataframe 'where' clause, things work as expected (except I get a np array instead of a dataframe):

#[In]#
np.random.seed(1000) ; rv = DataFrame(np.random.randn(1000,2))
xy = np.where(rv>=0,1,-1)
xy[:5]
#[Out]#
# array([[-1,  1],
#        [-1,  1],
#        [-1,  1],
#        [-1, -1],
#        [ 1, -1]])
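
To get a DataFrame back rather than a bare array, one option is to wrap the np.where result again. This is only a minimal sketch; the name `signs` is made up for illustration, and it assumes the original index and column labels should be carried over:

```python
import numpy as np
import pandas as pd

np.random.seed(1000)
rv = pd.DataFrame(np.random.randn(1000, 2))

# np.where returns a plain ndarray, so the labels are reattached by hand.
# `signs` is a hypothetical name, not something from the pandas API.
signs = pd.DataFrame(np.where(rv >= 0, 1, -1), index=rv.index, columns=rv.columns)

print(type(signs).__name__)
print(signs[:5])
```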

The documentation (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html#pandas.DataFrame.where) states that 'where' should return an object rather than change the data in place. However, the rv variable was changed in place.
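
For comparison, here is a minimal sketch of the non-in-place usage the documentation describes, with the replacement value passed by keyword as `other`. It relies on the documented behaviour that entries where the condition is True are kept and the rest are replaced by `other`. My tentative guess, which I have not confirmed, is that in the call above the third positional argument (-1) was taken as `inplace`, which would explain both the None return and the modified rv. The name `before` is only there to check that rv is unchanged:

```python
import numpy as np
import pandas as pd

np.random.seed(1000)
rv = pd.DataFrame(np.random.randn(1000, 2))
before = rv.copy()  # snapshot, used only to check that rv is not modified

# Entries where the condition holds are kept; all others are replaced by `other`.
rv2 = rv.where(rv >= 0, other=1)

# rv2 is a new DataFrame and rv itself is left untouched.
print(type(rv2).__name__)
print(rv.equals(before))
print(rv2[:5])
```

Note that this is not the same operation as np.where(rv>=0,1,-1): the pd 'where' takes one replacement value and keeps the original entries where the condition is True, while np.where picks between two values everywhere.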

Can anyone tell me what the difference between the two is, and how I am supposed to use the pd DataFrame 'where'?

Gene
  • I searched thoroughly and found no exact duplicate, though the topic "comparing two DataFrames, specific questions" is very common and broad. I do see my question buried in there, so I think the answer there is good. However, it does not explain why my original variable, rv, was modified. – Gene Jul 03 '17 at 20:41
