I've been trying to filter my data in Python with pandas and NumPy. There seems to be a difference between 'where' in pandas (pd) and 'where' in NumPy (np). Both libraries provide a 'where': NumPy as the function np.where, pandas as a DataFrame method. The np 'where' makes sense, but the pd 'where' doesn't (to me).
#[In]#
import numpy as np
from pandas import DataFrame
np.random.seed(1000) ; rv = DataFrame(np.random.randn(1000,2))
rv[:5]
#[Out]#
#           0         1
# 0 -0.804458  0.320932
# 1 -0.025483  0.644324
# 2 -0.300797  0.389475
# 3 -0.107437 -0.479983
# 4  0.595036 -0.464668
But when I try to assign rv2 from the result of the pd 'where' clause, I get:
rv2 = rv.where(rv>=0,1,-1)
type(rv2)
# NoneType
rv[:5]
#           0         1
# 0  1.000000  0.320932
# 1  1.000000  0.644324
# 2  1.000000  0.389475
# 3  1.000000  1.000000
# 4  0.595036  1.000000
So rv2 is NoneType and rv has actually changed values. It's even unclear to me how rv ends up with its new values, as they don't conform to the where clause, as far as I can see.
However, if I use the np where clause instead of the dataframe 'where' clause, things work as expected (except I get a np array instead of a dataframe):
#[In]#
np.random.seed(1000) ; rv = DataFrame(np.random.randn(1000,2))
xy = np.where(rv>=0,1,-1)
xy[:5]
#[Out]#
# array([[-1, 1],
# [-1, 1],
# [-1, 1],
# [-1, -1],
# [ 1, -1]])
The documentation http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html#pandas.DataFrame.where states that the 'where' should return an object, not do an in-place change. However, the rv variable was changed in-place.
Can anyone tell me what the difference between the two is, and how I am supposed to use the pd dataframe 'where'?
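For reference, here is a minimal sketch of how the two calls appear to differ, assuming a reasonably recent pandas. As far as I understand it, DataFrame.where(cond, other) keeps the original values where cond is True and substitutes other everywhere else, whereas np.where(cond, x, y) picks x or y for every element. In older pandas releases the third positional parameter of DataFrame.where was inplace, so the -1 in the call above would probably have been read as a truthy inplace flag, which would be consistent with both the None return value and the modified rv. The names kept and signs are made up for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(1000)
rv = pd.DataFrame(np.random.randn(1000, 2))

# DataFrame.where: keep values where the condition holds,
# replace the others with `other`. Passing `other` by keyword
# avoids any confusion with the `inplace` parameter.
kept = rv.where(rv >= 0, other=-1)

# np.where: choose 1 or -1 for every element. The result is a plain
# ndarray, so wrap it to get a DataFrame with the same labels back.
signs = pd.DataFrame(np.where(rv >= 0, 1, -1),
                     index=rv.index, columns=rv.columns)

print(type(kept).__name__)   # a DataFrame is returned; rv is left untouched
print(signs.head())
```

If the goal is the np.where behaviour (1 for non-negative, -1 for negative) but as a dataframe, the wrapped np.where call is probably the more direct route; DataFrame.where is closer to a masking operation.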