13

I would like to make a slice of a dataframe and then set the value of the first item in that slice without copying the dataframe. For example:

df = pandas.DataFrame(numpy.random.rand(3,1))
df[df[0]>0][0] = 0

The slice here is irrelevant and only serves as an example; it returns the whole dataframe again. The point is that doing it as in the example triggers a SettingWithCopyWarning (understandably). I have also tried slicing first and then using iloc/ix/loc, and using iloc twice, i.e. something like:

df.iloc[df[0]>0,:][0] = 0
df[df[0]>0,:].iloc[0] = 0

Neither of these works. Again, I don't want to make a copy of the dataframe, even if it is just the sliced version.

EDIT: It seems there are two ways: using a mask or idxmax. The idxmax method seems to work if your index is unique, and the mask method works if it is not. In my case the index is not unique, which I forgot to mention in the initial post.

jezrael
RexFuzzle

4 Answers

12

I think you can use idxmax to get the index of the first True value and then set the value with loc:

np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)))
print (df)
   0
0  1
1  3
2  0
3  0
4  3

print ((df[0] == 0).idxmax())
2

df.loc[(df[0] == 0).idxmax(), 0] = 100
print (df)
     0
0    1
1    3
2  100
3    0
4    3

df.loc[(df[0] == 3).idxmax(), 0] = 200
print (df)
     0
0    1
1  200
2  100
3    0
4    3
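
One caveat with the idxmax approach above: on a boolean mask that contains no True value, idxmax still returns the first index label, so an unguarded assignment would overwrite the first row. The following is a minimal sketch of a guard, assuming a small hand-written frame; the names `mask` and `fallback_label` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({0: [1, 3, 0, 0, 3]})

mask = df[0] == 7                # no row satisfies this condition
# on an all-False mask, idxmax still returns the first index label
fallback_label = mask.idxmax()

# only assign when at least one row actually matches
if mask.any():
    df.loc[mask.idxmax(), 0] = 100
```

With the guard in place the frame is left untouched when nothing matches.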

EDIT:

Solution for a non-unique index:

np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)), index=[1,2,2,3,4])
print (df)
   0
1  1
2  3
2  0
3  0
4  3

df = df.reset_index()
df.loc[(df[0] == 3).idxmax(), 0] = 200
df = df.set_index('index')
df.index.name = None
print (df)
     0
1    1
2  200
2    0
3    0
4    3
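
As a possible alternative to resetting and restoring the index, the first match can be addressed by position instead of by label. This is a sketch of a different technique (np.flatnonzero plus iloc), not the method shown above; the variable `positions` is an illustrative name and the data is written out by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 3, 0, 0, 3], index=[1, 2, 2, 3, 4])

# integer positions of all rows where the condition holds
positions = np.flatnonzero((df[0] == 3).to_numpy())

if positions.size:
    # iloc addresses rows by position, so duplicated labels do not matter
    df.iloc[positions[0], df.columns.get_loc(0)] = 200
```

Because the index is never modified, the duplicated labels stay exactly as they were.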

EDIT1:

Solution with MultiIndex:

np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)), index=[1,2,2,3,4])
print (df)
   0
1  1
2  3
2  0
3  0
4  3

df.index = [np.arange(len(df.index)), df.index]
print (df)
     0
0 1  1
1 2  3
2 2  0
3 3  0
4 4  3

df.loc[(df[0] == 3).idxmax(), 0] = 200
df = df.reset_index(level=0, drop=True)

print (df)
     0
1    1
2  200
2    0
3    0
4    3

EDIT2:

Solution with double cumsum:

np.random.seed(1)
df = pd.DataFrame([4,0,4,7,4], index=[1,2,2,3,4])
print (df)
   0
1  4
2  0
2  4
3  7
4  4

mask = (df[0] == 0).cumsum().cumsum()
print (mask)
1    0
2    1
2    2
3    3
4    4
Name: 0, dtype: int32

df.loc[mask == 1, 0] = 200
print (df)
     0
1    4
2  200
2    4
3    7
4    4
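
To see why the cumsum is applied twice, it may help to compare the two intermediate results. The sketch below assumes hand-written data with two matching rows; `cond`, `single` and `double` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame([4, 0, 4, 0, 4], index=[1, 2, 2, 3, 4])

cond = df[0] == 0
single = cond.cumsum()            # 0, 1, 1, 2, 2: the value 1 occurs twice
double = cond.cumsum().cumsum()   # 0, 1, 2, 4, 6: the value 1 occurs once

# only the first matching row has a double cumulative sum equal to 1
df.loc[double == 1, 0] = 200
```

The second cumsum keeps growing after the first match, which is what makes the comparison with 1 select a single row.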
jezrael
1

Consider the dataframe df

df = pd.DataFrame(dict(A=[1, 2, 3, 4, 5]))

print(df)

   A
0  1
1  2
2  3
3  4
4  5

Create some arbitrary slice slc

slc = df[df.A > 2]

print(slc)

   A
2  3
3  4
4  5

Access the first row of slc within df by using index[0] and loc

df.loc[slc.index[0]] = 0
print(df)

   A
0  1
1  2
2  0
3  4
4  5
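
If the slice does not exist yet, the first label can also be read from the boolean mask and the index, without building the sliced dataframe. This is a sketch under the assumption that the index is unique (with duplicated labels, loc would update every row carrying that label); `mask` and `first_label` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3, 4, 5]))

mask = (df.A > 2).to_numpy()     # one boolean flag per row
first_label = df.index[mask][0]  # label of the first matching row

df.loc[first_label, 'A'] = 0
```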
piRSquared
  • I was hoping to not duplicate any part of the df as it is large and even the slice could be quite big. – RexFuzzle Mar 06 '17 at 06:13
  • @RexFuzzle you said the slice was arbitrary and I'm assuming already exists. From that slice, I'm grabbing the first index value and using that to modify the original `df`. – piRSquared Mar 06 '17 at 06:18
  • I think something like `df.loc[slice, another_slice]` should be less memory intensive than `df.loc[slice].loc[:, another_slice]`. This is possible for row and column slicing at the same time but it appears it is not possible to do it row-wise with different conditions. I am not sure actually, maybe what I have in mind doesn't make sense. – ayhan Mar 09 '17 at 17:39
1
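import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(6,1),index=[1,2,2,3,3,3])
df[1] = 0
df.columns=['a','b']
# label the rows with a >= 0.5 as 1 in the helper column b (single loc call, no chained assignment)
df.loc[df['a']>=0.5,'b']=1
# DataFrame.sort was removed from pandas; sort_values is its replacement
df=df.sort_values(['b','a'],ascending=[False,True])
# note: loc selects by label, so a duplicated label updates every row that carries it
df.loc[df[df['b']==0].index.tolist()[0],'a']=0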

This method does not create an extra copy of the dataframe, but it introduces an extra column, which can be dropped after processing. To choose any index instead of the first one, you can change the last line as follows

df.loc[df[df['b']==0].index.tolist()[n],'a']=0

to change any nth item in a slice

df

          a  
1  0.111089  
2  0.255633  
2  0.332682  
3  0.434527  
3  0.730548  
3  0.844724  

df after slicing and labelling them

          a  b
1  0.111089  0
2  0.255633  0
2  0.332682  0
3  0.434527  0
3  0.730548  1
3  0.844724  1

After changing value of first item in slice (labelled as 0) to 0

          a  b
3  0.730548  1
3  0.844724  1
1  0.000000  0
2  0.255633  0
2  0.332682  0
3  0.434527  0
0

Using some of the answers, I managed to find a one-liner to do this:

np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)))
print(df)
   0
0  1
1  3
2  0
3  0
4  3
df.loc[(df[0] == 0).cumsum()==1,0] = 1
print(df)
   0
0  1
1  3
2  1
3  0
4  3

Essentially this is using the mask inline with a cumsum.
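
The one-liner works for this data because the two matching rows are adjacent. In general, a single cumsum stays at 1 for every row between the first and second match, so those rows would be selected too. A minimal sketch of one way to guard against that, assuming hand-written data and the illustrative names `cond`, `naive` and `first_only`, is to combine the cumsum test with the condition itself:

```python
import pandas as pd

df = pd.DataFrame({0: [1, 0, 3, 0, 3]})

cond = df[0] == 0
naive = cond.cumsum() == 1                 # True at rows 1 and 2
first_only = cond & (cond.cumsum() == 1)   # True at row 1 only

df.loc[first_only, 0] = 1
```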

RexFuzzle