2

I want to drop the last row in a group based on a condition. I have done the following:

df=pd.read_csv('file')
grp = df.groupby('id')
for idx, i in grp:
   df= df[df['column2'].index[-1] == 'In']

     id     product   date
 0   220    in      2014-09-01 
 1   220    out     2014-09-03 
 2   220    in      2014-10-16
 3   826    in     2014-11-11
 4   826    out     2014-12-09
 5   826    out      2014-05-19
 6   901    in      2014-09-01
 7   901    out     2014-10-05
 8   901    out     2014-11-01

When i do this i simply get: KeyError: False

my desired output would be:

     id     product   date
 0   220    in      2014-09-01 
 1   220    out     2014-09-03
 3   826    in     2014-11-11
 4   826    out     2014-12-09 
 6   901    in      2014-09-01
 7   901    out     2014-10-05
Louise Ma
  • 41
  • 2

2 Answers2

2

If want remove last in only per groups chain inverted mask with Series.duplicated by ~ with not equal in with Series.ne:

df = df[~df['id'].duplicated() | df['product'].ne('in')]
print (df)
    id product        date
0  220      in  2014-09-01
1  220     out  2014-09-03
3  826      in  2014-11-11
4  826     out  2014-12-09
5  826     out  2014-05-19
6  901      in  2014-09-01
7  901     out  2014-10-05
8  901     out  2014-11-01

EDIT:

If want all possible pairs in-out per groups use this solution, only necessary map non numeric values in-out to numeric by dict, because rolling not working with strings:

#more general solution
print (df)
     id product        date
0   220     out  2014-09-03
1   220     out  2014-09-03
2   220      in  2014-09-01
3   220     out  2014-09-03
4   220      in  2014-10-16
5   826      in  2014-11-11
6   826      in  2014-11-11
7   826     out  2014-12-09
8   826     out  2014-05-19
9   901      in  2014-09-01
10  901     out  2014-10-05
11  901      in  2014-09-01
12  901     out  2014-11-01

pat = np.asarray(['in','out'])
N = len(pat)

d = {'in':0, 'out':1}
ma  = (df['product'].map(d)
                   .groupby(df['id'])
                   .rolling(window=N , min_periods=N)
                   .apply(lambda x: (x==list(d.values())).all(), raw=False)
                   .mask(lambda x: x == 0) 
                   .bfill(limit=N-1)
                   .fillna(0)
                   .astype(bool)
                   .reset_index(level=0, drop=True)
             )
df = df[ma]
print (df)
     id product        date
2   220      in  2014-09-01
3   220     out  2014-09-03
6   826      in  2014-11-11
7   826     out  2014-12-09
9   901      in  2014-09-01
10  901     out  2014-10-05
11  901      in  2014-09-01
12  901     out  2014-11-01
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • The problem is that I need to iterate over each person first (therefore the groups), and then remove the last line, if it does not end on "out". Because now I have found out how to drop the row if they are not similar to the above one: df = df.loc [df['profuct'].shift() != df['product']] – Louise Ma Oct 27 '19 at 18:25
  • I think my problem maybe just changed a little bit... How can I access the last row within each column "id", and drop the row, if the column "product".eq('in') – Louise Ma Oct 27 '19 at 18:35
  • @LouiseMa - hmmm, not understand. Mainly write you need remove last id per groups, your expected output generate `in-out` values per groups and now in comment need something else. So now I am confused, what need exactly... – jezrael Oct 27 '19 at 18:37
  • hmm sorry... Okay, I tried to put in the second solution, and I get the error: "TypeError: cannot unpack non-iterable int object" – Louise Ma Oct 27 '19 at 18:41
  • @LouiseMa - OK, so can I ask something? Do you need one pair `in-out` per each group? – jezrael Oct 27 '19 at 18:42
  • @LouiseMa - hmm, can you create more general input data? it means is possible first row is `out`? Is possible only one value per groups? Is possible multiple `in-out` per groups? – jezrael Oct 27 '19 at 18:48
  • yes, possible first row is out.. and it is also possible that one group only have one value. And yes, there is also multiple in-out per group... – Louise Ma Oct 27 '19 at 18:57
1

An easy way is to add skipfooter=1 when opening the .csv file:

df = pd.read_csv(file, skipfooter=1, engine='python')
Max Voisard
  • 1,685
  • 1
  • 8
  • 18