Removing Rows in a Pandas Dataframe with Identical (and adjacent) entries in a specific column

Question

I have a dataframe where I have some dulicates in the "Item" column.

I want to remove the rows where there are dulicates (adjacent) but retain the last one i.e. Get rid of the red but keep the green

I then want to create a new column, where apples is assumed a start, and the next row is a time delta from this.i.e.

[Please don't post images of code/data (or links to them)](http://meta.stackoverflow.com/questions/285551/why-may-i-not-upload-images-of-code-on-so-when-asking-a-question) instead post is as text or post dataframe reproducible code as mentioned [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — anky, Dec 18 '19 at 17:42

Scott Boston · Accepted Answer · 2019-12-18T18:21:24.060

3

IIUC, try:

df_out = df.assign(Item_cnt=(df['Item'] != df['Item'].shift()).cumsum())\
  .drop_duplicates(['Item','Item_cnt'], keep='last')

df_out['delta T'] = df_out['datetime'] - df_out.groupby((df_out['Item'] == 'apples').cumsum())['datetime'].transform('first')

Output:

      Item  datetime  Item_cnt  delta T
2   apples       1.2         1      0.0
3  oranges       2.3         2      1.1
4   apples       2.5         3      0.0
5  bananas       2.7         4      0.2

Details:

Create a grouping using cumsum and checking to see if the next line differs, then use drop_duplicates keeping the last record in that group.

edited Dec 18 '19 at 18:21

answered Dec 18 '19 at 17:50

Scott Boston

147,308
15
139
187

1

Shift shifts the values from one row to the next. In this case, it takes the value from the row above and "shifts" it down 1 place. So, I can check current row to the row above to see if the value has changed. – Scott Boston Dec 18 '19 at 18:09
@DataNovice thanks! for clearing up the detla T part for me. :) – Scott Boston Dec 18 '19 at 18:22
How would I ensure it was only "apples" that was being collapsed into the last entry – user11305439 Dec 18 '19 at 18:44

score 2 · Answer 2 · answered Dec 18 '19 at 18:13

2

IIUC,

df = pd.DataFrame({'Item' : ['apples', 'apples','apples','orange','apples','bananas'],
                  'dateTime' : [1,1.1,1.2,2.3,2.5,2.7]})


s = df.copy()

s['dateTime'] = s['dateTime'].round()

idx = s.drop_duplicates(subset=['Item','dateTime'],keep='last').index.tolist()

df = df.loc[idx]

df.loc[df['Item'].ne('apples'), 'delta'] = abs(df['dateTime'].shift() - df['dateTime'])

print(df.fillna(0))
      Item  dateTime  delta
2   apples       1.2    0.0
3   orange       2.3    1.1
4   apples       2.5    0.0
5  bananas       2.7    0.2

answered Dec 18 '19 at 18:13

Umar.H

22,559
7
39
74

1

Oh... ne('apples')? I missed that part of the problem. :) +1 – Scott Boston Dec 18 '19 at 18:15
1

thanks @ScottBoston his question wasn't very clear, but hey is any request ever clear ;) – Umar.H Dec 18 '19 at 18:16
what is ne in .ne('apples') ? – user11305439 Dec 18 '19 at 19:14
@user11305439 it means not equal, as you requested in your question. Does this answer your question? If so please green tick the solution – Umar.H Dec 18 '19 at 19:40
Hi, yes it does thankyou. I have accepted your answer, but did like Willie D's way also: df['Item_lag'] = df['Item'].shift(-1) df = df[df['Item'] != df['Item_lag']] # get rid of repeated Items but the deltaT didn't seem to work (1st one should be 0) – user11305439 Dec 18 '19 at 19:50

score 1 · Answer 3 · edited Dec 18 '19 at 18:30

1

Here is the df:

df = pd.DataFrame.from_dict({'Item':
['apples', 'apples', 'apples', 'oranges', 'apples', 'bananas'],
 'dateTime':[1, 1.1, 1.2, 2.3, 2.5, 2.7]})

You can't use duplcated because you need to keep multiple copies of the same item, so try this:

df['Item_lag'] = df['Item'].shift(-1)
df = df[df['Item'] != df['Item_lag']] # get rid of repeated Items
df['deltaT'] = df['dateTime'] - df['dateTime'].shift(1).fillna(0) # calculate time diff
df.drop(['dateTime', 'Item_lag'], axis=1, inplace=True) # drop extra columns
df # display df

out:

Item    deltaT
apples  1.2
oranges 1.1
apples  0.2
bananas 0.2

edited Dec 18 '19 at 18:30

Umar.H

22,559
7
39
74

answered Dec 18 '19 at 18:06

Willie D

145
1
10

How would I ensure it was only "apples" that was being collapsed into the last entry – user11305439 Dec 18 '19 at 18:44
1

How would I ensure it was only "apples" that was being collapsed into the last entry – user11305439 Dec 18 '19 at 18:44
If you only want one entry per class use `df[~df['Item'].duplicated(keep='last')]`. – Willie D Dec 18 '19 at 19:22
Sorry, I mean't if there are dulicates in oranges then it doesn't matter, its only duplicates in apples I need to collapse to a single row (last row). Also, I think your deltaT for the first apples should be 0 but its 1.2? – user11305439 Dec 18 '19 at 19:30
Replace `df = df[df['Item'] != df['Item_lag']]` with `df = df[(df['Item'] != df['Item_lag']) & df['Item'] == 'apples')]`. I am still not sure about the deltaT. – Willie D Dec 18 '19 at 20:17

Removing Rows in a Pandas Dataframe with Identical (and adjacent) entries in a specific column

3 Answers3