3

I have a dataframe where I have some dulicates in the "Item" column.

enter image description here

I want to remove the rows where there are dulicates (adjacent) but retain the last one i.e. Get rid of the red but keep the green

enter image description here

I then want to create a new column, where apples is assumed a start, and the next row is a time delta from this.i.e.

enter image description here

user11305439
  • 117
  • 9
  • [Please don't post images of code/data (or links to them)](http://meta.stackoverflow.com/questions/285551/why-may-i-not-upload-images-of-code-on-so-when-asking-a-question) instead post is as text or post dataframe reproducible code as mentioned [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – anky Dec 18 '19 at 17:42

3 Answers3

3

IIUC, try:

df_out = df.assign(Item_cnt=(df['Item'] != df['Item'].shift()).cumsum())\
  .drop_duplicates(['Item','Item_cnt'], keep='last')

df_out['delta T'] = df_out['datetime'] - df_out.groupby((df_out['Item'] == 'apples').cumsum())['datetime'].transform('first')

Output:

      Item  datetime  Item_cnt  delta T
2   apples       1.2         1      0.0
3  oranges       2.3         2      1.1
4   apples       2.5         3      0.0
5  bananas       2.7         4      0.2

Details:

Create a grouping using cumsum and checking to see if the next line differs, then use drop_duplicates keeping the last record in that group.

Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • 1
    Shift shifts the values from one row to the next. In this case, it takes the value from the row above and "shifts" it down 1 place. So, I can check current row to the row above to see if the value has changed. – Scott Boston Dec 18 '19 at 18:09
  • @DataNovice thanks! for clearing up the detla T part for me. :) – Scott Boston Dec 18 '19 at 18:22
  • How would I ensure it was only "apples" that was being collapsed into the last entry – user11305439 Dec 18 '19 at 18:44
2

IIUC,

df = pd.DataFrame({'Item' : ['apples', 'apples','apples','orange','apples','bananas'],
                  'dateTime' : [1,1.1,1.2,2.3,2.5,2.7]})


s = df.copy()

s['dateTime'] = s['dateTime'].round()

idx = s.drop_duplicates(subset=['Item','dateTime'],keep='last').index.tolist()

df = df.loc[idx]

df.loc[df['Item'].ne('apples'), 'delta'] = abs(df['dateTime'].shift() - df['dateTime'])

print(df.fillna(0))
      Item  dateTime  delta
2   apples       1.2    0.0
3   orange       2.3    1.1
4   apples       2.5    0.0
5  bananas       2.7    0.2
Umar.H
  • 22,559
  • 7
  • 39
  • 74
  • 1
    Oh... ne('apples')? I missed that part of the problem. :) +1 – Scott Boston Dec 18 '19 at 18:15
  • 1
    thanks @ScottBoston his question wasn't very clear, but hey is any request ever clear ;) – Umar.H Dec 18 '19 at 18:16
  • what is ne in .ne('apples') ? – user11305439 Dec 18 '19 at 19:14
  • @user11305439 it means not equal, as you requested in your question. Does this answer your question? If so please green tick the solution – Umar.H Dec 18 '19 at 19:40
  • Hi, yes it does thankyou. I have accepted your answer, but did like Willie D's way also: df['Item_lag'] = df['Item'].shift(-1) df = df[df['Item'] != df['Item_lag']] # get rid of repeated Items but the deltaT didn't seem to work (1st one should be 0) – user11305439 Dec 18 '19 at 19:50
1

Here is the df:

df = pd.DataFrame.from_dict({'Item':
['apples', 'apples', 'apples', 'oranges', 'apples', 'bananas'],
 'dateTime':[1, 1.1, 1.2, 2.3, 2.5, 2.7]})

You can't use duplcated because you need to keep multiple copies of the same item, so try this:

df['Item_lag'] = df['Item'].shift(-1)
df = df[df['Item'] != df['Item_lag']] # get rid of repeated Items
df['deltaT'] = df['dateTime'] - df['dateTime'].shift(1).fillna(0) # calculate time diff
df.drop(['dateTime', 'Item_lag'], axis=1, inplace=True) # drop extra columns
df # display df

out:

Item    deltaT
apples  1.2
oranges 1.1
apples  0.2
bananas 0.2
Umar.H
  • 22,559
  • 7
  • 39
  • 74
Willie D
  • 145
  • 1
  • 10
  • How would I ensure it was only "apples" that was being collapsed into the last entry – user11305439 Dec 18 '19 at 18:44
  • 1
    How would I ensure it was only "apples" that was being collapsed into the last entry – user11305439 Dec 18 '19 at 18:44
  • If you only want one entry per class use `df[~df['Item'].duplicated(keep='last')]`. – Willie D Dec 18 '19 at 19:22
  • Sorry, I mean't if there are dulicates in oranges then it doesn't matter, its only duplicates in apples I need to collapse to a single row (last row). Also, I think your deltaT for the first apples should be 0 but its 1.2? – user11305439 Dec 18 '19 at 19:30
  • Replace `df = df[df['Item'] != df['Item_lag']]` with `df = df[(df['Item'] != df['Item_lag']) & df['Item'] == 'apples')]`. I am still not sure about the deltaT. – Willie D Dec 18 '19 at 20:17