3

An extension to my previous question. I have a source dataframe, which has three columns: Customer, Date and Item. I want to add a new column that contains Item History, being an array of all the Items for that Customer that are in earlier (defined by the Date) rows. Where a customer has made multiple purchases on the same date, neither row's item should be listed in the item history for the other.

So, given this sample data:

df = pd.DataFrame({'Customer':['Bert', 'Bert', 'Bert', 'Bert', 'Bert', 'Ernie', 'Ernie', 'Ernie', 'Ernie', 'Steven', 'Steven'], 'Date':['01/01/2019', '15/01/2019', '20/01/2019', '20/01/2019', '22/01/2019', '01/01/2019', '15/01/2019', '20/01/2019', '22/01/2019', '01/01/2019' ,'15/01/2019'], 'Item':['Bread', 'Cheese', 'Apples', 'Pears', 'Toothbrush', 'Toys', 'Shellfish', 'Dog', 'Yoghurt', 'Toilet', 'Dominos']})
Customer    Date    Item
Bert    01/01/2019  Bread
Bert    15/01/2019  Cheese
Bert    20/01/2019  Apples
Bert    20/01/2019  Pears
Bert    22/01/2019  Toothbrush
Ernie   01/01/2019  Toys
Ernie   15/01/2019  Shellfish
Ernie   20/01/2019  Dog
Ernie   22/01/2019  Yoghurt
Steven  01/01/2019  Toilet
Steven  15/01/2019  Dominos

The output I'd like to see would be:

Customer    Date    Item        Item History
Bert    01/01/2019  Bread       NaN
Bert    15/01/2019  Cheese      [Bread]
Bert    20/01/2019  Apples      [Bread, Cheese]
Bert    20/01/2019  Pears       [Bread, Cheese]
Bert    22/01/2019  Toothbrush  [Bread, Cheese, Apples, Pears]
Ernie   01/01/2019  Toys        NaN
Ernie   15/01/2019  Shellfish   [Toys]
Ernie   20/01/2019  Dog         [Toys, Shellfish]
Ernie   22/01/2019  Yoghurt     [Toys, Shellfish, Dog]
Steven  01/01/2019  Toilet      NaN
Steven  15/01/2019  Dominos     [Toilet]

Note that for Bert's purchases on 20/01/2019, neither's History column contains the other's item. For his 22/01/2019 purchase, both of the items from 20/01/2019 are included.

The answer to the previous question is a nifty bit of list comprehension, in the form:

df['Item History'] = [x.Item[:i].tolist() for j, x in df.groupby('Customer') 
                                          for i in range(len(x))]

df.loc[~df['Item History'].astype(bool), 'Item History']= np.nan

But obviously "i" in the x.Item[:i] needs to work out the last row where the Date was not the same as the current row. Any advice on achieving that is much appreciated.

Dan Scally
  • 1,922
  • 1
  • 19
  • 31

3 Answers3

2

Another way with apply and np.cumsum():

#aggregates Item as list per 'Customer'& 'Date'
m=df.groupby(['Customer','Date'])['Item'].apply(lambda x: 
                               [*itertools.chain.from_iterable([x])])
#groups each Customer and cumsum the list with shift
n=m.groupby(level=0).apply(lambda x:np.cumsum(x).shift())
df.set_index(['Customer','Date']).assign(Item=n).reset_index() #assign back

   Customer        Date                            Item
0      Bert  01/01/2019                             NaN
1      Bert  15/01/2019                         [Bread]
2      Bert  20/01/2019                 [Bread, Cheese]
3      Bert  20/01/2019                 [Bread, Cheese]
4      Bert  22/01/2019  [Bread, Cheese, Apples, Pears]
5     Ernie  01/01/2019                             NaN
6     Ernie  15/01/2019                          [Toys]
7     Ernie  20/01/2019               [Toys, Shellfish]
8     Ernie  22/01/2019          [Toys, Shellfish, Dog]
9    Steven  01/01/2019                             NaN
10   Steven  15/01/2019                        [Toilet]
anky
  • 74,114
  • 11
  • 41
  • 70
2

Idea is distinguish duplicated values per groups by DataFrame.duplicated and then replace values to NaNs with forward filling missing values.

First values per groups are always empty strings, so not necessary replace per groups:

df['Item History'] = [x.Item[:i].tolist() for j, x in df.groupby('Customer') 
                                          for i in range(len(x))]

df['Item History'] = df['Item History'].mask(df.duplicated(['Customer','Date'])).ffill()

df.loc[~df['Item History'].astype(bool), 'Item History']= np.nan
print (df)
   Customer        Date        Item                    Item History
0      Bert  01/01/2019       Bread                             NaN
1      Bert  15/01/2019      Cheese                         [Bread]
2      Bert  20/01/2019      Apples                 [Bread, Cheese]
3      Bert  20/01/2019       Pears                 [Bread, Cheese]
4      Bert  22/01/2019  Toothbrush  [Bread, Cheese, Apples, Pears]
5     Ernie  01/01/2019        Toys                             NaN
6     Ernie  15/01/2019   Shellfish                          [Toys]
7     Ernie  20/01/2019         Dog               [Toys, Shellfish]
8     Ernie  22/01/2019     Yoghurt          [Toys, Shellfish, Dog]
9    Steven  01/01/2019      Toilet                             NaN
10   Steven  15/01/2019     Dominos                        [Toilet]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

A possibly simpler answer just using apply - this may be slower than other methods:

df['item history'] = df.apply(lambda x: 
            [i for i in list(df.loc[(df.Date<x.Date)&(df.Customer==x.Customer),'Item'])], axis=1)

result:

   Customer               ...                                  item history
0      Bert               ...                                            []
1      Bert               ...                                       [Bread]
2      Bert               ...                               [Bread, Cheese]
3      Bert               ...                               [Bread, Cheese]
4      Bert               ...                [Bread, Cheese, Apples, Pears]
5     Ernie               ...                                            []
6     Ernie               ...                                        [Toys]
7     Ernie               ...                             [Toys, Shellfish]
8     Ernie               ...                        [Toys, Shellfish, Dog]
9    Steven               ...                                            []
10   Steven               ...                                      [Toilet]

You may want to add list(set()) to the result if you want list of unique items.

Jim Eisenberg
  • 1,490
  • 1
  • 9
  • 17