2

I have a pandas dataframe like following

buyer_id  item_id  order_id    date        qty_purchased  qty_bought
     210      82      470  2016-01-02              5           1
     169      57      475  2016-01-02              5           1
     169      83      475  2016-01-02              5           1
     228      82      520  2016-01-03              4           1
     228      86      520  2016-01-03              4           1
     228      90      520  2016-01-03              4           1
     229      57      521  2016-01-03              4           1
     232      82      525  2016-01-04              4           3
     210      90      526  2016-01-04              4           1
     210      91      526  2016-01-04              5           1
     210      15      526  2016-01-05              5           1
     233      90      527  2016-01-05              4           1

I I want to find item_id introduced on each date and if item_id is introduced on more than 1 date then I want to find the ratio of `(qty_bought/qty_purchased) on each date.

My desired output is as follows

Item_id date 1st Introduced  Ratio Date 2nd Introduced Ratio Date 3rd Introduced Ratio Flag

 82        2016-01-02        1/5       2016-01-03       1/4      2016-01-04       3/4    1

Condition for a flag is when ratio is greater than previous dates then it should be set to 1 or else 0

If I have item introduced on 5 different dates then this should dynamically generate 5 dates and ratio columns. Ratio will be specific to that date. i want to list only those item_id which has introduced more than once.

This is my attempt in python

df.groupby('item_id')['date'].apply(lambda x: np.unique(x.tolist()))

this gives me list of item_id and the dates they were introduced on. Now I want to subset above if the item has introduced on more than 1 date.

df.groupby('item_id').apply(lambda r: r['date'].unique().shape[0] > 1)

This gives me all the item_id introduced on more than 1 date. But I didn't get how to make a dataframe with the desired output and how to add date & ratio column dynamically depending upon no of dates they were introduced on. Please help

ilyas patanam
  • 5,116
  • 2
  • 29
  • 33
Neil
  • 7,937
  • 22
  • 87
  • 145

1 Answers1

1

The first part of this question is to select those rows that have an item_id with more than one date and create a new dateframe with just these items.

#subset the items which have more than one date
items_1 = df.groupby('item_id').filter(lambda x: len(np.unique(x['date']))>1).item_id
#create a new dataframe with just those items that have more than one date
new_df = df[df['item_id'].isin(items_1)].copy()
#create the ratio columns
new_df['ratio'] = new_df['qty_bought']/new_df['qty_purchased']
#delete the columns that are not required
new_df.drop(['order_id', 'buyer_id','qty_purchased', 'qty_bought'], axis = 1, inplace= True)

 item_id    date    ratio
0   82  2016-01-02  0.20
1   57  2016-01-02  0.20
3   82  2016-01-03  0.25
5   90  2016-01-03  0.25
6   57  2016-01-03  0.25
7   82  2016-01-04  0.75
8   90  2016-01-04  0.25
11  90  2016-01-05  0.25

The second part of the question is to have only one row for each unique item_id and multiple columns for the corresponding dates and ratio. We use groupby to grab the entries for each item_id and then iterate through its date and ratio values while adding them to newly created columns in the dateframe.

#group by items and grab each date after the first and insert in a new column
for name, group in new_df.groupby('item_id'):
    for i in range(1, len(group)):
        new_df.loc[group.index[0], 'date'+str(i+1)] = group.date.iloc[i]
        new_df.loc[group.index[0], 'ratio'+str(i+1)] = group.ratio.iloc[i]
#delete the original date column since that information was replicated
new_df.drop(['date', 'ratio'], axis =1, inplace=True)
#keep only one row for each `item_id`
new_df.dropna(subset = ['date0'])


   item_id  date     ratio   date2     ratio2   date3     ratio3
0   82  2016-01-02  0.20    2016-01-03  0.25    2016-01-04  0.75
1   57  2016-01-02  0.20    2016-01-03  0.25    NaN         NaN
5   90  2016-01-03  0.25    2016-01-04  0.25    2016-01-05  0.25
ilyas patanam
  • 5,116
  • 2
  • 29
  • 33
  • I don't want repeatative dates can't we have unique date? so in your above output `2016-01-02` should be displayed only once. – Neil Feb 21 '16 at 06:24
  • I have edited the for loop such that we only add the dates that were not already in that row. If you'd like to rename the first date and ratio columns, take a look at [renaming columns](http://stackoverflow.com/a/20868446/1435522). Let me know if there is anything else. – ilyas patanam Feb 21 '16 at 07:23
  • Awesome.. thanks a ton.. I just want to add one more column which will calculate days difference between two dates `(date2-date1)` if there are 3 date columns then `(date2-date1) & (date3-date2)` – Neil Feb 21 '16 at 07:42
  • To add a column just do `newdf['new_col_name']`. Next use a for loop and take advantage of the fact that the when there are only 2 dates there is a NaN for date3. When there are three dates, there is no NaN for date 3. `pandas.notnull()` and `pandas.isnull()` will help with selecting those rows. – ilyas patanam Feb 23 '16 at 02:54