
I have two columns, min_date and max_date, in a dataframe where each row corresponds to a unique product and three levels of product hierarchy (product group). I have taken the difference between them to find the number of days in between (date_diff). Now I want to see how many products fall into different buckets: say, how many products from group 04 have a date_diff of more than 180 days, how many products from group 04 have a date_diff of more than 150 days and less than 180 days, and so on. I will have 7 buckets of date_diff, from a 0-30 day difference to a difference of more than 180 days.

I am trying the following code:

    check_df=pd.DataFrame()
    for i in range(0,170331) :
         if (max_days_by_order.date_diff[i] > 160) :
                check_df[i] =  max_days_by_order.iloc[i]
        
    check_df  

I am getting this error:

'>' not supported between instances of 'Timedelta' and 'int'

My dataframe looks like this: [screenshot of the dataframe]

FObersteiner
  • use a timedelta object for comparison? e.g. `... > pd.Timedelta(days=160)` for 160 days timedelta, see [docs](https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html) – FObersteiner Jul 22 '21 at 07:30
  • Does this answer your question? [Comparing a time delta in python](https://stackoverflow.com/questions/2591845/comparing-a-time-delta-in-python) – Björn Jul 22 '21 at 07:33
  • Is `date_diff` a `pd.Timedelta`? If so, use `...date_diff[i].days`. Check [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timedelta.html?highlight=timedelta#pandas.Timedelta). – Timus Jul 22 '21 at 07:58
  • Thanks @MrFuppes , this worked. Can you also help me in creating a dataframe of the counts of prod_loc within each pg_external_code_1 for every bucket ? like the code I have written creates only one slice of the dataframe for the bucket for >160 days. – shilratna ganvir Jul 22 '21 at 09:35
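
To make the two suggestions from the comments concrete, here is a minimal sketch. The two-row `date_diff` series is a made-up stand-in for the column in the question, and the 160-day threshold is taken from the loop above.

```python
import pandas as pd

# Hypothetical stand-in for the date_diff column from the question.
date_diff = pd.Series(pd.to_timedelta(["92 days", "365 days"]), name="date_diff")

# Comparing a single Timedelta with a plain int raises TypeError.
try:
    date_diff[0] > 160
    raised = False
except TypeError:
    raised = True

# Option 1: compare against a Timedelta.
mask_td = date_diff > pd.Timedelta(days=160)

# Option 2: compare the whole-day component, which is an integer.
mask_days = date_diff.dt.days > 160

print(raised)
print(mask_td.tolist())
print(mask_days.tolist())
```

Both options work on the whole column at once, so no loop over the rows is needed.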

1 Answer

EDIT: Regarding your additional question:

Sample dataframe (simplified; please always add small samples to your question that can be copied, not screenshots):

import pandas as pd

df = pd.DataFrame({
    'prod_loc': range(10),
    'code_1': ['01'] * 5 + ['02'] * 5,
    'code_2': ['001'] * 3 + ['002'] * 3 + ['003'] * 4,
    'min_date': pd.to_datetime(['2021-07-22'] * 10),
    'max_date': pd.date_range('2021-07-22', periods=10, freq='25d')
})
df['date_diff'] = df.max_date - df.min_date

   prod_loc code_1 code_2   min_date   max_date date_diff
0         0     01    001 2021-07-22 2021-07-22    0 days
1         1     01    001 2021-07-22 2021-08-16   25 days
2         2     01    001 2021-07-22 2021-09-10   50 days
3         3     01    002 2021-07-22 2021-10-05   75 days
4         4     01    002 2021-07-22 2021-10-30  100 days
5         5     02    002 2021-07-22 2021-11-24  125 days
6         6     02    003 2021-07-22 2021-12-19  150 days
7         7     02    003 2021-07-22 2022-01-13  175 days
8         8     02    003 2021-07-22 2022-02-07  200 days
9         9     02    003 2021-07-22 2022-03-04  225 days

First step: set up the buckets (you would choose others) and use pd.cut to bin the date_diff column with them:

buckets = list(range(0, 181, 50)) + [df.date_diff.max().days + 1]
cut = pd.cut(df.date_diff.dt.days, buckets, right=False)
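
For the seven buckets described in the question, the edges could look roughly like the sketch below. The day counts, the label strings, and the decision to put exactly 180 days into the last bucket are my assumptions, since the question does not say where the edges belong.

```python
import numpy as np
import pandas as pd

# Hypothetical day counts chosen to sit on and around the bucket edges.
days = pd.Series([0, 29, 30, 179, 180, 400])

# Six 30-day buckets plus one open-ended bucket for 180 days and more.
edges = [0, 30, 60, 90, 120, 150, 180, np.inf]
labels = ["0-29", "30-59", "60-89", "90-119", "120-149", "150-179", "180+"]

# right=False: each bucket includes its lower edge and excludes its upper edge.
bucket = pd.cut(days, bins=edges, labels=labels, right=False)

print(bucket.astype(str).tolist())
```

Using `np.inf` as the last edge avoids having to compute the maximum first. Drop `labels` if you prefer the interval notation.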

And then, second step, do

result = df.groupby(['code_1', cut]).prod_loc.count().unstack(1)

which yields

date_diff  [0, 50)  [50, 100)  [100, 150)  [150, 226)
code_1                                               
01               2          2           1           0
02               0          0           1           4

or

result = df.groupby(['code_1', 'code_2', cut]).prod_loc.count().unstack(2)

which yields

date_diff      [0, 50)  [50, 100)  [100, 150)  [150, 226)
code_1 code_2                                            
01     001           2          1           0           0
       002           0          1           1           0
       003           0          0           0           0
02     001           0          0           0           0
       002           0          0           1           0
       003           0          0           0           4

You don't need to unstack if you prefer a longer view.
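
A sketch of that longer view, on a reduced version of the sample dataframe (only `code_1`). Passing `observed=False` explicitly is my addition: it keeps the empty buckets in the result as zeros, and as far as I know the default for this parameter changes in newer pandas releases. The column name `n_products` is made up.

```python
import pandas as pd

# Reduced version of the sample dataframe: ten products, 0 to 225 days.
df = pd.DataFrame({
    "prod_loc": range(10),
    "code_1": ["01"] * 5 + ["02"] * 5,
    "date_diff": pd.to_timedelta([f"{d} days" for d in range(0, 226, 25)]),
})

buckets = [0, 50, 100, 150, 226]
cut = pd.cut(df["date_diff"].dt.days, buckets, right=False)

# One row per (code_1, bucket) pair; empty buckets are kept with a count of 0.
long_counts = df.groupby(["code_1", cut], observed=False)["prod_loc"].count()

# The same counts as a flat dataframe with a hypothetical column name.
tidy = long_counts.reset_index(name="n_products")

print(long_counts)
```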

You can also try

df['buckets'] = cut
# counts per code_1
result = df.pivot_table(index=['code_1'], columns='buckets',
                        values='prod_loc', aggfunc='count')
# or counts per code_1 and code_2
result = df.pivot_table(index=['code_1', 'code_2'], columns='buckets',
                        values='prod_loc', aggfunc='count')

Is this what you are looking for?


Btw.: Don't iterate over dataframes unless you absolutely have to. Use the native Pandas methods. For example, for

max_days_by_order = pd.DataFrame({
    'min_date': pd.to_datetime(['2021-07-21', '2021-07-22']),
    'max_date': pd.to_datetime(['2021-10-21', '2022-07-22'])
})
max_days_by_order['date_diff'] = (max_days_by_order.max_date
                                  - max_days_by_order.min_date)

    min_date   max_date date_diff
0 2021-07-21 2021-10-21   92 days
1 2021-07-22 2022-07-22  365 days

this

check_df = max_days_by_order.date_diff.where(
                max_days_by_order.date_diff.dt.days > 180
           )

produces

0        NaT
1   365 days
Name: date_diff, dtype: timedelta64[ns]

Which seems to be what you are trying to achieve? (I don't have the full picture, so I might have missed something.)
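
If you want the complete rows instead (which is what the loop in the question seems to be collecting), boolean indexing is one way to do it. A minimal sketch on the same two-row frame, using the 160-day threshold from the question:

```python
import pandas as pd

max_days_by_order = pd.DataFrame({
    "min_date": pd.to_datetime(["2021-07-21", "2021-07-22"]),
    "max_date": pd.to_datetime(["2021-10-21", "2022-07-22"]),
})
max_days_by_order["date_diff"] = (max_days_by_order["max_date"]
                                  - max_days_by_order["min_date"])

# Boolean indexing keeps every column of the rows that satisfy the condition.
check_df = max_days_by_order[
    max_days_by_order["date_diff"] > pd.Timedelta(days=160)
]

print(check_df)
```

Unlike `where`, this drops the non-matching rows rather than filling them with `NaT`.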

Timus