
I have a problem: I want to answer some date-related questions, but I get the error `ValueError: cannot reindex from a duplicate axis`. I looked at What does `ValueError: cannot reindex from a duplicate axis` mean?, but nothing there worked for me. How can I solve this?

I tried `print(True in df.index.duplicated())`, which prints `False`.

# Did not work for me
#df[df.index.duplicated()]
#df = df.loc[:,~df.columns.duplicated()]
#df = df.reset_index()
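Note that the check above runs against the default `RangeIndex`, which is always unique; the duplicates that trigger the error live in the (`customerId`, `fromDate`) pairs. A small sketch (recreating the frame from below) showing a check that does find them:

```python
import pandas as pd

d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
                  '2021-02-10', '2021-09-07', None, '2022-01-18',
                  '2021-05-17', '2021-05-17', '2021-07-17',
                  '2021-02-22', '2021-02-22']}
df = pd.DataFrame(data=d)

# The default RangeIndex is unique, so this is always False here
print(df.index.duplicated().any())                      # False

# The duplication is in the customerId/fromDate pairs
print(df.duplicated(['customerId', 'fromDate']).any())  # True
```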

Dataframe

    customerId    fromDate
0            1  2021-02-22
1            1  2021-03-18
2            1  2021-03-22
3            1  2021-02-10
4            1  2021-09-07
5            1        None
6            1  2022-01-18
7            2  2021-05-17
8            3  2021-05-17
9            3  2021-07-17
10           3  2021-02-22
11           3  2021-02-22
import pandas as pd

d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
                  '2021-02-10', '2021-09-07', None, '2022-01-18',
                  '2021-05-17', '2021-05-17', '2021-07-17',
                  '2021-02-22', '2021-02-22']
    }
df = pd.DataFrame(data=d)
#display(df)

#converting to datetimes
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
#for correct add missing dates is sorting ascending by both columns
df = df.sort_values(['customerId','fromDate'])

#new column per customerId
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']

#add missing dates per customerId; rows with missing fromDate are dropped first
df = (df.dropna(subset=['fromDate'])
        .set_index('fromDate')
        .groupby('customerId')['lastInteractivity']
        .apply(lambda x: x.asfreq('d'))
        .reset_index())

[OUT] 
ValueError: cannot reindex from a duplicate axis
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-3f715dc564ee> in <module>()
      3         .set_index('fromDate')
      4         .groupby('customerId')['lastInteractivity']
----> 5         .apply(lambda x: x.asfreq('d'))
      6         .reset_index())
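A minimal reproduction of the error (assumed toy data, not the original frame): `asfreq()` reindexes the group onto a full daily range, and reindexing from a `DatetimeIndex` that contains the same timestamp twice raises this `ValueError`:

```python
import pandas as pd

# Two rows sharing the same date, like customer 3's 2021-02-22 rows
s = pd.Series([1.0, 2.0],
              index=pd.DatetimeIndex(['2021-02-22', '2021-02-22']))
try:
    s.asfreq('d')   # internally reindexes onto a daily date_range
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```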
  • Customer 3 has two rows where the `'fromDate'` is `2021-02-22`. So right before your `groupby`, when you set that column as the index, you create an index with duplicates. How do you want to handle this case? For instance, since the data in this case just seems to be duplicated, you could add a `.drop_duplicates(['customerId', 'fromDate'])` before the `.set_index`, though you'll need to decide whether that logic makes sense – ALollz May 24 '22 at 14:40
  • 1
    you could `groupby` to get the `min`/`max` fromDate – mozway May 24 '22 at 14:41
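One way to read mozway's suggestion (the names `spans`, `full`, `out` and the toy values are mine, not from the question): build each customer's daily calendar from the `min`/`max` date instead of relying on `asfreq()`, so duplicate dates in the source never form an index at all:

```python
import pandas as pd

df = pd.DataFrame({
    'customerId': [1, 1, 3, 3, 3],
    'fromDate': pd.to_datetime(['2021-02-10', '2021-02-12',
                                '2021-02-22', '2021-02-22', '2021-02-24']),
    'lastInteractivity': [5, 3, 10, 10, 8]})  # arbitrary example values

# One row per calendar day between each customer's first and last date
spans = df.groupby('customerId')['fromDate'].agg(['min', 'max'])
full = (spans.apply(lambda r: list(pd.date_range(r['min'], r['max'])), axis=1)
             .explode()
             .rename('fromDate')
             .reset_index())
full['fromDate'] = pd.to_datetime(full['fromDate'])

# Duplicate (customerId, fromDate) pairs no longer break anything:
# merge the observed rows back onto the full calendar
out = full.merge(df.drop_duplicates(), on=['customerId', 'fromDate'], how='left')
print(len(out))  # 6: three days for customer 1, three for customer 3
```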

1 Answer


Indeed, I arrived at the same conclusion as @ALollz in his comment: by adding `drop_duplicates`, you get the expected result:

#add missing dates per customerId; rows with missing fromDate are dropped first
df = (df.dropna(subset=['fromDate'])
        .drop_duplicates(['fromDate', 'customerId'])
        .set_index('fromDate')
        .groupby('customerId')['lastInteractivity']
        .apply(lambda x: x.asfreq('d'))
        .reset_index())

Output :

    customerId  fromDate    lastInteractivity
0   1           2021-02-10  468 days
1   1           2021-02-11  NaT
2   1           2021-02-12  NaT
3   1           2021-02-13  NaT
4   1           2021-02-14  NaT
...
485 3           2021-07-13  NaT
486 3           2021-07-14  NaT
487 3           2021-07-15  NaT
488 3           2021-07-16  NaT
489 3           2021-07-17  311 days
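The fix works because `drop_duplicates` (with its default `keep='first'`) leaves at most one row per (`customerId`, `fromDate`) pair, so the per-group `DatetimeIndex` handed to `asfreq` is unique. A minimal check on assumed toy data (just customer 3's problematic rows):

```python
import pandas as pd

df = pd.DataFrame({
    'customerId': [3, 3, 3],
    'fromDate': pd.to_datetime(['2021-02-22', '2021-02-22', '2021-07-17'])})

deduped = df.drop_duplicates(['fromDate', 'customerId'])  # keep='first' by default
print(len(deduped))                                          # 2
print(deduped.duplicated(['customerId', 'fromDate']).any())  # False
```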
tlentali