How to resolve ValueError: cannot reindex from a duplicate axis

Question

Input

Client	First name	Last Name	Start Date	End Date	Amount	Invoice Date
XXX	John	Kennedy	15-01-2021	28-02-2021	137,586.00	20-04-2021
YYY	Peter	Paul	7-02-2021	31-03-2021	38,750.00	20-04-2021
ZZZ	Michael	K	10-03-2021	29-04-2021	137,586.00	30-04-2021

Code

df = pd.read_excel ('file.xlsx',parse_dates=['Start Date','End Date'] )
df['Start Date'] = pd.to_datetime(df['Start Date'],format='%d-%m-%Y')
df['End Date'] = pd.to_datetime(df['End Date'],format='%d-%m-%Y')

df['r'] = df.apply(lambda x: pd.date_range(x['Start Date'],x['End Date']), axis=1)
df = df.explode('r')
print(df)

months = df['r'].dt.month

starts, ends = months.ne(months.groupby(level=0).shift(1)), months.ne(months.groupby(level=0).shift(-1))


df2 = pd.DataFrame({'First Name': df['First name'],
           'Start Date': df.loc[starts, 'r'].dt.strftime('%Y-%m-%d'),
                   'End Date': df.loc[ends, 'r'].dt.strftime('%Y-%m-%d'),
                   'Date Diff': df.loc[ends, 'r'].dt.strftime('%d').astype(int)-df.loc[starts, 'r'].dt.strftime('%d').astype(int)+1})

df = df.loc[~df.index.duplicated(), :]

df2 = pd.merge(df, df2, left_index=True, right_index=True)

df2['Amount'] = df['Amount'].mul(df2['Date_Diff'])
print(df['Amount'])
print (df)

df.to_excel('report.xlsx', index=True)

Error ValueError: cannot reindex from a duplicate axis

Expected output

how to resolve this issue?

Does this answer your question? [What does \`ValueError: cannot reindex from a duplicate axis\` mean?](https://stackoverflow.com/questions/27236275/what-does-valueerror-cannot-reindex-from-a-duplicate-axis-mean) — yann ziselman, Jul 28 '21 at 08:20

Valdi_Bo · Accepted Answer · 2021-07-28T14:30:09.523

Start with some correction in your input Excel file, namely change First name to First Name - with capital "N", just like in other columns.

Then, to read your Excel file, it is enough to run:

df = pd.read_excel('Input.xlsx', parse_dates=['Start Date', 'End Date',
   'Invoice Date'], dayfirst=True)

No need to call to_datetime.

Note also that since Invoice Date contains also dates, I added this column to parse_dates list.

Then define two functions:

A function to get monthly data for the current row:
```
def getMonthData(grp, amnt, dayNo):
    return pd.Series([grp.min(), grp.max(), amnt * grp.size / dayNo],
        index=['Start Date', 'End Date', 'Amount'])
```
It converts the input Series of dates (for a single month) into the "new" content of the output rows (start / end dates and the proper share of the total amount, to be accounted for this month).

It will be called in the following function.

A function to "explode" the current row:

def rowExpl(row):
    ind = pd.date_range(row['Start Date'], row['End Date']).to_series()
    rv = ind.groupby(pd.Grouper(freq='M')).apply(getMonthData,
        amnt=row.Amount, dayNo=ind.size).unstack().reset_index(drop=True)
    rv.insert(0, 'Client', row.Client)
    rv.insert(1, 'First Name', row['First Name'])
    rv.insert(2, 'Last Name', row['Last Name'])
    return rv.assign(**{'Invoice Date': row['Invoice Date']})

And the last step is to get the result. Apply rowExpl to each row and concatenate the partial results into a single output DataFrame:

result = pd.concat(df.apply(rowExpl, axis=1).values, ignore_index=True)

The result, for your data sample is:

  Client First Name Last Name Start Date   End Date   Amount Invoice Date
0    XXX       John   Kennedy 2021-01-15 2021-01-31  51976.9   2021-04-20
1    XXX       John   Kennedy 2021-02-01 2021-02-28  85609.1   2021-04-20
2    YYY      Peter      Paul 2021-02-07 2021-02-28  16084.9   2021-04-20
3    YYY      Peter      Paul 2021-03-01 2021-03-31  22665.1   2021-04-20
4    ZZZ    Michael         K 2021-03-10 2021-03-31  59350.8   2021-04-30
5    ZZZ    Michael         K 2021-04-01 2021-04-29  78235.2   2021-04-30

Don't be disaffected by seemingly too low precision of Amount column. It is only the way how Jupyter Notebook displays the DataFrame.

When you run result.iloc[0, 5], you will get:

51976.933333333334

with full, actually held precision.

VaIdi_Bo tried to execute the function , but output is blank — Gomathi, Jul 28 '21 at 13:40
If you executed *result = pd.concat(…)* then the actual result is in *result* variable. To see this result, execute e.g. *print(result)*. Check also the file name (whether this file actually contains the expected data). — Valdi_Bo, Jul 28 '21 at 14:27
Another check: Instead of reading from an Excel file, load the data from e.g. a CSV file, then exeute my code. — Valdi_Bo, Jul 28 '21 at 14:32

How to resolve ValueError: cannot reindex from a duplicate axis

1 Answers1