0

Input

Client First name Last Name Start Date End Date Amount Invoice Date
XXX John Kennedy 15-01-2021 28-02-2021 137,586.00 20-04-2021
YYY Peter Paul 7-02-2021 31-03-2021 38,750.00 20-04-2021
ZZZ Michael K 10-03-2021 29-04-2021 137,586.00 30-04-2021

Code

df = pd.read_excel ('file.xlsx',parse_dates=['Start Date','End Date'] )
df['Start Date'] = pd.to_datetime(df['Start Date'],format='%d-%m-%Y')
df['End Date'] = pd.to_datetime(df['End Date'],format='%d-%m-%Y')

df['r'] = df.apply(lambda x: pd.date_range(x['Start Date'],x['End Date']), axis=1)
df = df.explode('r')
print(df)

months = df['r'].dt.month

starts, ends = months.ne(months.groupby(level=0).shift(1)), months.ne(months.groupby(level=0).shift(-1))


df2 = pd.DataFrame({'First Name': df['First name'],
           'Start Date': df.loc[starts, 'r'].dt.strftime('%Y-%m-%d'),
                   'End Date': df.loc[ends, 'r'].dt.strftime('%Y-%m-%d'),
                   'Date Diff': df.loc[ends, 'r'].dt.strftime('%d').astype(int)-df.loc[starts, 'r'].dt.strftime('%d').astype(int)+1})

df = df.loc[~df.index.duplicated(), :]

df2 = pd.merge(df, df2, left_index=True, right_index=True)

df2['Amount'] = df['Amount'].mul(df2['Date_Diff'])
print(df['Amount'])
print (df)

df.to_excel('report.xlsx', index=True)

Error ValueError: cannot reindex from a duplicate axis

Expected output

enter image description here

how to resolve this issue?

Gomathi
  • 131
  • 9
  • Does this answer your question? [What does \`ValueError: cannot reindex from a duplicate axis\` mean?](https://stackoverflow.com/questions/27236275/what-does-valueerror-cannot-reindex-from-a-duplicate-axis-mean) – yann ziselman Jul 28 '21 at 08:20

1 Answers1

1

Start with some correction in your input Excel file, namely change First name to First Name - with capital "N", just like in other columns.

Then, to read your Excel file, it is enough to run:

df = pd.read_excel('Input.xlsx', parse_dates=['Start Date', 'End Date',
   'Invoice Date'], dayfirst=True)

No need to call to_datetime.

Note also that since Invoice Date contains also dates, I added this column to parse_dates list.

Then define two functions:

  1. A function to get monthly data for the current row:

    def getMonthData(grp, amnt, dayNo):
        return pd.Series([grp.min(), grp.max(), amnt * grp.size / dayNo],
            index=['Start Date', 'End Date', 'Amount'])
    

    It converts the input Series of dates (for a single month) into the "new" content of the output rows (start / end dates and the proper share of the total amount, to be accounted for this month).

    It will be called in the following function.

  2. A function to "explode" the current row:

    def rowExpl(row):
        ind = pd.date_range(row['Start Date'], row['End Date']).to_series()
        rv = ind.groupby(pd.Grouper(freq='M')).apply(getMonthData,
            amnt=row.Amount, dayNo=ind.size).unstack().reset_index(drop=True)
        rv.insert(0, 'Client', row.Client)
        rv.insert(1, 'First Name', row['First Name'])
        rv.insert(2, 'Last Name', row['Last Name'])
        return rv.assign(**{'Invoice Date': row['Invoice Date']})
    

And the last step is to get the result. Apply rowExpl to each row and concatenate the partial results into a single output DataFrame:

result = pd.concat(df.apply(rowExpl, axis=1).values, ignore_index=True)

The result, for your data sample is:

  Client First Name Last Name Start Date   End Date   Amount Invoice Date
0    XXX       John   Kennedy 2021-01-15 2021-01-31  51976.9   2021-04-20
1    XXX       John   Kennedy 2021-02-01 2021-02-28  85609.1   2021-04-20
2    YYY      Peter      Paul 2021-02-07 2021-02-28  16084.9   2021-04-20
3    YYY      Peter      Paul 2021-03-01 2021-03-31  22665.1   2021-04-20
4    ZZZ    Michael         K 2021-03-10 2021-03-31  59350.8   2021-04-30
5    ZZZ    Michael         K 2021-04-01 2021-04-29  78235.2   2021-04-30

Don't be disaffected by seemingly too low precision of Amount column. It is only the way how Jupyter Notebook displays the DataFrame.

When you run result.iloc[0, 5], you will get:

51976.933333333334

with full, actually held precision.

Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41
  • VaIdi_Bo tried to execute the function , but output is blank – Gomathi Jul 28 '21 at 13:40
  • If you executed *result = pd.concat(…)* then the actual result is in *result* variable. To see this result, execute e.g. *print(result)*. Check also the file name (whether this file actually contains the expected data). – Valdi_Bo Jul 28 '21 at 14:27
  • Another check: Instead of reading from an Excel file, load the data from e.g. a CSV file, then exeute my code. – Valdi_Bo Jul 28 '21 at 14:32