combine two complete rows if certain criteria is met

Question

I've been able to extract data from two separate xlsx and combine them into a single xlsx sheet using pandas.

I know have a table that looks like this.

Home                Start Date     Gross Earning Tax Gross Rental Commission Net Rental
3157    2020-03-26 00:00:00 -268.8  -28.8   -383.8  -36 -338.66
3157    2020-03-26 00:00:00 268.8   28.8    153.8   36  108.66
3157    2020-03-24 00:00:00 264.32  28.32   149.32  35.4    104.93
3157    2020-03-13 00:00:00 625.46  67.01   510.46  83.7675 405.4225
3157    2020-03-13 00:00:00 558.45  0   443.45  83.7675 342.9325
3157    2020-03-11 00:00:00 142.5   0   27.5    21.375  1.855
3157    2020-03-11 00:00:00 159.6   17.1    44.6    21.375  17.805
3157    2020-03-03 00:00:00 349.52  0   234.52  52.428  171.612
3157    2020-03-03 00:00:00 391.46  41.94   276.46  52.428  210.722

So if you take a look at the first two rows, the name in the Home column is the same (In this example, 3157 Tocoa) but they are also the same for the next few rows. But in the Start date column, only the first two items in that column are the same (In this case 3/26/2020 12:00:00 AM) So what i need to do is the following

If the dates are the same, and the Home is the same, then I need the sum of all of the following columns. (In this case, I would need the sum of -268.8 and 268.8, the sum of -28.8 and 28.8 and so on) It is also important to mention there are instances where there is a total of more than two matching start dates.

I will include the code I have used to get to where I am now, I would like to mention I am fairly new to python so I'm sure there is a way to do this super simple but I am just not familiar. I am also new to stackoverflow so if I am missing something or added something I should have please forgive me

import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import numpy as np
import matplotlib.pyplot as plt
import os

# class airbnb:

#Gets the location path for the reports that come raw from the channel
airbnb_excel_file = (r'C:\Users\Christopher\PycharmProjects\Reporting with 
python\Data_to_read\Bnb_feb_report.xlsx')
empty_excel_file = (r'C:\Users\Christopher\PycharmProjects\Reporting with 
python\Data_to_read\empty.xlsx')

#Defines the data frame
df_airbnb = pd.read_excel(airbnb_excel_file)
df_empty = pd.read_excel(empty_excel_file)

gross_earnings = df_airbnb['Gross Earnings']
tax_amount = df_airbnb['Gross Earnings'] * 0.06
gross_rental = df_airbnb['Gross Earnings'] - df_airbnb['Cleaning Fee']
com = ((gross_rental - tax_amount) + df_airbnb['Cleaning Fee']) * 0.15
net_rental = (gross_rental - (com + df_airbnb['Host Fee']))
house = df_airbnb['Listing']
start_date = df_airbnb['Start Date']

# df = pd.DataFrame(df_empty)
# df_empty.replace('nan', '')
#
# print(net_rental)

df_report = pd.DataFrame(
    {'Home': house, 'Start Date': start_date, 'Gross Earning': gross_earnings, 'Tax': tax_amount,
     'Gross Rental': gross_rental, 'Commission': com, 'Net Rental': net_rental})

df_report.loc[(df_report.Home == 'New house, Minutes from Disney & Attraction'), 'Home'] = '3161 
Tocoa'
df_report.loc[(df_report.Home == 'Brand-New House, located minutes from Disney 5151'), 'Home'] = 
'5151 Adelaide'
df_report.loc[(df_report.Home == 'Luxury House, Located Minutes from Disney-World 57'), 'Home'] = 
'3157 Tocoa'
df_report.loc[(df_report.Home == 'Big house, Located Minutes from Disney-World 55'), 'Home'] = '3155 
Tocoa'

df_report.sort_values(by=['Home'], inplace=True)

# writer = ExcelWriter('Final_Report.xlsx')
# df_report.to_excel(writer, 'sheet1', index=False)
# writer.save()



# class homeaway:
homeaway_excel_file = (r'C:\Users\Christopher\PycharmProjects\Reporting with 
python\Data_to_read\PayoutSummaryReport2020-03-01_2020-03-29.xlsx')
df_homeaway = pd.read_excel(homeaway_excel_file)

cleaning = int(115)

house = df_homeaway['Address']
start_date = df_homeaway['Check-in']
gross_earnings = df_homeaway['Gross booking amount']
taxed_amount = df_homeaway['Lodging Tax Owner Remits']
gross_rental = (gross_earnings - cleaning)
com = ((gross_rental-taxed_amount) + cleaning) * 0.15
net_rental = (gross_rental - (com + df_homeaway['Deductions']))

df_report2 = pd.DataFrame(
    {'Home': house, 'Start Date': start_date, 'Gross Earning': gross_earnings, 'Tax': taxed_amount,
     'Gross Rental': gross_rental, 'Commission': com, 'Net Rental': net_rental})

# writer = ExcelWriter('Final_Report2.xlsx')
# df_report2.to_excel(writer, 'sheet1', index=False)
# writer.save()


df_combined = pd.concat([df_report, df_report2])

writer = ExcelWriter('Final_Report_combined.xlsx')
df_report2.to_excel(writer, 'sheet1', index=False)
writer.save()

hi christopher, a better way around this is to share data, not pics. share ur input data (a couple of lines that represent ur question) and also share what ur expected output should be. Kindly [read this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for further guidance. — sammywemmy, Mar 31 '20 at 05:09
Hi Christopher, if you could post even a sample of you initial dataframes, I would be happy to write a solution for you. Without the data, it is difficult to determine what you are trying to do exactly. If you don't post your data, I suggest you check out Pandas ".groupby()" and ".pivot_tables()" methods. I believe you can most likely achieve the result you want with df.groupby(['Start Date', 'Home']). You may also want to check out Pandas ".to_excel()" method. Here is the link to groupby: [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) — Michael Longstreth, Mar 31 '20 at 05:32
Please forgive me but I have been trying to figure out how to add my data frames for you but I haven't been able to do so. Whenever I copy the xlsx part I want it imports it as a picture. — Christopher Gonzalez, Mar 31 '20 at 05:59

score 0 · Answer 1 · edited Mar 31 '20 at 19:31

One of possible approaches is to group by Home and Start Date and then compute sum of rows involved:

df.groupby(['Home', 'Start Date']).sum()

Fortunately, all "other" columns are numeric, so no column specification is needed.

But if there are more than 2 rows with same Home and Start Date and you want to:

break them into pairs of consecutive rows,
and then compute their sums (for each pair separately),

you should apply a "2-tier" grouping:

first tier - group by Home and Start Date (as before),
second tier - group into pairs,

and compute sums for each second-level group.

In this case the code should be:

df.groupby(['Home', 'Start Date']).apply(
    lambda grp: grp.groupby(np.arange(len(grp.index)) // 2).sum())\
    .reset_index(level=-1, drop=True)

Additional operation required here is to drop the last level of the index (reset_index).

To test this approach, e.g. add the following row to your DataFrame:

1234 Bogus Street,2020-03-26 00:00:00,20.0,2.0,15.0,3,10.0

so that 1234 Bogus Street / 2020-03-26 00:00:00 group now contains three rows.

When you run the above code, you will get:

                                       Gross Earning    Tax  Gross Rental  Commission  Net Rental
Home              Start Date                                                                     
1234 Bogus Street 2020-03-03 00:00:00         740.98  41.94        510.98     104.856     382.334
                  2020-03-11 00:00:00         302.10  17.10         72.10      42.750      19.660
                  2020-03-13 00:00:00        1183.91  67.01        953.91     167.535     748.355
                  2020-03-24 00:00:00         264.32  28.32        149.32      35.400     104.930
                  2020-03-26 00:00:00           0.00   0.00       -230.00       0.000    -230.000
                  2020-03-26 00:00:00          20.00   2.00         15.00       3.000      10.000

Note the last row. It contains:

repeated Start Date (from the previous row),
values from the added row.

And the last but one row contains sums for only two first rows with respective Home / Start Date.

Hi, I don't know if I want to "break them into pairs of consecutive rows, and then compute their sums (for each pair separately)," What I want to do is simply add the third line. — Christopher Gonzalez, Mar 31 '20 at 08:28
So use my first approach (all rows with same Home / Start Date are summed). — Valdi_Bo, Mar 31 '20 at 09:13

combine two complete rows if certain criteria is met

1 Answers1