2

I have a couple of data frames. I want to get data from 2 columns from first data frame for marking the rows that are present in second data frame. First data frame (df1) looks like this

Sup4 Seats  Primary Seats   Back up Seats
 Pa   3       2              1
 Ka   2       1              1
 Ga   1       0              1
 Gee  1       1              0
 Re   2       2              0

(df2) looks like

Sup4    First   Last  Primary Seats     Backup Seats  Rating
Pa      Peter   He          NaN         NaN           2.3
Ka      Sonia   Du          NaN         NaN           2.99
Ga      Agnes   Bla         NaN         NaN           3.24
Gee    Jeffery  Rus         NaN         NaN           3.5
Gee    John     Cro         NaN         NaN           1.3
Pa     Pavol    Rac         NaN         NaN           1.99
Pa     Ciara    Lee         NaN         NaN           1.88
Re     David    Wool        NaN         NaN           2.34
Re     Stefan   Rot         NaN         NaN           2
Re     Franc    Bor         NaN         NaN           1.34
Ka     Tania    Le          NaN         NaN           2.35

the output i require for each Sup4 name is to be grouped also by sorting the Rating from highest to lowest and then mark the columns for seats based on the df1 columns Primary Seats and Backup seats.

i did grouping and sorting for first Sup4 name Pa for sample and i have to do for all the names

Sup4    First   Last      Primary Seats   Backup Seats  Rating
Pa      Peter   He                  M                     2.3
Pa      Pavol   Rac                 M                     1.99
Pa      Ciara   Lee                           M           1.88
Ka      Sonia   Du                  M                     2.99
Ka      Tania   Le                            M           2.35
Ga      Agnes   Bla                           M           3.24
:
:
:

continues like this

I have tried until grouping and sorting

sorted_df = df2.sort_values(['Sup4','Rating'],ascending=[True,False])

however i need help to pass df1 columns values to mark in second dataframe

Vij
  • 25
  • 4

1 Answers1

1

Solution #1:

You can do a merge, but you need to include some logic to update your Seats columns. Also, it is important to mention that you need to decide what to do with data with unequal lengths. ~GeeandRe` have unequal lengths in both dataframes. More information in Solution #2:

df3 = (pd.merge(df2[['Sup4', 'First', 'Last', 'Rating']], df1, on='Sup4')
         .sort_values(['Sup4', 'Rating'], ascending=[True, False]))
s = df3.groupby('Sup4', sort=False).cumcount() + 1
df3['Backup Seats'] = np.where(s - df3['Primary Seats'] > 0, 'M', '')
df3['Primary Seats'] = np.where(s <= df3['Primary Seats'], 'M', '')
df3 = df3[['Sup4', 'First', 'Last', 'Primary Seats', 'Backup Seats', 'Rating']]
df3
Out[1]: 
   Sup4    First  Last Primary Seats Backup Seats  Rating
5    Ga    Agnes   Bla                          M    3.24
6   Gee  Jeffery   Rus             M                  3.5
7   Gee     John   Cro                          M     1.3
3    Ka    Sonia    Du             M                 2.99
4    Ka    Tania    Le                          M    2.35
0    Pa    Peter    He             M                  2.3
1    Pa    Pavol   Rac             M                 1.99
2    Pa    Ciara   Lee                          M    1.88
8    Re    David  Wool             M                 2.34
9    Re   Stefan   Rot             M                  2.0
10   Re    Franc   Bor                          M    1.34

Solution #2:

After doing this solution, I realized Solution #1 would be much simpler, but I thought I mine as well include this. Also, this gives you insight on what do with values that had unequal size in both dataframes. You can reindex the first dataframe and use combine_first() but you have to do some preparation. Again, you need to decide what to do with data with unequal lengths. In my answer, I have simply excluded Sup4 groups with unequal lengths to guarantee that the indices align when finally calling combine_first():

# Purpose of `mtch` is to check if rows in second dataframe are equal to the count of seats in first.
# If not, then I have excluded the `Sup4` with unequal lengths in both dataframes
mtch = df1.groupby('Sup4')['Seats'].first().eq(df2.groupby('Sup4').size())
df1 = df1.sort_values('Sup4', ascending=True)[df1['Sup4'].isin(mtch[mtch].index)]
df1 = df1.reindex(df1.index.repeat(df1['Seats'])).reset_index(drop=True)

#`reindex` the dataframe, get the cumulative count, and manipulate data with `np.where`
df1 = df1.reindex(df1.index.repeat(df1['Seats'])).reset_index(drop=True)
s = df1.groupby('Sup4').cumcount() + 1
df1['Backup Seats'] = np.where(s - df1['Primary Seats'] > 0, 'M', '')
df1['Primary Seats'] = np.where(s <= df1['Primary Seats'], 'M', '')

#like df1, in df2 we exclude groups with uneven lengths and sort
df2 = (df2[df2['Sup4'].isin(mtch[mtch].index)]
       .sort_values(['Sup4', 'Rating'], ascending=[True, False]).reset_index(drop=True))

#can use `combine_first` since we have ensured that the data is sorted and of equal lengths in both dataframes
df3 = df2.combine_first(df1)

#order columns and only include required columns
df3 = df3[['Sup4', 'First', 'Last', 'Primary Seats', 'Backup Seats', 'Rating']]
df3
Out[1]: 
  Sup4  First Last Primary Seats Backup Seats  Rating
0   Ga  Agnes  Bla                          M    3.24
1   Ka  Sonia   Du             M                 2.99
2   Ka  Tania   Le                          M    2.35
3   Pa  Peter   He             M                  2.3
4   Pa  Pavol  Rac             M                 1.99
5   Pa  Ciara  Lee                          M    1.88
David Erickson
  • 16,433
  • 2
  • 19
  • 35
  • Thanks a lot for helping me here David Erickson, I have tried with Solution #1and it worked well, going to try Solution #2 as well. Appreciate your time and recommendations. – Vij Jan 05 '21 at 19:03