1

I have two dataframes:

In df1, i got an order of id's assigned to people, each person can have at most 2 id's:

df1:

id1      id2
2040     0
2041     2050
2042     0
2043     0
2044     2051
2045     2052

In df2, i got a list of payments and id's for these people but not arranged:

df2:

id      amount
2040     10
2040     10
2053     5
2043     5
2052     10
2045     5

What i'm looking for is a way to create a df3 that organizes payments in the specific order of the df1 taking in consideration the possibillity of a person paying more than once: Example:

df3:

id1      id2      payment    
2040     0         20          
2041     2050      0            
2042     0         0            
2043     0         5            
2044     2051      0            
2045     2052      15           

and a df4 if there is any id in df2 that doesn't exist in df1 for future correction

df4:

id     amount
2053   5
Sunderam Dubey
  • 1
  • 11
  • 20
  • 40
  • 1
    I suggest you reconsider. Is there anything that makes `payment1` and `payment2` fundamentally different? If not, then it doesn't make sense to represent them with different columns. If one customer makes 10000 purchases of the same item, you will have a fat dataframe of mostly zeros for example – anon01 Feb 14 '21 at 21:15
  • @anon01 I think that i could in fact have just one payment column and add the values in there and if its needed, just take de raw data and see who pay what. I'll edit the question. – Tomas Arias Feb 14 '21 at 21:22
  • that would be a much more common way to do it. Can you state what exactly you're trying to achieve? – anon01 Feb 14 '21 at 21:23
  • What you're looking to do with df3 is a simple merge. That could be achieved by newdf3 = pd.merge(df1, df2, on='id'). For the other bit in df4 refer to https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe – Nirbhay Tandon Feb 14 '21 at 21:27
  • @anon01 Yes, of course, the reason i need the arranged following the order of the **df1** its because that is the order of a spreadsheet that needs to be filled f whith the data of **df2** (consisting of around 10000 rows of data once a month) – Tomas Arias Feb 14 '21 at 21:30

1 Answers1

0

UPDATED : Dataframe.replace

df3 = df1.assign(payment=df1.replace(df2.groupby('id')['amount'].sum())
                            .where(df1.isin(df2['id'].tolist()))
                            .sum(axis=1))  

%%timeit
df3 = df1.assign(payment=df1.replace(df2.groupby('id')['amount'].sum())
                            .where(df1.isin(df2['id'].tolist())).sum(axis=1))  
4.37 ms ± 500 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df3 = df1.assign(payment=df1.reset_index().melt('index')
                            .assign(value=lambda x: x.value.map(df2.groupby('id')['amount']
                                                                   .sum()))
                            .groupby('index')['value'].sum())
8.83 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

I like to use DataFrame.melt because we don't depend on the number of columns in df1

df3 = df1.assign(payment=df1.reset_index().melt('index')
                            .assign(value=lambda x: x.value.map(df2.groupby('id')
                                                                   .amount
                                                                   .sum()))
                            .groupby('index')['value'].sum())
print(df3)

    id1   id2  payment
0  2040     0     20.0
1  2041  2050      0.0
2  2042     0      0.0
3  2043     0      5.0
4  2044  2051      0.0
5  2045  2052     15.0

And df4:

df4 = df2.merge(df1.melt(), 
                left_on='id', 
                right_on='value', 
                indicator=True, 
                how='left')\
    .loc[lambda x: x._merge=='left_only', df2.columns]

print(df4)

     id  amount
2  2053       5
ansev
  • 30,322
  • 5
  • 17
  • 31