Compare a dataframe with a groupby one and find missing entries

Question

In the current project I am working on I need to check a df_totals dataframe against a df_fact and give an output to the user as the df_output in the image below of which lines should be deleted so that df_fact would reconcile with df_totals.

PROBLEM: Basically I want to try to find combinations of values in the df_fact['VAL'] by a company that needs to be deleted so that the SUM of VAL by COMPANY in df_fact should be equal to df_totals['TOTALS'].

I believe it is a complex solution because such an algorithm to solve it will possibly do so through the use of combinations, due to the fact that I need to delete one or more lines by company names.

What I'm trying to do

Here you can see the code I have written so far to try to solve it. I have created a merged dataframe to calculate the differences and iterate over them.

import pandas as pd

# group fact by company
df_fact_agg = df_fact.groupby('COMPANY', as_index=False).sum()
df_fact_agg.head()

# merge the two dataframes
df_compare = pd.merge(
    df_totals,
    df_fact_agg,
    left_index=True, 
    right_index=True
)

# clean the dataframe
df_compare.drop(['COMPANY_y'], axis=1, inplace=True)
df_compare.rename(columns={"COMPANY_x": "COMPANY", "TOTALS": "LEFT_TOTALS", "VAL": "RIGHT_VAL"}, inplace=True)

# create a new column to calculate the difference between values
df_compare['Diff'] = df_compare['RIGHT_VAL']-df_compare['LEFT_TOTALS']

# try to find find a way to solve it
for val in df_compare.COMPANY.unique().tolist():
    # for each unique name of company get the current difference
    difference_value = df_compare[df_compare['COMPANY']==val].Diff.values[0]
    # get all values from series 
    series_differences = df_fact[df_fact['COMPANY']==val]['VAL']
    
    print('The TOTAL difference calculated is: ', difference_value)
    print('The serie that I need to work to check whether value(s) need to be deleted is: ', series_differences.values)
    
    #####################################################################
    ########## from now on I do not know what to do #####################
    ################## please help me =) ################################
    #####################################################################

The output

The TOTAL difference calculated is: 400.0
The serie that I need to work to check whether value(s) is (are) missing or remaining is: [5000. 3000. 400.]
The TOTAL difference calculated is: 724.0
The serie that I need to work to check whether value(s) is (are) missing or remaining is: [ 50. 5432. 7894. 674.]
The TOTAL difference calculated is: 895.0 The serie that I need to work to check whether value(s) is (are) missing or remaining is: [ 896. 4578. 20. 547. 6348. 875.]
The TOTAL difference calculated is: 14451.0
The serie that I need to work to check whether value(s) is (are) missing or remaining is: [ 967. 678. 645. 87. 9564. 3564.]

The data I am using

df_totals
{'COMPANY': {0: 'COMPANY 1', 1: 'COMPANY 2', 2: 'COMPANY 3', 3: 'COMPANY 4'},
 'TOTALS': {0: 8000.0, 1: 13326.0, 2: 12369.0, 3: 1054.0}}

df_fact.columns
['_KEY', 'COMPANY', 'DOC NUM', 'VAL']

df_fact.values
[['COMPANY 1DOC 123', 'COMPANY 1', 'DOC 123', 5000.0],
 ['COMPANY 1DOC 124', 'COMPANY 1', 'DOC 124', 3000.0],
 ['COMPANY 1DOC 125', 'COMPANY 1', 'DOC 125', 400.0],
 ['COMPANY 2DOC 126', 'COMPANY 2', 'DOC 126', 50.0],
 ['COMPANY 2DOC 127', 'COMPANY 2', 'DOC 127', 5432.0],
 ['COMPANY 2DOC 128', 'COMPANY 2', 'DOC 128', 7894.0],
 ['COMPANY 2DOC 129', 'COMPANY 2', 'DOC 129', 674.0],
 ['COMPANY 3DOC 130', 'COMPANY 3', 'DOC 130', 896.0],
 ['COMPANY 3DOC 131', 'COMPANY 3', 'DOC 131', 4578.0],
 ['COMPANY 3DOC 132', 'COMPANY 3', 'DOC 132', 20.0],
 ['COMPANY 3DOC 133', 'COMPANY 3', 'DOC 133', 547.0],
 ['COMPANY 3DOC 134', 'COMPANY 3', 'DOC 134', 6348.0],
 ['COMPANY 3DOC 135', 'COMPANY 3', 'DOC 135', 875.0],
 ['COMPANY 4DOC 136', 'COMPANY 4', 'DOC 136', 967.0],
 ['COMPANY 4DOC 137', 'COMPANY 4', 'DOC 137', 678.0],
 ['COMPANY 4DOC 138', 'COMPANY 4', 'DOC 138', 645.0],
 ['COMPANY 4DOC 139', 'COMPANY 4', 'DOC 139', 87.0],
 ['COMPANY 4DOC 140', 'COMPANY 4', 'DOC 140', 9564.0],
 ['COMPANY 4DOC 141', 'COMPANY 4', 'DOC 141', 3564.0]]

df_output
{'CHECK COLUMNS': {0: nan,
  1: nan,
  2: 'DELETE',
  3: 'DELETE',
  4: nan,
  5: nan,
  6: 'DELETE',
  7: nan,
  8: nan,
  9: 'DELETE',
  10: nan,
  11: nan,
  12: 'DELETE',
  13: nan,
  14: 'DELETE',
  15: 'DELETE',
  16: nan,
  17: 'DELETE',
  18: 'DELETE'}}

I think you want the [answer here](https://stackoverflow.com/questions/3420937/algorithm-to-find-which-number-in-a-list-sum-up-to-a-certain-number). — BLimitless, Jan 28 '21 at 17:51
Thank you for this link. I believe this fit totally with my case. I really appreciate your help. — Arnold Souza, Jan 28 '21 at 19:51

Compare a dataframe with a groupby one and find missing entries

What I'm trying to do

The output

The data I am using

0 Answers0