1

I have a dataframe with dates and origins and amount, now I need to find out, for the same date, the corresponding sum in origin "s" and two amounts in origin "c" summing up to get the corresponding sum in "s", if it is not clear, here is an example :

df = pd.DataFrame( {"amounts": [12, 13, 1, 14, 2, 15, 25, 29, 45], 
                       "date": ['2022/11/11', '2022/11/11', '2022/11/11', '2022/11/12','2022/11/12',   '2022/11/12',
                              '2022/11/11', '2022/11/12', '2022/11/12'], 
                      "origin": ['C', 'C', 'C', 'C','C', 'S', 'S', 'S']})

so i need to get an output precizing 12, 13, and 25 are one group and 14, 15 and 29 another group, identified by a new column with letters A, B, C, D... or other.

Can anyone give some idea please?

I tried to group by fistly the df by date and origin, and this function, but i m not sure if there is better solution, because this function do not identify numbers with a new column:

def check_sum_exists(nums, target_sum):
  for i in range(len(nums)):
       # one number sum
        if nums[i] == target_sum:
            return True
        # two numbers sum
        for j in range(i+1, len(nums)):
            if nums[i] + nums[j] == target_sum:
                return True
        # three numbers sum
            for k in range(i+2, len(nums)):
                if nums[i] + nums[j]+ nums[k] == target_sum:
                    return True
return False
N R
  • 11
  • 2
  • how many rows per origin do you have? Are you sure that the problem always has a solution? What if you can have more than one solution? This is a well known problem ([subset sum](https://en.wikipedia.org/wiki/Subset_sum_problem)) – mozway Mar 07 '23 at 14:01
  • @mozway *"Subset sum is known to be NP-hard"*, but here the subsets are constrained to have size 2, so it's definitely polynomial – Stef Mar 07 '23 at 14:03
  • @Stef that's why I asked how this generalizes, currently the question is too vague and reads a lot like a "code this for me!" – mozway Mar 07 '23 at 14:06
  • hello, i reedited the question, the number is not definitely 2 – N R Mar 07 '23 at 14:19

1 Answers1

0

I don't know how to use cool pandas functions to solve your problem. Hopefully someone else can post an answer doing that.

In the meantime, here is one way to solve your answer by grouping the data in a dict of dicts:

import pandas as pd
from itertools import combinations
from string import ascii_uppercase

df = pd.DataFrame( {"amounts": [12, 13, 14, 15, 25, 29], 
                       "date": ['2022/11/11', '2022/11/11', '2022/11/12',
                              '2022/11/12', '2022/11/11', '2022/11/12'], 
                      "origin": ['C', 'C', 'C', 'C', 'S', 'S']})

grouped = {}
for r in df.itertuples():
    grouped.setdefault(r.date, {}).setdefault(r.origin, []).append((r.Index, r.amounts))
# grouped = {'2022/11/11': {'C': [(0, 12), (1, 13)], 'S': [(4, 25)]}, '2022/11/12': {'C': [(2, 14), (3, 15)], 'S': [(5, 29)]}}

df['new column'] = 0
combination_index = 0
for g in grouped.values():
    for (i, x), (j, y) in combinations(g.get('C', ()), 2):
        for k, s in g.get('S', ()):
            if x+y == s:
                df.at[i, 'new column'] = ascii_uppercase[combination_index]
                df.at[j, 'new column'] = ascii_uppercase[combination_index]
                df.at[k, 'new column'] = ascii_uppercase[combination_index]
                combination_index += 1

print(df)

   amounts        date origin new column
0       12  2022/11/11      C          A
1       13  2022/11/11      C          A
2       14  2022/11/12      C          B
3       15  2022/11/12      C          B
4       25  2022/11/11      S          A
5       29  2022/11/12      S          B

To go further:

  • If you have more than 26 valid combinations

Note that ascii_uppercase[combination_index] will crash if there are more than 26 valid combinations in your dataframe, because the alphabet only has 26 letters. If that happens, then instead of ascii_uppercase[combination_index], I suggest using to_excel(combination_index), where to_excel is the function from Convert a number to Excel’s base 26?. This will result in the sequence of strings A, B, ..., Z, AA, AB, ..., AZ, BA, ..., ZZ, AAA, ...

  • If you want to sum combinations of more than 2 elements

Instead of iterating on pairs of rows with combinations(___, 2), you can use function powerset from itertools recipes. Fair warning: the number of combinations tested by combinations(___, 2) is quadratic in the number of elements, but the number of combinations tested by powerset(___) is exponential. Only use powerset if the number of elements that have the same date and origin 'C' is small.

  • If there are many values with same date and origin 'S'

The code above has an inner loop for k, s in g.get('S', ()): that searches for x+y in g['S'] by iterating over every element of g['S']. This can be pretty efficient if g['S'] is large. One way to solve this problem is by using the sum as a dict key instead of as a dict value; this way you can test directly if x+y in g['S'] without the need for a loop.

Stef
  • 13,242
  • 2
  • 17
  • 28
  • If I have to calculate the sum of g['S']'s if there a better way than reusing combinaisons to have x+y==s1+s2? actually the number of elements in the two sides to sum and compare are random. – N R Mar 09 '23 at 14:42
  • @NR I don't understand, what are `s1` and `s2`? Obviously I don't know everything about your setting, only what you explained in your post; but you can use `powerset` instead of `combinations` to try larger subsets. – Stef Mar 09 '23 at 15:14