1

I've spent the last hour doing some data entry and have hit a brick wall in Python now.

Basically I have a set of data in JSON, where I want to sum the values from field price to add up to a certain value (14.0 in my case). The final result should maximise the sum of the return field. Here's an example of my dataset (there are more teams and fields):

[
  { "team": "England", "price": 7.0, "return": 2.21 },
  { "team": "Belgium", "price": 7.0, "return": 2.27 },
  { "team": "Spain", "price": 6.0, "return": 2.14 },
  { "team": "Slovakia", "price": 1.0, "return": 0.97 }
]

So in this case, there are 3 possible answers:

a) England, Belgium (4.48)

b) England, Spain, Slovakia (5.28)

c) Belgium, Spain, Slovakia (5.38)

With c) being the optimum because it has the biggest sum of return (5.38). I would like to use Python to implement the solution.

I've had a look at this question, but can't seem to figure out how to implement it in my case: Finding all possible combinations of numbers to reach a given sum

clattenburg cake
  • 1,096
  • 3
  • 19
  • 40
  • How are you forming the groups? On what basis did you club England Belgium first and then rest? Can you clarify a bit. This can be attempted using pandas – Mahendra Singh Jun 11 '21 at 07:21
  • Apologies... I should clarify, the dataset is just an array of objects. There are 24 objects in the array, instead of 4 that I have given in the example – clattenburg cake Jun 11 '21 at 07:23
  • @MahendraSingh the sum of the ```price``` in each group is equal to ```14``` – 99_m4n Jun 11 '21 at 07:24
  • 3
    This problem is not simple. You need to iterate all combinations of size n with the above array which is n*(2^size). It will exponentially increase with the size of the array. It will be quite computationally intensive. – Vishnudev Krishnadas Jun 11 '21 at 07:42
  • 2
    Your problem is an instance of the [Bin packing problem](https://en.wikipedia.org/wiki/Bin_packing_problem). The general case is NP-hard and a field of active research, so I am not _too_ surprised that you hit a brick wall. As @Vishnudev suggests, the general solution is to iterate over all possible combinations and pick out the ones that apply and save the maximum. – FirefoxMetzger Jun 11 '21 at 07:53
  • @FirefoxMetzger Thanks, do you know some boilerplate code I can refer to that allows me to iterate over all the combos? – clattenburg cake Jun 11 '21 at 08:17

4 Answers4

2

NOTE: This solution takes one assumption which is unique team names in the data for doing indexing in the converted dataframe.

Firstly, Convert your JSON data to a pandas dataframe.

from itertools import combinations
import pandas as pd

data = [
    {"team": "England", "price": 7.0, "return": 2.21 },
    {"team": "Belgium", "price": 7.0, "return": 2.27 },
    {"team": "Spain", "price": 6.0, "return": 2.14 },
    { "team": "Slovakia", "price": 1.0, "return": 0.97 }
    ]

df = pd.DataFrame(data)

       team  price  return
0   England    7.0    2.21
1   Belgium    7.0    2.27
2     Spain    6.0    2.14
3  Slovakia    1.0    0.97

Convert team column to list

teams = df['team'].tolist()
['England', 'Belgium', 'Spain', 'Slovakia']

Next, we generate all possible combinations from the teams list

all_team_combinations = []

for i in range(1, len(teams)):
  all_team_combinations.extend(list(combinations(teams, i)))
  i += 1

Now, we check for price constraints

price_threshold = 14
team_combinations_with_price_constraint = [c for c in all_team_combinations if df.loc[df['team'].isin(list(c)), 'price'].sum() == price_threshold]

print(team_combinations_with_price_constraint)

[('England', 'Belgium'), ('England', 'Spain', 'Slovakia'), ('Belgium', 'Spain', 'Slovakia')]

Next, we calculate sum of returns from the combinations with constraint

combinations_return_sum = [round(df.loc[df['team'].isin(list(c)), 'return'].sum(), 3) for c in team_combinations_with_price_constraint]

print(combinations_return_sum)
[4.48, 5.32, 5.38]

Finally, use the index of max return sum value to get the desired combination

team_combinations_with_price_constraint[combinations_return_sum.index(max(combinations_return_sum))]

which yields

('Belgium', 'Spain', 'Slovakia')

For checking the combination-return_sum map you can create a dictionary like this.

combination_return_map = dict(zip(team_combinations_with_price_constraint, combinations_return_sum))

print(combination_return_map)

{('England', 'Belgium'): 4.48, ('England', 'Spain', 'Slovakia'): 5.32, ('Belgium', 'Spain', 'Slovakia'): 5.38}
Mahendra Singh
  • 508
  • 2
  • 9
1

Building on the previous SO solution of subset sum, and using pandas

I use pandas to handle indexing data I didn't get if you can pick England twice in your example, but i went ahead as if you could, solving it using pandas and itertools, pandas can be omitted.

import pandas as pd
from itertools import product, groupby

# i use pandas to handle indexing data
your_json = [
  { "team": "England", "price": 7.0, "return": 2.21 },
  { "team": "Belgium", "price": 7.0, "return": 2.27 },
  { "team": "Spain", "price": 6.0, "return": 2.14 },
  { "team": "Slovakia", "price": 1.0, "return": 0.97 }
]
your_data = pd.DataFrame(your_json)

#Copied iterator from previous SO solution. It generates values that sum to your target

def subset_sum(numbers, target, partial=[], partial_sum=0):
    if partial_sum == target:
        yield partial
    if partial_sum >= target:
        return
    for i, n in enumerate(numbers):
        remaining = numbers[i + 1:]
        yield from subset_sum(remaining, target, partial + [n], partial_sum + n)

#To get the indexes that match the price values, the solutions values are iterated over:

soltion_indexes =[]
for solution_values in subset_sum( your_data.price, 14):
    possible_index= []
    for value in solution_values:
        #indexes that have the right value are added to list of possible indexes for this solution
        possible_index.append( your_data[your_data.price == value].index.tolist() )
    # in order to get all combinations, product from itertools is used
    listed_posible_indexes = list(product(*(possible_index)))
    # if indexes not allready in solution, and it does not contain the same row twince, they are added to sultion indexes. 
    for possible_indexes in  listed_posible_indexes:
        possible_solution_indexes = sorted(list(possible_indexes))
        if possible_solution_indexes not in soltion_indexes and not any(
            possible_solution_indexes.count(x) > 1 for x in possible_solution_indexes) :
            soltion_indexes.append(possible_solution_indexes)

#Then pull the rows for each index in the solution indexes, to create a dataframe with the full rows for your solutions, including return.

i=0 
all_solutions= pd.DataFrame()
for combinations in soltion_indexes:
    i+=1
    solution = your_data.iloc[combinations]
    solution["solution_number"]= i 
    all_solutions = pd.concat([all_solutions,solution])

#Then the the sum of return for each group is found:

ranked_groups_by_return = all_solutions.groupby("solution_number")['return'].sum().sort_values()

#the the best group is found and printed:

best = all_solutions[all_solutions.solution_number == ranked_groups_by_return.index[-1]]
print(best)

       team  price  return  solution_number
1   Belgium    7.0    2.27                3
2     Spain    6.0    2.14                3
3  Slovakia    1.0    0.97                3
Søren
  • 81
  • 3
1

We need to iterate all the combinations of size n which is less than the size of the number of elements in the array to capture all possible combinations. Then just apply your conditions to get the combination with maximum returns.

from itertools import combinations

data = [
  { "team": "England", "price": 7.0, "return": 2.21 },
  { "team": "Belgium", "price": 7.0, "return": 2.27 },
  { "team": "Spain", "price": 6.0, "return": 2.14 },
  { "team": "Slovakia", "price": 1.0, "return": 0.97 }
]

sum_data = []
COMB_SUM = 14  # Desired combination sum

max_combi = None
max_sum_return = float('-inf')  # Lowest possible value as temporary maximum

for i in range(len(data), 0, -1):  # 4, 3, 2, 1
    combsi = list(combinations(data, i))  # Combinations of size n
    for index, combi in enumerate(combsi):
        if sum(item['price'] for item in combi) == COMB_SUM:
            sum_return = sum(item['return'] for item in combi)
            if sum_return > max_sum_return:
                max_sum_return = sum_return
                max_combi = combi

print(max_combi)
print(max_sum_return)

Output

(
    {'team': 'Belgium', 'price': 7.0, 'return': 2.27},
    {'team': 'Spain', 'price': 6.0, 'return': 2.14},
    {'team': 'Slovakia', 'price': 1.0, 'return': 0.97}
)
5.38
Vishnudev Krishnadas
  • 10,679
  • 2
  • 23
  • 55
0

Sure, I'd go with something like this

import numpy.ma as ma
import numpy as np
import pandas as pd

df = pd.DataFrame([
  { "team": "England", "price": 7.0, "return": 2.21 },
  { "team": "Belgium", "price": 7.0, "return": 2.27 },
  { "team": "Spain", "price": 6.0, "return": 2.14 },
  { "team": "Slovakia", "price": 1.0, "return": 0.97 }
])

price_limit = 14
powers_of_two = np.array([1<<n for n in range(len(df))])
combinations = (np.arange(2**len(df))[:, None] & powers_of_two)[1:].astype(bool)

prices = ma.masked_array(np.tile(df.price, (len(combinations), 1)), mask=~combinations)
valid_combinations = (prices.sum(axis=-1) == price_limit)

returns = ma.masked_array(np.tile(df["return"], (len(combinations), 1)), mask=~(valid_combinations[:, None] & combinations))

best = np.argmax(returns.sum(axis=-1))

print(f"Best combination (price={df['price'][combinations[best]].sum():.0f}): {' + '.join(df.team[combinations[best]].to_list())} = {df['return'][combinations[best]].sum():.2f}")
# prints: Best combination (price=14): Belgium + Spain + Slovakia = 5.38

This is a bit liberal in terms of memory usage, but that can be improved by simply restriding df.price and df.return instead of tiling them

FirefoxMetzger
  • 2,880
  • 1
  • 18
  • 32
  • Note: In many practical applications one is not (just) interested in the maximum under the equality constraint `(prices.sum(axis=-1) == price_limit)`, but rather the maximum under the inequality constraint `(prices.sum(axis=-1) <= price_limit)`. – FirefoxMetzger Jun 11 '21 at 09:50