Below is the requirement.
There are 2 tables: brand_df
(Brands' value) and score_df
(containing Subject score for each brand). [Generating samples below]
brand_dict = {'Brand': ['a', 'b', 'c', 'g', 'e', 'i', 'd', 'h', 'f'], 'Value': [2, 1, 3, 1, 4, 1, 3, 2, 3]}
brand_df = pd.DataFrame(brand_dict, columns=['Brand', 'Value'])
score_dict = {'a' : [4,4,3,4,0], 'b':[0,2,0,1,0], 'c':[4,0,0,4,3], 'd':[3,2,0,3,1], 'e':[0,0,2,1,0], 'f':[2,0,3,0,0], 'g':[2,3,0,0,1], 'h':[3,0,0,1,3], 'i':[0,3,3,1,0] }
score_df = pd.DataFrame(score_dict, columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])
score_df.index = ['sub1', 'sub2', 'sub3', 'sub4', 'sub5']
# brand_df output
# Brand Value
# 0 a 2
# 1 b 1
# 2 c 3
# 3 g 1
# 4 e 4
# 5 i 1
# 6 d 3
# 7 h 2
# 8 f 3
# score_df output
# a b c d e f g h i
# sub1 4 0 4 3 0 2 2 3 0
# sub2 4 2 0 2 0 0 3 0 3
# sub3 3 0 0 0 2 3 0 0 3
# sub4 4 1 4 3 1 0 0 1 1
# sub5 0 0 3 1 0 0 1 3 0
What is being done :-
- Pick only the top brands that make 75% of the cumulative value
brand_df.sort_values("Value", axis=0, ascending=False, inplace=True, na_position='first') # ordered Desc to get biggest brands
brand_df['cum_percent'] = (brand_df.Value.cumsum() / brand_df.Value.sum()) * 100
# brand_df Output
# Brand Value cum_percent
# e 4 20.0
# c 3 35.0
# d 3 50.0
# f 3 65.0
# a 2 75.0
# h 2 85.0
# b 1 90.0
# g 1 95.0
# i 1 100.0
selbrand = []
for index, row in brand_df.iterrows():
if row['cum_percent'] <= 75:
selbrand.append(row['Brand'])
# selbrand output
# ['e', 'c', 'd', 'f', 'a']
- Pick the subjects where 75% of the selected brand has a score (i.e > 0)
# Setting threshold for subject selection. [75% percent of the selected brands must have a score]
threshold = math.trunc(len(selbrand)*0.75) # In this case 75% of 5 is 3.75 and hence rounded to 3
selsub = []
for index, row in score_df.iterrows():
count = 0
for col in selbrand:
if row[col] > 0:
count += 1
if count >= threshold:
selsub.append(index)
I get the below output using the above scripts.
print(selsub)
['sub1', 'sub3', 'sub4'] # Only these subjects have score for at least 3 of the selected brands
What is a more efficient or shorter way to achieve this?