I have a dictionary whose values are sets of integers:
{'A': {9, 203, 404, 481},
 'B': {9},
 'C': {110},
 'D': {9, 314, 426},
 'E': {59, 395, 405}}
You can generate the data with this:
import string
import numpy as np

data = {}
for i in string.ascii_uppercase:
    n = 25
    rng = np.random.default_rng()
    data[i] = set(rng.choice(100, size=n, replace=False))
I need to get a list of the intersection of subsets of the dictionary. For example, the intersection of ['A','B','D'] would return [9].
I've figured out two different ways of doing this, but both are much too slow when the sets grow in size.
from functools import reduce

cols = ['A','B','D']

# method 1
lis = list(map(data.get, cols))
idx = list(set.intersection(*lis))

# method 2 (10x slower than method 1)
query_dict = dict((k, data[k]) for k in cols)
idx2 = list(reduce(set.intersection, (set(val) for val in query_dict.values())))
When the sets grow (>10k ints per set), the runtime grows quickly.
I'm okay with using other data types than sets in the dict, such as lists or NumPy arrays.
Is there a faster way of accomplishing this?
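For reference, this is the direction I mean by other data types: a sketch (not benchmarked) that converts each set once to a sorted NumPy array of unique values, so np.intersect1d can run with assume_unique=True, starting from the smallest array:

import numpy as np
from functools import reduce

# one-time conversion: sorted NumPy arrays of unique ints instead of sets
arr_data = {k: np.array(sorted(v), dtype=np.int64) for k, v in data.items()}

def intersect_cols(cols):
    # intersect the smallest arrays first so the intermediates stay small;
    # np.intersect1d returns a sorted unique array, so assume_unique stays valid
    arrs = sorted((arr_data[c] for c in cols), key=len)
    return reduce(lambda a, b: np.intersect1d(a, b, assume_unique=True), arrs)

intersect_cols(['A','B','D']) would then replace the set-based lookup above.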
EDIT:
The original problem I had was this dataframe:
        T      S   A   B   C   D
0  49.378  1.057  AA  AB  AA  AA
1   1.584  1.107  BC  BA  AA  AA
2   1.095  0.000  BB  BB  AD
3  10.572  1.224  BA  AB  AA  AA
4   0.000  0.000  DC  BA  AB
For each row I have to sum 'T' over all rows that have A, B, C, D in common; if a threshold is reached I move on, otherwise I sum over rows with B, C, D in common, then over C, D, and finally over only D if the threshold is still not reached.
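To make that concrete, here is a sketch of the per-row logic (not my actual code; the threshold of 100 is the one from the groupby solution below):

def row_T_sum(df, i, threshold=100):
    # try ever-broader column subsets until the summed 'T' reaches the threshold
    row = df.loc[i]
    for part in (['A','B','C','D'], ['B','C','D'], ['C','D'], ['D']):
        mask = (df[part] == row[part]).all(axis=1)  # rows matching on these columns
        total = df.loc[mask, 'T'].sum()
        if total > threshold:
            break
    return total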
Doing this per row was really slow, so I first tried get_dummies and then taking the product of columns. That was also too slow, so I moved to NumPy arrays with indices to sum over. That is the fastest option up till now; the intersection is the only thing that still takes up too much time to compute.
EDIT2:
It turned out I was making it too hard on myself: it is possible with a pandas groupby, and that is very fast.
code:
import pandas as pd

parts = [['A','B','C','D'], ['B','C','D'], ['C','D'], ['D']]
for part in parts:
    temp_df = df.groupby(part, as_index=False).sum()
    temp_df = temp_df[temp_df['T'] > 100]
    df = pd.merge(df, temp_df, on=part, how='left', suffixes=["", "_" + "".join(part)])
df['T_sum'] = df[['T_ABCD','T_BCD','T_CD','T_D']].min(axis=1)
df['S_sum'] = df[['S_ABCD','S_BCD','S_CD','S_D']].min(axis=1)
df.drop(['T_ABCD','T_BCD','T_CD','T_D','S_ABCD','S_BCD','S_CD','S_D'], axis=1, inplace=True)
The code could probably be a bit cleaner, but I don't know how to replace only the NaN values after a merge.
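For what it's worth, Series.fillna replaces only the missing entries, so one option might be to chain it instead of taking the min; this picks the most specific sum that passed the threshold, which matches the min as long as 'T' and 'S' are non-negative:

df['T_sum'] = df['T_ABCD'].fillna(df['T_BCD']).fillna(df['T_CD']).fillna(df['T_D'])
df['S_sum'] = df['S_ABCD'].fillna(df['S_BCD']).fillna(df['S_CD']).fillna(df['S_D'])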