I'm trying to process data from three CSV files, say p, c and f:
- In p, each row has labels
- In c, each row has scores for the labels in the corresponding row of p (rows of p and c are matched)
- In f, each row is a label and another score
For example, loaded into df_p, df_c and df_f respectively:
>>> df_p
       p1   p2   p3   p4   p5
2614  104  104  102  102  102
3735  100  103  101  100  104
1450  100  102  100  102  102
>>> df_c
            c1        c2        c3        c4        c5
2614  0.338295  0.190882  0.157231  0.135776  0.177816
3735  0.097800  0.124296  0.268475  0.265111  0.244319
1450  0.160922  0.403703  0.122390  0.130612  0.182373
>>> df_f
            c
100  0.183946
101  0.290311
102  0.192049
103  0.725704
104  0.143359
Algorithm:
For each pair of matching rows in df_p and df_c:
1. multiply each score in the df_c row by df_f[label], where label is the entry at the same position in the df_p row
2. reorder the elements of the df_c row by descending revised score
3. reorder the elements of the df_p row using the same order
For example, the first calculated cell in df_c will be 0.338295*0.143359 (the c1 score of row 2614 times the df_f score for label 104).
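To make the three steps concrete, here is a minimal vectorized sketch in NumPy. It rebuilds the sample data shown above and assumes every label in df_p appears exactly once in the index of df_f. The names labels, factors, revised, order, sorted_c and sorted_p are only illustrative, and rows with tied scores may come out in a different order than with a reversed ascending sort.

```python
import numpy as np
import pandas as pd

# Sample data copied from the example above.
df_p = pd.DataFrame(
    [[104, 104, 102, 102, 102],
     [100, 103, 101, 100, 104],
     [100, 102, 100, 102, 102]],
    index=[2614, 3735, 1450], columns=["p1", "p2", "p3", "p4", "p5"])
df_c = pd.DataFrame(
    [[0.338295, 0.190882, 0.157231, 0.135776, 0.177816],
     [0.097800, 0.124296, 0.268475, 0.265111, 0.244319],
     [0.160922, 0.403703, 0.122390, 0.130612, 0.182373]],
    index=[2614, 3735, 1450], columns=["c1", "c2", "c3", "c4", "c5"])
df_f = pd.DataFrame(
    {"c": [0.183946, 0.290311, 0.192049, 0.725704, 0.143359]},
    index=[100, 101, 102, 103, 104])

labels = df_p.to_numpy()

# Step 1: look up the df_f score of every label in one call, then multiply.
factors = df_f.loc[labels.ravel(), "c"].to_numpy().reshape(labels.shape)
revised = df_c.to_numpy() * factors

# Step 2: per-row column order that sorts the revised scores descending.
order = np.argsort(-revised, axis=1)
sorted_c = np.take_along_axis(revised, order, axis=1)

# Step 3: apply the same per-row order to the labels.
sorted_p = np.take_along_axis(labels, order, axis=1)
```

Here revised[0, 0] is the 0.338295*0.143359 cell mentioned above, and each row of sorted_p lines up with the same row of sorted_c.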
This is the code I have; it works, albeit very, very slowly:
import numpy as np

np_p = []
np_c = []
for i in range(len(df_p)):
    # Step 1. Revise scores
    r_conf = df_c.iloc[[i]].values[0]          # scores for row
    r_place_id = df_p.iloc[[i]].values[0]      # labels for row
    # .loc replaces the deprecated .ix (label-based lookup on df_f's index)
    p_c = df_f.loc[r_place_id].c.values        # class conf for labels
    t_conf = r_conf * p_c                      # total score
    # Step 2. Reorder by revised score (descending)
    c = np.sort(t_conf)[::-1]
    c_sort = np.argsort(t_conf)[::-1]
    # Step 3. Reorder labels with revised score order
    p_sort = df_p.iloc[[i]][df_p.columns[c_sort]].values
    np_c.append(c)
    np_p.append(p_sort)
Ideally I'd like to create dataframes like df_p and df_c but with the reordered and revised values (in np_p and np_c).
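As a rough sketch of that last step, the per-row arrays collected in the two lists could be stacked and wrapped back into dataframes. The values below are made-up stand-ins for what the loop would produce, and row_index, df_c_new and df_p_new are hypothetical names; in practice the index and columns would come from df_c and df_p.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the lists filled by the loop (two rows, two columns).
np_c = [np.array([0.5, 0.2]), np.array([0.9, 0.1])]
np_p = [np.array([[104, 102]]), np.array([[100, 103]])]  # shape (1, n), as the loop appends
row_index = [2614, 3735]  # hypothetical; would be df_p.index

# np.vstack accepts both 1-D rows and (1, n) rows and yields one 2-D array.
df_c_new = pd.DataFrame(np.vstack(np_c), index=row_index, columns=["c1", "c2"])
df_p_new = pd.DataFrame(np.vstack(np_p), index=row_index, columns=["p1", "p2"])
```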
Any ideas on how I can make this go faster?
Thanks!!!