1

I'm trying to process data from three (csv) files say p, c, f:

  • In p, each row has labels
  • In c, each row has scores for labels in corresponding row in p (p is matched to c)
  • In f, each row is a label and another score

For e.g., loaded into df_p, df_c and df_f respectively:

>>> df_p
       p1   p2   p3   p4   p5
2614  104  104  102  102  102
3735  100  103  101  100  104
1450  100  102  100  102  102
>>> df_c
            c1        c2        c3        c4        c5
2614  0.338295  0.190882  0.157231  0.135776  0.177816
3735  0.097800  0.124296  0.268475  0.265111  0.244319
1450  0.160922  0.403703  0.122390  0.130612  0.182373
>>> df_f
            c
100  0.183946
101  0.290311
102  0.192049
103  0.725704
104  0.143359

Algo

For each row in df_p, df_c: 
  1. update each score in df_c row with df_c * df_f[label] where label is from p
  2. reorder elements of df_c in descending scores
  3. reorder elements in df_p with order from df_c

For eg, the first calculated cell in df_c will be 0.338295*0.143359

this is the code I have that's working albeit very very slowly:

np_p = []
np_c = []
for i in range(len(df_p)):

    ## determine revised scores
    # Step 1. Revise scores
    r_conf = df_c.iloc[[i]].values[0] # scores for row
    r_place_id = df_p.iloc[[i]].values[0] # labels for row
    p_c = df_f.ix[r_place_id].c.values # class conf for labels
    t_conf = r_conf*p_c # total score

    # Reorder labels
    # Step 2. reorder by revised score
    c = np.sort(t_conf)[::-1]
    c_sort = np.argsort(t_conf)[::-1] 
    # Step 3. reorder labels with revised score order
    p_sort = df_p.iloc[[i]][df_p.columns[c_sort]].values
    np_c.append(c)
    np_p.append(p_sort)

Ideally I'd like to create a dataframe like df_p and df_c but with the reordered and revised values (in np_p and np_c).

Any ideas on how I can make this go faster.

Thanks!!!

kip6000
  • 85
  • 1
  • 6

2 Answers2

2

You could use the DateFrame.replace method to replace the values in df_p with values from df_f:

In [124]: df_pf = df_p.replace(df_f['c']); df_pf
Out[124]: 
            p1        p2        p3        p4        p5
2614  0.143359  0.143359  0.192049  0.192049  0.192049
3735  0.183946  0.725704  0.290311  0.183946  0.143359
1450  0.183946  0.192049  0.183946  0.192049  0.192049

Since Pandas aligns indices before multiplying two DataFrames, if we strip the ps and cs off the column labels, then we could obtain the desired products using df_pf.mul(df_c):

df_pf.columns = df_pf.columns.str.extract(r'(\d+)', expand=False)
df_c.columns = df_c.columns.str.extract(r'(\d+)', expand=False)
df_c = df_pf.mul(df_c)

The correct order for the columns for each row can be obtained using np.argsort with axis=1 specified. The order array returned by np.argsort can then be used to reorder df_c and df_p:

order = np.argsort(-df_c.values, axis=1)
nrows, ncols = df_c.shape
np_c = df_c.values[np.arange(nrows)[:,None], order]
np_p = df_p.values[np.arange(nrows)[:,None], order]

The above uses NumPy's advanced integer indexing to reorder the values in each row separately.


import numpy as np
import pandas as pd

df_p = pd.DataFrame({'p1': [104, 100, 100],
 'p2': [104, 103, 102],
 'p3': [102, 101, 100],
 'p4': [102, 100, 102],
 'p5': [102, 104, 102]}, index=[2614,3735,1450])

df_c = pd.DataFrame({'c1': [0.33829499999999996, 0.097799999999999998, 0.16092200000000001],
 'c2': [0.190882, 0.124296, 0.40370300000000003],
 'c3': [0.15723099999999998, 0.26847500000000002, 0.12239000000000001],
 'c4': [0.13577600000000001, 0.26511099999999999, 0.13061199999999998],
 'c5': [0.177816, 0.24431900000000001, 0.18237300000000001]}, index=[2614,3735,1450])

df_f = pd.DataFrame({'c': [0.183946,
  0.29031099999999999,
  0.192049,
  0.72570400000000002,
  0.14335899999999999]}, index=list(range(100,105)))

def using_pandas(df_p, df_c, df_f):
    # this works no matter the order of the columns and rows of `df_p` and `df_c`.
    # aligns `df_p` and `df_c` based on the numeric part of their column names
    df_pf = df_p.replace(df_f['c'])
    # change the column names to match since Pandas will align the indices before multiplying
    df_pf.columns = df_pf.columns.str.extract(r'(\d+)', expand=False)
    df_c.columns = df_c.columns.str.extract(r'(\d+)', expand=False)
    df_c = df_pf.mul(df_c)
    order = np.argsort(-df_c.values, axis=1)
    nrows, ncols = df_c.shape
    np_c = df_c.values[np.arange(nrows)[:,None], order]
    np_p = df_p.values[np.arange(nrows)[:,None], order]
    return np_c, np_p

np_c, np_p = using_pandas(df_p, df_c, df_f)
print(np_c)
print(np_p)

yields

[[ 0.04849763  0.03414938  0.03019606  0.02736465  0.02607565]
 [ 0.0902021   0.07794125  0.04876611  0.03502533  0.01798992]
 [ 0.07753076  0.03502455  0.02960096  0.0250839   0.02251315]]
[[104 102 102 104 102]
 [103 101 100 104 100]
 [102 102 100 102 100]]

Alternatively, if the columns and rows of df_p and df_c are already aligned, then you can gain a bit more speed by doing the multiplication in NumPy instead of Pandas:

def using_numpy(df_p, df_c, df_f):
    # faster than using_pandas, but assumes `df_p` and `df_c` are already aligned
    df_pf = df_p.replace(df_f['c'])
    df_pf = df_pf.values
    df_c = df_c.values
    df_p = df_p.values
    df_c = df_pf * df_c
    order = np.argsort(-df_c, axis=1)
    nrows, ncols = df_c.shape
    np_c = df_c[np.arange(nrows)[:,None], order]
    np_p = df_p[np.arange(nrows)[:,None], order]
    return np_c, np_p

For these small DataFrames, using_numpy is slightly faster than using_pandas. The difference in speed would be more pronounced if the DataFrames were larger. But again, note that using_numpy relies on the indices already being aligned.

In [138]: %timeit using_numpy(df_p, df_c, df_f)
1000 loops, best of 3: 1.15 ms per loop

In [139]: %timeit using_pandas(df_p, df_c, df_f)
1000 loops, best of 3: 1.62 ms per loop
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • .@unutbu. can you explain np.argsort(-df_c, axis=1)? – Merlin Jul 02 '16 at 21:36
  • @Merlin: [`np.argsort`](http://stackoverflow.com/q/17901218/190597) returns the indices that would sort an array. For example, `np.argsort([1,3,2])` returns `array([0, 2, 1])` since to sort `[1,3,2]` in ascending order the 0th item comes first, then the 2nd, then the 1st. `np.argsort([[1,3,2],[20,30,10]], axis=1)` is similar to row_stacking `np.argsort([1,3,2])` and `np.argsort([20,30,10])`. A minus sign is used in `np.argsort(-df_c, axis=1)` to argsort each row of `df_c` in *descending order*. – unutbu Jul 02 '16 at 21:43
  • @unutbu - works great, complete and good intuition! I had to break the dataframe into chunks to fit into memory since my data is much larger than the sample above but otherwise very succinct and elegant and I learnt a few tricks. thank you. – kip6000 Jul 02 '16 at 22:49
0

Try this: First create a dict of df_f

di = df_f['c'].to_dict()

   {100: 0.183946,
 101: 0.29031099999999999,
 102: 0.192049,
 103: 0.72570400000000002,
 104: 0.14335899999999999}

Then map it to df_p:

df_p.replace(di)

#            p1        p2        p3        p4        p5
# 2614  0.143359  0.143359  0.192049  0.192049  0.192049
# 3735  0.183946  0.725704  0.290311  0.183946  0.143359
# 1450  0.183946  0.192049  0.183946  0.192049  0.192049

Then do your multiplication:

df_c2 =df_c.copy() 
df_c2['c1'] = df_c['c1']* df_p['p1'] 
df_c2['c2'] = df_c['c2']* df_p['p2'] 
df_c2['c3'] = df_c['c3']* df_p['p3'] 
df_c2['c4'] = df_c['c4']* df_p['p4'] 
df_c2['c5'] = df_c['c5']* df_p['p5'] 

 #              c1        c2        c3        c4        c5
 #   2614  0.048498  0.027365  0.030196  0.026076  0.034149
 #   3735  0.017990  0.090202  0.077941  0.048766  0.035025
 #   1450  0.029601  0.077531  0.022513  0.025084  0.035025
Merlin
  • 24,552
  • 41
  • 131
  • 206