2

I'd like to create a lookup table from a dataframe based on multiple conditions. I have the following df:

N = 100
L = ['AR1', 'PO1', 'RU1']

np.random.seed(0)

df3 = pd.DataFrame(
    {'X':np.random.uniform(1,4,N),
     'Y':np.random.uniform(1,4,N),
     'Z':np.random.uniform(1,4,N),
     'LG':np.random.choice(L,N),
    })

df3['bins_X'] = df3.groupby('LG')['X'].apply(pd.qcut, q=5, labels=np.arange(5))
df3['bins_Y'] = df3.groupby('LG')['Y'].apply(pd.qcut, q=5, labels=np.arange(5))
df3['bins_Z'] = df3.groupby('LG')['Z'].apply(pd.qcut, q=5, labels=np.arange(5))
df3['bins_X_int'] = df3.groupby('LG')['X'].apply(pd.qcut, q=5)
df3['bins_Y_int'] = df3.groupby('LG')['Y'].apply(pd.qcut, q=5)
df3['bins_Z_int'] = df3.groupby('LG')['Z'].apply(pd.qcut, q=5)

df3.head()

enter image description here

From which I want to create the following lookup_table:

enter image description here

So, grouped by 'LG' and the bins ranked from 0 to 4. What I need is the example lookup_table filled out with the associating bin_intervals from the dataframe.

Zanshin
  • 1,262
  • 1
  • 14
  • 30

2 Answers2

2

IIUC:

def get_ints(s, q):
    return pd.Series(pd.qcut(s, q).sort_values().unique())

d1 = df3.set_index('LG')[list('XYZ')].stack()
g = d1.groupby(level=[0, 1])
g.apply(get_ints, q=5).unstack(1).rename_axis(['LG', 'bin_number']).reset_index()

     LG  bin_number                X                Y                Z
0   AR1           0   [1.306, 1.926]  [1.0556, 1.875]  [1.0493, 1.819]
1   AR1           1   (1.926, 2.447]   (1.875, 2.757]   (1.819, 2.595]
2   AR1           2   (2.447, 2.812]  (2.757, 3.0724]    (2.595, 2.95]
3   AR1           3  (2.812, 3.0744]  (3.0724, 3.376]    (2.95, 3.334]
4   AR1           4  (3.0744, 3.936]   (3.376, 3.803]   (3.334, 3.885]
5   PO1           0  [1.0564, 1.286]  [1.0955, 1.566]   [1.074, 1.596]
6   PO1           1   (1.286, 1.868]   (1.566, 1.911]   (1.596, 1.895]
7   PO1           2   (1.868, 2.682]   (1.911, 2.669]   (1.895, 2.426]
8   PO1           3    (2.682, 3.29]   (2.669, 2.958]   (2.426, 3.283]
9   PO1           4    (3.29, 3.965]   (2.958, 3.676]   (3.283, 3.848]
10  RU1           0  [1.0141, 1.452]  [1.0351, 2.158]  [1.0397, 1.632]
11  RU1           1   (1.452, 1.983]    (2.158, 2.49]   (1.632, 2.223]
12  RU1           2   (1.983, 2.622]   (2.49, 3.0893]  (2.223, 3.0732]
13  RU1           3   (2.622, 3.226]  (3.0893, 3.673]  (3.0732, 3.729]
14  RU1           4   (3.226, 3.929]   (3.673, 3.997]   (3.729, 3.971]
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Thanks piR, a question though. You use qcut once again for this new table, does this one return exactly the same result as in the original qcut which I've used to create the original dataframe? – Zanshin Feb 11 '17 at 12:59
  • One other question, maybe you haven't noticed, but I have added a new question: http://stackoverflow.com/questions/42094873/incorrect-results-when-applying-solution-to-real-data, regarding the usage of your solution to another one. Could you take a look at it? – Zanshin Feb 11 '17 at 13:01
  • Yes! That means you can use the look up table to fill out the rest of `df3`. You call it 6 times. This way, you only need to call it 3 times. I'll keep working on showing that aspect. – piRSquared Feb 11 '17 at 13:01
  • Yes, I added the ints in addition to the bin_numbers, trying to figure out how to lookup an int belonging to a bin_number. I can forego on the ints now with your solution here. Note: The bin_numbers I need in functions, the ints I need to lookup for subsequent functions. In short, is a figure in a int and what is the associating bin_number. I will probably be posting another question for that in the future. – Zanshin Feb 11 '17 at 13:08
  • I have added a bounty to the question mentioned above. It concerns the solution you provided earlier. Something you might solve? – Zanshin Feb 13 '17 at 10:40
1

IIUC you can do it this way:

In [55]: lkp = df3[['LG']].sort_values('LG').copy()

In [56]: lkp['bin_number'] = lkp.groupby('LG').cumcount()

In [57]: lkp
Out[57]:
     LG  bin_number
0   AR1           0
46  AR1           1
25  AR1           2
26  AR1           3
57  AR1           4
28  AR1           5
29  AR1           6
56  AR1           7
31  AR1           8
32  AR1           9
..  ...         ...
45  RU1          24
98  RU1          25
55  RU1          26
58  RU1          27
60  RU1          28
61  RU1          29
63  RU1          30
64  RU1          31
39  RU1          32
99  RU1          33

[100 rows x 2 columns]
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • I'm sorry, I don't understand. I need to fill out the example lookup_table with the associating bin intervals. Besides that, there are only 5 bin_numbers per 'LG', you have a lot more? I'll update the question to be more clear. – Zanshin Feb 11 '17 at 12:38