1

I am working with a panda dataframe where I have lists as individual cell elements (for few columns). I want to check a condition for every element in a list in one column and select the corresponding list element from other column. I know this can be easily done using zip command by something like:

p = 5 ; q = 6; DF['Column3'] = [[b for a, b in zip(x, y) if a > p and a <q ] for x, y in zip(DF['Column1'], DF['Column2'])] However I am not sure how to work with percentiles here i.e. instead of a fixed p and q I want to use some percentile value of the list (say 50 (equivalent to p) percentile to 90 percentile (equivalent to q)).

So for every cell (each cell consisting a list) in a column it should calculate the precentile values and check for corresponding list elements from other list (from corresponding cell) in the other column.

To explain the problem with example (let's say DF):

A B Column1 Column2
0 3.4 5.7 [2.1, 2.9, 5.2, 6.8] [2.5,3.4,1.2,5.1]
1 4 1.7 [1.1, 2.5, 5.6, 11.5, 15.6, 21.5] [12.15,1.58,5.4,1.2,34.2,67.2]

50-Percentile DF['Column1'][0] is 4.05 and 90 percentile is 6.32. So, in Column1 third value satisfies this condition. Corresponding to this value in Column2 is 1.2. Hence Column3 should have an output list with 1.2. Similar procedure for next row too (p = 8.55, q =18.55) :

A B Column1 Column2 Column3
0 3.4 5.7 [2.1, 2.9, 5.2, 6.8] [2.5,3.4,1.2,5.1] [1.2]
1 4 1.7 [1.1, 2.5, 5.6, 11.5, 15.6, 21.5] [12.15,1.58,5.4,1.2,34.2,67.2] [1.2,34.2]
PKL
  • 55
  • 5

1 Answers1

1

You can use numpy.percentile to get the two values for your range. Then, use list comprehension across columns (by passing axis=1).

As a one liner, you can do:

df['Column3'] = (df.assign(Column3=df['Column1'].apply(lambda x: np.percentile(x, [50, 90])))
                   .apply(lambda x: [b for (a,b) in zip(x['Column1'], x['Column2']) 
                                     if x['Column3'][0] < a < x['Column3'][1]], axis=1))

Breaking the steps into more detail:

df = pd.DataFrame(
{'A' : [3.4,4],
'B' : [5.7, 1.7],
'Column1' : [[2.1, 2.9, 5.2, 6.8], [1.1, 2.5, 5.6, 11.5, 15.6, 21.5]],
'Column2' : [[2.5,3.4,1.2,5.1],[12.15,1.58,5.4,1.2,34.2,67.2]]})
df['Column3'] = df['Column1'].apply(lambda x: np.percentile(x, 50))
df['Column4'] = df['Column1'].apply(lambda x: np.percentile(x, 90))
df['Column5'] = df.apply(lambda x: [b for (a,b) in zip(x['Column1'], x['Column2']) 
                                    if x['Column3'] < a < x['Column4']], axis=1)
df
Out[1]: 
     A    B                            Column1  \
0  3.4  5.7               [2.1, 2.9, 5.2, 6.8]   
1  4.0  1.7  [1.1, 2.5, 5.6, 11.5, 15.6, 21.5]   

                               Column2  Column3  Column4      Column5  
0                 [2.5, 3.4, 1.2, 5.1]     4.05     6.32        [1.2]  
1  [12.15, 1.58, 5.4, 1.2, 34.2, 67.2]     8.55    18.55  [1.2, 34.2]  

From there, you can do:

 df = df.drop(['Column3', 'Column4'], axis=1)
David Erickson
  • 16,433
  • 2
  • 19
  • 35