Using zip and percentile together - Python

Question

I am working with a panda dataframe where I have lists as individual cell elements (for few columns). I want to check a condition for every element in a list in one column and select the corresponding list element from other column. I know this can be easily done using zip command by something like:

p = 5 ; q = 6; DF['Column3'] = [[b for a, b in zip(x, y) if a > p and a <q ] for x, y in zip(DF['Column1'], DF['Column2'])] However I am not sure how to work with percentiles here i.e. instead of a fixed p and q I want to use some percentile value of the list (say 50 (equivalent to p) percentile to 90 percentile (equivalent to q)).

So for every cell (each cell consisting a list) in a column it should calculate the precentile values and check for corresponding list elements from other list (from corresponding cell) in the other column.

To explain the problem with example (let's say DF):

	A	B	Column1	Column2
0	3.4	5.7	[2.1, 2.9, 5.2, 6.8]	[2.5,3.4,1.2,5.1]
1	4	1.7	[1.1, 2.5, 5.6, 11.5, 15.6, 21.5]	[12.15,1.58,5.4,1.2,34.2,67.2]

50-Percentile DF['Column1'][0] is 4.05 and 90 percentile is 6.32. So, in Column1 third value satisfies this condition. Corresponding to this value in Column2 is 1.2. Hence Column3 should have an output list with 1.2. Similar procedure for next row too (p = 8.55, q =18.55) :

	A	B	Column1	Column2	Column3
0	3.4	5.7	[2.1, 2.9, 5.2, 6.8]	[2.5,3.4,1.2,5.1]	[1.2]
1	4	1.7	[1.1, 2.5, 5.6, 11.5, 15.6, 21.5]	[12.15,1.58,5.4,1.2,34.2,67.2]	[1.2,34.2]

Please include sample input data and output data (no images). Please see: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples on how you can ask a good pandas question that is reproducible. — David Erickson, Jan 19 '21 at 20:53

David Erickson · Accepted Answer · 2021-01-19T21:50:29.010

You can use numpy.percentile to get the two values for your range. Then, use list comprehension across columns (by passing axis=1).

As a one liner, you can do:

df['Column3'] = (df.assign(Column3=df['Column1'].apply(lambda x: np.percentile(x, [50, 90])))
                   .apply(lambda x: [b for (a,b) in zip(x['Column1'], x['Column2']) 
                                     if x['Column3'][0] < a < x['Column3'][1]], axis=1))

Breaking the steps into more detail:

df = pd.DataFrame(
{'A' : [3.4,4],
'B' : [5.7, 1.7],
'Column1' : [[2.1, 2.9, 5.2, 6.8], [1.1, 2.5, 5.6, 11.5, 15.6, 21.5]],
'Column2' : [[2.5,3.4,1.2,5.1],[12.15,1.58,5.4,1.2,34.2,67.2]]})
df['Column3'] = df['Column1'].apply(lambda x: np.percentile(x, 50))
df['Column4'] = df['Column1'].apply(lambda x: np.percentile(x, 90))
df['Column5'] = df.apply(lambda x: [b for (a,b) in zip(x['Column1'], x['Column2']) 
                                    if x['Column3'] < a < x['Column4']], axis=1)
df
Out[1]: 
     A    B                            Column1  \
0  3.4  5.7               [2.1, 2.9, 5.2, 6.8]   
1  4.0  1.7  [1.1, 2.5, 5.6, 11.5, 15.6, 21.5]   

                               Column2  Column3  Column4      Column5  
0                 [2.5, 3.4, 1.2, 5.1]     4.05     6.32        [1.2]  
1  [12.15, 1.58, 5.4, 1.2, 34.2, 67.2]     8.55    18.55  [1.2, 34.2]

From there, you can do:

 df = df.drop(['Column3', 'Column4'], axis=1)

Using zip and percentile together - Python

1 Answers1