How to extract a part of column based on two other columns from a python dataframe

Question

# Build the frame from plain nested lists; np.array() rejects ragged rows in recent NumPy versions
df = pd.DataFrame([[[1740, 6920, 10120, 14300, 18220, 24500, 41300], 10000, 20000], [[1620, 5840, 12100, 15000, 25260, 26020], 5900, 15200]],
                  columns=['long_list', 'min', 'max'])

For this dataframe, I'm hoping to create a new column df['part'] that is the part of df['long_list'] that meets the condition of df['min']<df['part']<df['max']. I tried to use a lambda function but struggled with how to use all three columns. So the output would be

df2 = pd.DataFrame([[[1740, 6920, 10120, 14300, 18220, 24500, 41300], 10000, 20000, [10120, 14300, 18220]], [[1620, 5840, 12100, 15000, 25260, 26020], 5900, 15200, [12100, 15000]]],
                   columns=['long_list', 'min', 'max', 'part'])

score 4 · Answer 1 · answered Jan 06 '21 at 22:35

You can explode the long_list, query on the condition, and group back:

df['part'] = (df.explode('long_list')
                .query('min<long_list<max')
                .groupby(level=0)['long_list'].agg(list)
             )

Output:

    long_list                                          min    max  part
--  -----------------------------------------------  -----  -----  ---------------------
 0  [1740, 6920, 10120, 14300, 18220, 24500, 41300]  10000  20000  [10120, 14300, 18220]
 1  [1620, 5840, 12100, 15000, 25260, 26020]          5900  15200  [12100, 15000]
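
One caveat with this approach: a row whose list has no element strictly between min and max disappears after the filtering step, so the assignment leaves NaN in part for that row. Below is a minimal sketch of one way to get an empty list instead. The third row is made up for illustration, and a boolean mask stands in for query() so each step is explicit.

```python
import pandas as pd

df = pd.DataFrame({
    "long_list": [[1740, 6920, 10120, 14300, 18220, 24500, 41300],
                  [1620, 5840, 12100, 15000, 25260, 26020],
                  [1, 2, 3]],  # hypothetical row with no value in range
    "min": [10000, 5900, 100],
    "max": [20000, 15200, 200],
})

exploded = df.explode("long_list")

# Same strict condition as the query string: min < value < max
in_range = (exploded["long_list"] > exploded["min"]) & (exploded["long_list"] < exploded["max"])
part = exploded.loc[in_range].groupby(level=0)["long_list"].agg(list)

# Rows without any match are missing from `part`; reindex restores them as NaN,
# and the apply() call swaps each NaN for an empty list.
df["part"] = part.reindex(df.index).apply(lambda p: p if isinstance(p, list) else [])

print(df["part"].tolist())
```

Whether an empty list or NaN is the better placeholder depends on what consumes the column afterwards.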

Dani Mesejo · Answer 2 · 2021-01-06T22:36:05.720

You could keep everything in pandas by using explode + between and then groupby:

# explode
exploded = df2.explode('long_list')

# filter with between (both bounds are inclusive by default)
mask = exploded['long_list'].between(exploded['min'], exploded['max'])
filtered = exploded[mask]

# group filtered result
df3 = df2.assign(part= filtered.groupby(level=0)['long_list'].agg(list))
print(df3)

Output

                                         long_list  ...                   part
0  [1740, 6920, 10120, 14300, 18220, 24500, 41300]  ...  [10120, 14300, 18220]
1         [1620, 5840, 12100, 15000, 25260, 26020]  ...         [12100, 15000]

[2 rows x 4 columns]
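
Note that Series.between includes both endpoints by default, while the question asks for a strict min < value < max. The sample data has no value sitting exactly on a bound, so the results match here. A small sketch of the difference, assuming pandas 1.3 or later (where inclusive accepts "both", "neither", "left" and "right"); the numbers are made up so that values land on the boundaries:

```python
import pandas as pd

values = pd.Series([5, 10, 15, 20, 25])

# Default behaviour: 10 and 20 are kept because both bounds are inclusive
inclusive_mask = values.between(10, 20)

# Strict comparison: only values with 10 < value < 20 are kept
strict_mask = values.between(10, 20, inclusive="neither")

print(values[inclusive_mask].tolist())
print(values[strict_mask].tolist())
```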

zabop · Accepted Answer · 2021-01-06T22:33:31.840

You can create this new column using apply() and a conditional list comprehension:

import pandas as pd

df['part'] = df.apply(lambda row:
                      [each for each in row['long_list']
                       if row['min'] < each < row['max']], axis=1)

If you really want the result to be a different dataframe, copy the original first (a plain df2=df only gives the same object a second name):

df2 = df.copy()
df2['part'] = df2.apply(lambda row:
                        [each for each in row['long_list']
                         if row['min'] < each < row['max']], axis=1)
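
For reference, here is a self-contained sketch of the apply() approach that leaves the original frame untouched. The helper name values_in_range is hypothetical, chosen only for this example, and the frame is built from a dict rather than from np.array.

```python
import pandas as pd

df = pd.DataFrame({
    "long_list": [[1740, 6920, 10120, 14300, 18220, 24500, 41300],
                  [1620, 5840, 12100, 15000, 25260, 26020]],
    "min": [10000, 5900],
    "max": [20000, 15200],
})


def values_in_range(row):
    """Return the elements of row['long_list'] strictly between row['min'] and row['max']."""
    return [v for v in row["long_list"] if row["min"] < v < row["max"]]


# copy() gives an independent frame; a plain `df2 = df` would only
# bind a second name to the same object, so df would gain the column too.
df2 = df.copy()
df2["part"] = df2.apply(values_in_range, axis=1)

print("part" in df.columns)
print(df2["part"].tolist())
```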

score 0 · Answer 4 · answered Jan 06 '21 at 22:29

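Here is a loop that does what you want:

parts = []
for index, row in df2.iterrows():
    part = []
    for num in row['long_list']:
        # chained comparison; `&` binds tighter than `>` and `<`, so it would test the wrong thing here
        if row['min'] < num < row['max']:
            part.append(num)  # append() works in place and returns None
    parts.append(part)
df2['part'] = parts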

Paul Brennan


score 0 · Answer 5 · answered Jan 06 '21 at 22:49

Working with lists/sequence types within Pandas is not efficient; a list comprehension would do fine, similar to @zabop's answer:

zip_min_max = zip(df["min"], df["max"])
zipped = zip(df.long_list, zip_min_max)

df["part"] = [[val for val in left 
              if right_a < val < right_b]
              for left, (right_a, right_b) 
              in zipped]
