Pandas: how to select rows in data frame based on condition of a specific value on a specific column

Question

I have a data frame like the example below:

            0  1      2      3      4      5        6        7        8  
0      842517  M  20.57  17.77  132.9   1326  0.08474  0.07864   0.0869   
1    84300903  M  19.69  21.25    130   1203   0.1096   0.1599   0.1974   
2    84348301  M  11.42  20.38  77.58  386.1   0.1425   0.2839   0.2414   
3      843786  M  12.45   15.7  82.57  477.1   0.1278     0.17   0.1578   
4      844359  M  18.25  19.98  119.6   1040  0.09463    0.109   0.1127

I am writing a function that should split the data set into two data frames by comparing the values in a given column with a given value. For example, with col_idx = 2 and value = 18.3 the result should be:

df1 - below the value:

            0  1      2      3      4      5        6        7        8    
2    84348301  M  11.42  20.38  77.58  386.1   0.1425   0.2839   0.2414   
3      843786  M  12.45   15.7  82.57  477.1   0.1278     0.17   0.1578   
4      844359  M  18.25  19.98  119.6   1040  0.09463    0.109   0.1127

df2 - above the value:

            0  1      2      3      4      5        6        7        8  
0      842517  M  20.57  17.77  132.9   1326  0.08474  0.07864   0.0869   
1    84300903  M  19.69  21.25    130   1203   0.1096   0.1599   0.1974

The function should look like:

def split_dataset(data_set, col_idx, value):
    below_df = ?
    above_df = ?
    return below_df, above_df

Can anybody complete my script please?

score 1 · Accepted Answer · answered Sep 08 '19 at 10:04

below_df = data_set[data_set[col_idx] < value]
above_df = data_set[data_set[col_idx] > value]  # you have to deal with data_set[col_idx] == value though
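
For reference, here is a minimal sketch of the completed function. It assumes that rows equal to value should go to the "above" frame; that choice is not stated in the question, so adjust the comparison if you need the opposite. The small sample frame only reproduces the first three columns of the question's data and uses integer column labels, which is also an assumption.

```python
import pandas as pd


def split_dataset(data_set, col_idx, value):
    # Rows strictly below the threshold.
    below_df = data_set[data_set[col_idx] < value]
    # Assumption: rows equal to `value` are grouped with the rows above it.
    above_df = data_set[data_set[col_idx] >= value]
    return below_df, above_df


# Sample data: first three columns of the frame in the question.
df = pd.DataFrame({
    0: [842517, 84300903, 84348301, 843786, 844359],
    1: ["M"] * 5,
    2: [20.57, 19.69, 11.42, 12.45, 18.25],
})

below_df, above_df = split_dataset(df, 2, 18.3)
print(below_df.index.tolist())  # [2, 3, 4]
print(above_df.index.tolist())  # [0, 1]
```

With `<` and `>=` every row whose value is not NaN lands in exactly one of the two frames.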

Zachary822


ansev · Answer 2 · 2019-09-08T11:33:14.593

You can use loc:

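def split_dataset(data_set, col_idx, value):
    # Select rows with .loc and a boolean condition on the chosen column.
    # Rows equal to value match both conditions and appear in both results.
    below_df = data_set.loc[data_set[col_idx] <= value]
    above_df = data_set.loc[data_set[col_idx] >= value]
    return below_df, above_df

df1, df2 = split_dataset(df, '2', 18.3)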

Output:

df1

          0  1      2      3       4       5        6       7       8
2  84348301  M  11.42  20.38   77.58   386.1  0.14250  0.2839  0.2414
3    843786  M  12.45  15.70   82.57   477.1  0.12780  0.1700  0.1578
4    844359  M  18.25  19.98  119.60  1040.0  0.09463  0.1090  0.1127

df2
          0  1      2      3      4       5        6        7       8
0    842517  M  20.57  17.77  132.9  1326.0  0.08474  0.07864  0.0869
1  84300903  M  19.69  21.25  130.0  1203.0  0.10960  0.15990  0.1974

Note:

In this call the column names look like numbers, but they may be stored either as integers or as strings. Check the type of the column labels (for example with df.columns) before calling the function: pass '2' if the labels are strings and 2 if they are integers.

You should also decide what happens when a row's entry in that column is exactly equal to value. With <= and >= as above, such a row ends up in both data frames.
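
On the column-label point, the following sketch shows how the same data can end up with integer or string labels. The CSV text is a made-up three-row excerpt of the question's data, and loading it through read_csv with header=None is an assumption about how the frame was created.

```python
import io

import pandas as pd

# Hypothetical excerpt of the data, without a header row.
raw = "842517,M,20.57\n84300903,M,19.69\n84348301,M,11.42\n"

# header=None makes pandas number the columns with integers.
df_int = pd.read_csv(io.StringIO(raw), header=None)
print(df_int.columns.tolist())  # [0, 1, 2]

# The same frame with its labels converted to strings.
df_str = df_int.copy()
df_str.columns = df_str.columns.astype(str)
print(df_str.columns.tolist())  # ['0', '1', '2']

# The label passed to the selection must match the stored type.
below_int = df_int[df_int[2] < 18.3]
below_str = df_str[df_str["2"] < 18.3]
```

Using the wrong label type (for example df_int["2"]) raises a KeyError, so inspecting df.columns first saves a confusing error.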
