I have a data frame like the example below:

            0  1      2      3      4      5        6        7        8  
0      842517  M  20.57  17.77  132.9   1326  0.08474  0.07864   0.0869   
1    84300903  M  19.69  21.25    130   1203   0.1096   0.1599   0.1974   
2    84348301  M  11.42  20.38  77.58  386.1   0.1425   0.2839   0.2414   
3      843786  M  12.45   15.7  82.57  477.1   0.1278     0.17   0.1578   
4      844359  M  18.25  19.98  119.6   1040  0.09463    0.109   0.1127  

I am writing a function that should split the dataset into two data frames by comparing the values in a given column against a given value. For example, with col_idx = 2 and value = 18.3 the result should be:

df1 - below the value:

            0  1      2      3      4      5        6        7        8    
2    84348301  M  11.42  20.38  77.58  386.1   0.1425   0.2839   0.2414   
3      843786  M  12.45   15.7  82.57  477.1   0.1278     0.17   0.1578   
4      844359  M  18.25  19.98  119.6   1040  0.09463    0.109   0.1127 

df2 - above the value:

            0  1      2      3      4      5        6        7        8  
0      842517  M  20.57  17.77  132.9   1326  0.08474  0.07864   0.0869   
1    84300903  M  19.69  21.25    130   1203   0.1096   0.1599   0.1974   

The function should look like:

def split_dataset(data_set, col_idx, value):
    below_df = ?
    above_df = ?
    return below_df, above_df

Can anybody complete my script please?

Bella

2 Answers

below_df = data_set[data_set[col_idx] < value]
above_df = data_set[data_set[col_idx] > value]  # rows where data_set[col_idx] == value end up in neither frame; decide where they belong
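
One possible way to handle the equality case is to build a single boolean mask and use its negation, so that every row lands in exactly one frame. This is only a sketch under an assumption that is not in the question: rows equal to value are sent to the "above" frame. The small df here is a cut-down copy of the question's sample data with integer column labels.

```python
import pandas as pd

def split_dataset(data_set, col_idx, value):
    # True for rows strictly below the threshold
    mask = data_set[col_idx] < value
    below_df = data_set[mask]
    # Complement of the mask: rows >= value.
    # Assumption: ties go to the "above" frame. NaN rows also end up here,
    # because NaN < value evaluates to False.
    above_df = data_set[~mask]
    return below_df, above_df

# First three columns of the sample data, labelled with integers
df = pd.DataFrame({
    0: [842517, 84300903, 84348301, 843786, 844359],
    1: ["M"] * 5,
    2: [20.57, 19.69, 11.42, 12.45, 18.25],
})
below, above = split_dataset(df, 2, 18.3)
```

If ties should go to the "below" frame instead, the mask can be built with <= and the rest stays the same.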
Zachary822

You can use loc:

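def split_dataset(data_set, col_idx, value):
    # Use the data_set argument rather than a global df
    below_df = data_set.loc[data_set[col_idx] <= value]
    # Note: with <= and >= together, rows equal to value appear in both frames
    above_df = data_set.loc[data_set[col_idx] >= value]
    return below_df, above_df

df1, df2 = split_dataset(df, '2', 18.3)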

Output:

df1

          0  1      2      3       4       5        6       7       8
2  84348301  M  11.42  20.38   77.58   386.1  0.14250  0.2839  0.2414
3    843786  M  12.45  15.70   82.57   477.1  0.12780  0.1700  0.1578
4    844359  M  18.25  19.98  119.60  1040.0  0.09463  0.1090  0.1127

df2
          0  1      2      3      4       5        6        7       8
0    842517  M  20.57  17.77  132.9  1326.0  0.08474  0.07864  0.0869
1  84300903  M  19.69  21.25  130.0  1203.0  0.10960  0.15990  0.1974

Note:

In this function call the column names are numbers written as strings. Check the type of the column labels before calling the function: depending on how the data frame was built, you may have to pass the string '2' or the integer 2.

You should also decide what happens to rows whose entry in that column is exactly equal to value, that is, which of the two data frames should receive them.
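
If the label type is uncertain, one option is to select the column by position instead of by label. The sketch below uses a hypothetical helper, split_by_position (not part of the answer above), and assumes the column may have been read as text, so it is converted with pd.to_numeric first. Rows that cannot be converted become NaN and are left out of both frames.

```python
import pandas as pd

def split_by_position(data_set, col_pos, value):
    # iloc selects by integer position, so it works whether the label is 2 or '2'
    column = data_set.iloc[:, col_pos]
    # Convert text to numbers; unparseable entries become NaN
    column = pd.to_numeric(column, errors="coerce")
    # NaN compares False both ways, so such rows appear in neither frame
    below_df = data_set[column < value]
    above_df = data_set[column >= value]
    return below_df, above_df

# Sample data with string labels and a numeric column stored as text
df_str = pd.DataFrame({
    "0": [842517, 84300903, 84348301, 843786, 844359],
    "1": ["M"] * 5,
    "2": ["20.57", "19.69", "11.42", "12.45", "18.25"],
})
below_pos, above_pos = split_by_position(df_str, 2, 18.3)
```

Here 2 means "the third column", regardless of how that column is labelled.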

ansev