
I have this dataframe:

+------+--------------+------------+
| ID   | Education    |      Score | 
+------+--------------+------------+
|    1 |  High School |      7.884 |     
|    2 |  Bachelors   |      6.952 |     
|    3 |  High School |      8.185 |   
|    4 |  High School |      6.556 | 
|    5 |  Bachelors   |      6.347 | 
|    6 |  Master      |      6.794 |   
+------+--------------+------------+

I want to create a new column that categorizes the Score column with one of three labels: 'Bad', 'Good', or 'Very good'.

The result might look like this:

+------+--------------+------------+------------+
| ID   | Education    |      Score | Labels     |
+------+--------------+------------+------------+
|    1 |  High School |      7.884 | Good       |
|    2 |  Bachelors   |      6.952 | Bad        |
|    3 |  High School |      8.185 | Very good  |   
|    4 |  High School |      6.556 | Bad        |
|    5 |  Bachelors   |      6.347 | Bad        |
|    6 |  Master      |      6.794 | Bad        |
+------+--------------+------------+------------+

How can I do that?

Thanks in advance

marc_s
ebuzz168

3 Answers

import pandas as pd 

# initialize list of lists 
data = [[1,'High School',7.884], [2,'Bachelors',6.952], [3,'High School',8.185], [4,'High School',6.556],[5,'Bachelors',6.347],[6,'Master',6.794]] 

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['ID', 'Education', 'Score']) 

df['Labels'] = ['Bad' if x < 7 else 'Good' if x < 8 else 'Very Good' for x in df['Score']]
df

    ID  Education    Score    Labels
0   1   High School  7.884    Good
1   2   Bachelors    6.952    Bad
2   3   High School  8.185    Very Good
3   4   High School  6.556    Bad
4   5   Bachelors    6.347    Bad
5   6   Master       6.794    Bad
Prathik Kini
    just a tip: `df['labels']=np.select([df['Score']<7,df['Score'].between(7,8)],['Bad','Good'],'Very Good')` , `np.select` would work in a vectorized way so faster :) – anky Jan 08 '20 at 10:02
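The `np.select` tip from the comment above can be written out as a runnable sketch. It assumes the same cut-offs as the answer (below 7 is 'Bad', 7 up to but not including 8 is 'Good', the rest is 'Very Good'). Note that `Series.between(7, 8)` includes both endpoints by default, so a score of exactly 8 would be labelled 'Good' there; the plain comparisons below keep 8 in 'Very Good', matching the list comprehension.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Education": ["High School", "Bachelors", "High School",
                  "High School", "Bachelors", "Master"],
    "Score": [7.884, 6.952, 8.185, 6.556, 6.347, 6.794],
})

score = df["Score"]

# Conditions are evaluated in order and the first True one wins,
# so the second condition only applies to scores that are >= 7.
conditions = [score < 7, score < 8]
choices = ["Bad", "Good"]

# Rows matching no condition get the default label.
df["Labels"] = np.select(conditions, choices, default="Very Good")
print(df)
```

Because the whole column is processed at once, this tends to be faster than a Python-level loop on large frames.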

I suppose it is the score you would like to map to the labels. You could define a mapping function taking score as input and then returning the label:

def map_score(score):
  if score >= 8:
    return "Very good"
  elif score >= 7:
    return "Good"
  else:
    return "Bad"

df["Labels"] = df["Score"].apply(map_score)
Asgeer

Here is my solution. I aimed to avoid if-else chains and make the solution more flexible.

The main idea is to create a DataFrame of labels with their minimum and maximum values, and then find the matching label for each score.

The code:

import pandas as pd


class Label(object):
    name = ''
    min = 0
    max = 100

    def __init__(self, name, min, max):
        self.name = name
        self.min = min
        self.max = max

    def data(self):
        return [self.name, self.min, self.max]


class Labels:
    labels = [
        Label('Bad', 0, 7).data(),
        Label('Good', 7, 8).data(),
        Label('Very good', 8, 100).data()]

    labels_df = pd.DataFrame(labels, columns=['Label', 'Min', 'Max'])

    @staticmethod
    def get_label(score):
        lbs = Labels.labels_df
        tlab = lbs[(lbs.Min <= score) & (lbs.Max > score)]
        return tlab.Label.values[0]


class edu:
    hs = 'High School'
    b = 'Bachelors'
    m = 'Master'


df = pd.DataFrame({
        'ID': range(6),
        'Education': [edu.hs, edu.b, edu.hs, edu.hs, edu.b, edu.m],
        'Score': [7.884, 6.952, 8.185, 6.556, 6.347, 6.794]})

df['Label'] = df['Score'].apply(Labels.get_label)

print(df)

Output:

   ID    Education  Score      Label
0   0  High School  7.884       Good
1   1    Bachelors  6.952        Bad
2   2  High School  8.185  Very good
3   3  High School  6.556        Bad
4   4    Bachelors  6.347        Bad
5   5       Master  6.794        Bad
Roman
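The same "label with a minimum and a maximum" idea could also be sketched with `pd.cut`, which is a different technique from the class-based lookup above. The bin edges 0, 7, 8 and 100 are taken from that answer's `Label` definitions; `right=False` is chosen here so that each interval includes its lower edge and excludes its upper edge, mirroring the `Min <= score < Max` test.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Education": ["High School", "Bachelors", "High School",
                  "High School", "Bachelors", "Master"],
    "Score": [7.884, 6.952, 8.185, 6.556, 6.347, 6.794],
})

# Edges define the intervals [0, 7), [7, 8) and [8, 100).
bins = [0, 7, 8, 100]
labels = ["Bad", "Good", "Very good"]

# right=False closes each interval on the left instead of the right.
df["Label"] = pd.cut(df["Score"], bins=bins, labels=labels, right=False)
print(df)
```

The new column has a categorical dtype, and scores outside the range covered by the bins come back as NaN rather than raising an error. Adding a category only means adding one edge and one label.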