Label encode subgroups after groupby

Question

I want to label-encode subgroups in a pandas DataFrame, with the labels restarting at 1 within each category. Something like this:

| Category   | Name   |
| ---------- | ------ |
| FRUITS     | Apple  |
| FRUITS     | Orange |
| FRUITS     | Apple  |
| Vegetables | Onion  |
| Vegetables | Garlic |
| Vegetables | Garlic |

to

| Category   | Name   | Label |
| ---------- | ------ | ----- |
| FRUITS     | Apple  | 1     |
| FRUITS     | Orange | 2     |
| FRUITS     | Apple  | 1     |
| Vegetables | Onion  | 1     |
| Vegetables | Garlic | 2     |
| Vegetables | Garlic | 2     |
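
For anyone reproducing this, here is a minimal sketch that builds the input frame from the first table. The variable name `df` is an assumption, chosen to match what the answers use:

```python
import pandas as pd

# Input frame from the question; the name `df` is assumed, not given by the asker.
df = pd.DataFrame(
    {
        "Category": ["FRUITS", "FRUITS", "FRUITS",
                     "Vegetables", "Vegetables", "Vegetables"],
        "Name": ["Apple", "Orange", "Apple", "Onion", "Garlic", "Garlic"],
    }
)
print(df)
```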

score 1 · Answer 1 · answered Nov 05 '22 at 21:12


Group by "Category", then within each group, group by "Name" and use .ngroup():

df["Label"] = (
    df.groupby("Category")
    .apply(lambda x: x.groupby("Name", sort=False).ngroup() + 1)
    .values
)
print(df)

Prints:

     Category    Name  Label
0      FRUITS   Apple      1
1      FRUITS  Orange      2
2      FRUITS   Apple      1
3  Vegetables   Onion      1
4  Vegetables  Garlic      2
5  Vegetables  Garlic      2

Andrej Kesely

I tried this but got a size mismatch error: ValueError: Length of values (21379) does not match length of index (21436) – rohit deraj Nov 05 '22 at 21:32

@rohitderaj Do you have `NaN`s in the `Category` column? Try cleaning the NaNs out of your dataframe first. – Andrej Kesely Nov 05 '22 at 21:35
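
One possible cause of such a length mismatch, as the comment suggests, is that `groupby` leaves out rows whose key is `NaN` by default, so the grouped result has fewer values than the frame has rows. The sketch below uses made-up data (not the commenter's) to show the effect, followed by one way to clean the frame before labeling:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Category, for illustration only.
df = pd.DataFrame(
    {
        "Category": ["FRUITS", "FRUITS", np.nan, "Vegetables", "Vegetables"],
        "Name": ["Apple", "Orange", "Apple", "Onion", "Garlic"],
    }
)

# By default, groupby skips rows whose key is NaN, so the group sizes
# add up to fewer rows than the frame contains.
n_grouped = int(df.groupby("Category").size().sum())
print(n_grouped, len(df))

# Drop the rows with a missing Category, then label within each group.
clean = df.dropna(subset=["Category"]).copy()
clean["Label"] = clean.groupby("Category")["Name"].transform(
    lambda s: pd.factorize(s)[0] + 1
)
print(clean)
```

Whether dropping those rows is acceptable depends on the data; filling the missing categories with a placeholder value before grouping is another option.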

score 0 · Answer 2 · answered Nov 06 '22 at 06:15

You can use factorize per group:

df['Label'] = (df.groupby('Category')['Name']
               .transform(lambda x: pd.factorize(x)[0])
               .add(1)
               )

Output:

     Category    Name  Label
0      FRUITS   Apple      1
1      FRUITS  Orange      2
2      FRUITS   Apple      1
3  Vegetables   Onion      1
4  Vegetables  Garlic      2
5  Vegetables  Garlic      2
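
Because `transform` returns a result aligned to the original index, this approach should not depend on the rows being sorted by category. Assigning `.values` from a grouped `apply`, as in the other answer, relies on the row order matching the group order. A small sketch with interleaved rows (made-up ordering of the question's data):

```python
import pandas as pd

# Same kind of data as in the question, with the categories interleaved.
df = pd.DataFrame(
    {
        "Category": ["Vegetables", "FRUITS", "Vegetables", "FRUITS", "FRUITS"],
        "Name": ["Onion", "Apple", "Garlic", "Orange", "Apple"],
    }
)

# factorize numbers the names in order of first appearance within each group;
# transform puts each label back on the row it came from.
df["Label"] = (
    df.groupby("Category")["Name"]
    .transform(lambda s: pd.factorize(s)[0])
    .add(1)
)
print(df)
```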
