Aggregate and Convert categorical data to numbers

Question

I have a data frame df_train which has a column sub_division.

The values in the column look like this:

ABC_commercial,
ABC_Private,
Test ROM DIV,
ROM DIV,
TEST SEC R&OM

I am trying to:
1. convert anything that starts with ABC to a number (for example, 1)
2. convert anything that contains ROM or R&OM to a number (for example, 2)

Thanks in advance.

Expected result:

1,
1,
2,
2,
2

the magic word is called Label encoder for you: https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn — PV8, Jun 17 '19 at 07:02

score 1 · Answer 1 · answered Jun 17 '19 at 07:04

Use numpy.select with Series.str.startswith and Series.str.contains:

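import numpy as np

# Boolean masks, one per rule
m1 = df['col'].str.startswith('ABC')
m2 = df['col'].str.contains('ROM|R&OM')

# The first matching condition wins; unmatched rows get the default.
# A numeric default keeps the column numeric. A string default such as
# 'no match' mixes dtypes and may raise a TypeError on recent NumPy versions.
df['new'] = np.select([m1, m2], [1, 2], default=0)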
print (df)
               col new
0  ABC_commercial,   1
1     ABC_Private,   1
2    Test ROM DIV,   2
3         ROM DIV,   2
4    TEST SEC R&OM   2
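
As a self-contained check of the approach above, here is a minimal sketch that rebuilds the sample column and applies numpy.select. The names df_train and sub_division come from the question; the output column sub_division_code, the extra unmatched row, and the default of 0 are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

df_train = pd.DataFrame({
    "sub_division": [
        "ABC_commercial",
        "ABC_Private",
        "Test ROM DIV",
        "ROM DIV",
        "TEST SEC R&OM",
        "Something else",  # extra row, not in the question, to show the default
    ]
})

# One boolean mask per rule
starts_abc = df_train["sub_division"].str.startswith("ABC")
has_rom = df_train["sub_division"].str.contains("ROM|R&OM")

# Conditions are checked in order; the first one that is True wins.
df_train["sub_division_code"] = np.select(
    [starts_abc, has_rom], [1, 2], default=0
)

codes = df_train["sub_division_code"].tolist()
print(codes)
```

Because the conditions are evaluated in order, a value that both starts with ABC and contains ROM would be coded as 1 in this sketch.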

score 0 · Answer 2 · answered Jun 17 '19 at 07:11

You can do something like the following. Note that you will get NaN wherever there is no match; add an else branch to the converter function to return a default value instead.

def converter(v):
    if v.startswith('ABC'):
        return 1
    elif any(i in v for i in ['ROM', 'R&OM']):
        return 2

df['sub_division'] = df['sub_division'].apply(converter)
print(df.head(10))

output:

   sub_division
0             1
1             1
2             2
3             2
4             2
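
The else case mentioned above could look roughly like this. It is only a sketch: the default parameter, its value of 0, and the guard against non-string values are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

def converter(value, default=0):
    """Map a sub-division name to a group code; `default` is a hypothetical fallback."""
    if not isinstance(value, str):
        # Missing values (None/NaN) have no string methods, so fall back early.
        return default
    if value.startswith("ABC"):
        return 1
    if "ROM" in value or "R&OM" in value:
        return 2
    return default

s = pd.Series(["ABC_commercial", "Test ROM DIV", "TEST SEC R&OM", "Other", None])
result = s.apply(converter).tolist()
print(result)
```

With the fallback in place every row gets an integer, so the resulting column does not turn into floats with NaN.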

score 0 · Answer 3 · answered Jun 17 '19 at 08:17

You can use:

df.loc[df['col'].str.startswith('ABC'), 'col'] = 1
df.loc[df['col'].str.contains(r'ROM|R&OM', na=False), 'col'] = 2
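
One variation on the .loc approach, sketched under a few assumptions: the masks are computed from the original strings first, and the codes go into a separate column (here called code, a hypothetical name) with an assumed default of 0, so the source column never ends up holding a mix of strings and integers.

```python
import pandas as pd

df = pd.DataFrame({
    "col": ["ABC_commercial", "ABC_Private", "Test ROM DIV", "ROM DIV", "TEST SEC R&OM"]
})

# Compute both masks before assigning anything.
is_abc = df["col"].str.startswith("ABC", na=False)
is_rom = df["col"].str.contains(r"ROM|R&OM", na=False)

df["code"] = 0               # hypothetical default for unmatched rows
df.loc[is_rom, "code"] = 2
df.loc[is_abc, "code"] = 1   # assigned last so the ABC rule takes priority

code_values = df["code"].tolist()
print(code_values)
```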

Mykola Zotko
