0

I have a data frame df_train which has a column sub_division.

The values in the column look like this:

ABC_commercial,
ABC_Private,
Test ROM DIV,
ROM DIV,
TEST SEC R&OM

I am trying to: 1. convert anything that starts with ABC to a number (for example, 1); 2. convert anything that contains ROM or R&OM to a number (for example, 2).

Thanks in advance.

Expected result:

1,
1,
2,
2,
2
Praveenkumar
Prdp
  • The magic word for you is label encoder: https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn – PV8 Jun 17 '19 at 07:02

3 Answers

1

Use numpy.select with Series.str.startswith and Series.str.contains:

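import numpy as np

# boolean masks for the two rules
m1 = df['col'].str.startswith('ABC')
m2 = df['col'].str.contains('ROM|R&OM')

# note: recent NumPy releases may raise a TypeError when integer choices
# are mixed with a string default; the all-numbers variant below avoids that
df['new'] = np.select([m1, m2], [1,2], default='no match')
#if need all numbers
#df['new'] = np.select([m1, m2], [1,2], default=0)
print(df)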
               col new
0  ABC_commercial,   1
1     ABC_Private,   1
2    Test ROM DIV,   2
3         ROM DIV,   2
4    TEST SEC R&OM   2
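
For reference, here is a minimal self-contained sketch of the same np.select approach, using the question's df_train and sub_division names and its sample values. The output column name code and the default of 0 for unmatched rows are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample values copied from the question.
df_train = pd.DataFrame({
    "sub_division": [
        "ABC_commercial",
        "ABC_Private",
        "Test ROM DIV",
        "ROM DIV",
        "TEST SEC R&OM",
    ]
})

# Boolean masks; na=False treats missing values as "no match".
starts_with_abc = df_train["sub_division"].str.startswith("ABC", na=False)
contains_rom = df_train["sub_division"].str.contains("ROM|R&OM", na=False)

# The first condition that is True wins; rows matching nothing get 0.
# "code" is a hypothetical column name chosen for this sketch.
df_train["code"] = np.select([starts_with_abc, contains_rom], [1, 2], default=0)
print(df_train["code"].tolist())  # [1, 1, 2, 2, 2]
```

Keeping the default numeric means the new column stays an integer column, which is usually what a downstream model expects.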
jezrael
0

You can do something like the code below. Note that you will get NaN for any value that matches neither rule. You can add an else case to the converter function to return a default value instead.

def converter(v):
    if v.startswith('ABC'):
        return 1
    elif any(i in v for i in ['ROM', 'R&OM']):
        return 2

df['sub_division'] = df['sub_division'].apply(converter)
print(df.head(10))

output:

   sub_division
0             1
1             1
2             2
3             2
4             2
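
The else case mentioned above could look roughly like the sketch below. The default parameter, its value of 0, the isinstance guard for non-string values, and the extra "OTHER" label are all assumptions added for illustration.

```python
def converter(value, default=0):
    """Map a sub_division label to a numeric code."""
    # Non-string values (e.g. NaN) have no startswith(), so they get the default.
    if not isinstance(value, str):
        return default
    if value.startswith("ABC"):
        return 1
    if "ROM" in value or "R&OM" in value:
        return 2
    # Fallback for labels that match neither rule.
    return default


# Sample labels from the question, plus one hypothetical non-matching label.
labels = ["ABC_commercial", "ABC_Private", "Test ROM DIV",
          "ROM DIV", "TEST SEC R&OM", "OTHER"]
codes = [converter(v) for v in labels]
print(codes)  # [1, 1, 2, 2, 2, 0]
```

It can be passed to apply exactly as before: df['sub_division'].apply(converter).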
Praveenkumar
0

You can use:

df.loc[df['col'].str.startswith('ABC'), 'col'] = 1
df.loc[df['col'].str.contains(r'ROM|R&OM', na=False), 'col'] = 2
Mykola Zotko