I have a data frame like this:

Name  subname Feature1  Feature2 ...
AAA     a     0.123     0.345 ...
AAA     b     0.123     0.345 ...
BBB     a     0.123     0.345 ...
BBB     b     0.123     0.345 ...

I want to create labels (adding a new column):

Name  subname Feature1  Feature2 ... Class
AAA     a     0.123     0.345 ...    1
AAA     b     0.123     0.345 ...    1
BBB     a     0.123     0.345 ...    2
BBB     b     0.123     0.345 ...    2

I want to fit this data to a classification model. Is there an efficient way to create these labels? I have more than 5000 rows. Many thanks.

Cecilia
  • What are you encoding off of, just `Name`? If so, `df['Class'] = pd.factorize(df['Name'])[0] + 1`. If it is a 2D factorization, you can use `np.unique` with `return_inverse` – user3483203 Jul 22 '19 at 15:48
  • Do you have the labels in a separate DataFrame or Series, or something else? – adnanmuttaleb Jul 22 '19 at 15:50
  • Yes, only according to `Name`. I want the labels in the last column so that I can fit a GBDT model to select features. Is that the proper way to do it? – Cecilia Jul 22 '19 at 15:51
  • Both answers in the dupe are perfectly valid, you can choose which to use – user3483203 Jul 22 '19 at 15:53
  • @user3483203 I checked some tutorials on Google, can I use something like `from sklearn.preprocessing import LabelEncoder`? Would this be the same? Many thanks. – Cecilia Jul 22 '19 at 16:07

1 Answer

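You can try `pd.factorize`:

# labels holds one integer code per row; uniques holds the distinct names
labels, uniques = pd.factorize(df['Name'])
df['labels'] = labels

For the sample data, `labels` is then `array([0, 0, 1, 1])`. The codes start at 0, so use `labels + 1` if the classes should be numbered from 1 as in the question.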
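
A minimal, self-contained sketch of this approach, assuming the four sample rows from the question. The `Class` column name follows the question; `codes` and `inverse` are just illustrative variable names. It also shows the `np.unique` with `return_inverse` route mentioned in the comments:

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question (assumed values, for illustration only).
df = pd.DataFrame({
    "Name": ["AAA", "AAA", "BBB", "BBB"],
    "subname": ["a", "b", "a", "b"],
    "Feature1": [0.123, 0.123, 0.123, 0.123],
    "Feature2": [0.345, 0.345, 0.345, 0.345],
})

# factorize returns one integer code per row plus the distinct values.
# Codes start at 0, so add 1 to number the classes from 1.
codes, uniques = pd.factorize(df["Name"])
df["Class"] = codes + 1

# Alternative: np.unique with return_inverse gives codes into the
# sorted array of distinct names.
_, inverse = np.unique(df["Name"].to_numpy(), return_inverse=True)

print(df)
```

By default `pd.factorize` numbers the names in order of first appearance; passing `sort=True` numbers them in sorted order instead, which is the order `np.unique` uses. Both calls are vectorized, so a frame with a few thousand rows should not be a concern.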

Lumos