0

I have a dataset in a DataFrame in which the first column is the text and the second is the author. The authors are the labels for a classification task, and I want to convert this column into numbers.

I tried to use the following code from "How to convert string labels to numeric values":

train['author'].apply(train['author'].index)

but it's not working. The output is

Int64Index object is not callable

Could you please help me?

marc_s

3 Answers

3

IIUC, you're trying to create numerical categories for each author. If so, try:

train["codes"] = train["author"].astype("category").cat.codes
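
To see what the codes look like, here is a minimal sketch on a made-up frame (the texts and author names are hypothetical). Categories are sorted by default, so the codes follow the alphabetical order of the author names rather than their order of appearance.

```python
import pandas as pd

# Hypothetical toy data standing in for the real training set
train = pd.DataFrame({
    "text": ["a raven", "a ball", "a heart", "a ghost"],
    "author": ["poe", "austen", "poe", "dickens"],
})

# Sorted categories: austen -> 0, dickens -> 1, poe -> 2
train["codes"] = train["author"].astype("category").cat.codes
print(train["codes"].tolist())  # [2, 0, 2, 1]
```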

If you then want to apply the same codes to other datasets, you could do:

mapper = train.set_index('author')["codes"].to_dict()
validation["codes"] = validation["author"].map(mapper)
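
One caveat worth checking, sketched here on hypothetical data: an author that appears only in the validation set has no entry in mapper, so map leaves NaN in that row (and the column becomes float).

```python
import pandas as pd

# Hypothetical toy data; "shelley" never appears in the training set
train = pd.DataFrame({"author": ["poe", "austen", "poe", "dickens"]})
validation = pd.DataFrame({"author": ["austen", "poe", "shelley"]})

train["codes"] = train["author"].astype("category").cat.codes
mapper = train.set_index("author")["codes"].to_dict()
validation["codes"] = validation["author"].map(mapper)

# Authors with no code in the training mapping end up as NaN
unseen = validation.loc[validation["codes"].isna(), "author"].tolist()
print(mapper)
print(unseen)
```

If unseen is not empty, those rows need to be handled before training, for example by dropping them or by building the mapping from the full label list.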
not_speshal
  • Thank you so much! If I want to convert the validation set with the same encoding, how can I do it? Thank you so much for your help! –  Sep 10 '21 at 18:05
  • I was thinking that I could create an encoding using df["author"].astype("category").cat.codes, where df["author"] contains all the labels, and then convert the training, validation, and testing sets with the same encoding. –  Sep 10 '21 at 18:07
  • @JohnAngelopoulos - See the edit. Is that what you need? – not_speshal Sep 10 '21 at 18:12
  • Yeah, thanks a lot. You could also replace the names with integers without creating a new column, by just doing validation["author"] = validation["author"].map(mapper). I think there isn't any problem with doing that, right? –  Sep 11 '21 at 09:59
  • 1
    Yes, but if you don't want to overwrite your underlying data, it's better to create a new column (it also lets you check that everything maps correctly). – not_speshal Sep 11 '21 at 12:37
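
The idea from the comments, defining one encoding from the full label list and reusing it for every split, could be sketched with a shared CategoricalDtype. The frames and the all_authors list below are hypothetical; any label outside that list gets the code -1.

```python
import pandas as pd

# Hypothetical full list of labels, fixed once for all splits
all_authors = ["austen", "dickens", "poe"]
author_dtype = pd.CategoricalDtype(categories=all_authors)

train = pd.DataFrame({"author": ["poe", "austen", "poe"]})
validation = pd.DataFrame({"author": ["dickens", "poe", "shelley"]})

# The same dtype gives the same author -> code assignment in every frame
for df in (train, validation):
    df["codes"] = df["author"].astype(author_dtype).cat.codes

print(train["codes"].tolist())       # [2, 0, 2]
print(validation["codes"].tolist())  # [1, 2, -1]
```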
0
train['author'].apply(train['author'].tolist().index)
phi friday
  • Thank you so much. One more quick question: if I then want to convert the validation set with the same encoding, I should do train['author'].apply(validation['author'].tolist().index), right? –  Sep 10 '21 at 16:54
  • if validation.author has all elements of train.author, – phi friday Sep 10 '21 at 23:10
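
For context, the original call failed because train['author'].index is the Series' Index object, which apply cannot call, while list.index is a method that returns the position of the first match. A small sketch on hypothetical data shows two consequences: the resulting numbers are first-occurrence positions, so they can skip values, and reusing the training list on another split raises ValueError for any author that is not in it. pd.factorize appears only as a comparison and is not part of the answer above.

```python
import pandas as pd

# Hypothetical toy data; "shelley" is missing from the training set
train = pd.DataFrame({"author": ["poe", "austen", "poe", "dickens"]})
validation = pd.DataFrame({"author": ["dickens", "shelley"]})

# Bound list method, so it is callable (unlike Series.index)
lookup = train["author"].tolist().index
positions = train["author"].apply(lookup).tolist()
print(positions)  # [0, 1, 0, 3]

# For comparison: factorize yields contiguous codes in order of appearance
codes, uniques = pd.factorize(train["author"])
print(codes.tolist())  # [0, 1, 0, 2]

# Looking up a validation author that the training list lacks fails
try:
    validation["author"].apply(lookup)
    failed = False
except ValueError:
    failed = True
print(failed)  # True
```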
0

If you want, there is also the built-in ord().

for ch in author:
    print(ord(ch))

It takes a single character, so you'll have to convert each letter individually.

TSirico