How to assign a number for every unique element?

Question

I have a data frame like this:

No.     Alphabet
 1.       A
 2.       B
 3.       A
 4.       A
 5.       C                 
 6.       B
 7.       C

Now I want to add a new column, Outcome, that assigns a number to every unique element. The final table would be:

No.     Alphabet   Outcome
 1.       A           1
 2.       B           2
 3.       A           1
 4.       A           1    
 5.       C           3                     
 6.       B           2 
 7.       C           3

How can I achieve that with R?

May I ask why you want to do this? It's possible that you're doing this as an intermediate step in something and you might be able to skip it entirely if we know what you're really trying to do. — Dason, May 01 '15 at 19:40
@Dason I want to apply KNN algo to this. This is not exactly my data frame, but I have recreated it to make it simple. As KNN does not accept character inputs, I need to convert these variables into numbers and now I can apply KNN on it. — Ronak Shah, May 02 '15 at 04:49
I believe if you store it as factor instead of character (and in this case it really probably should be a factor anyways) that you don't need to do an explicit conversion to numeric. This would be the better approach to take. — Dason, May 02 '15 at 17:35

score 6 · Accepted Answer · answered May 01 '15 at 19:17

You can use as.numeric(factor(.)), like this:

> Letter <- c("A", "A", "B", "C", "B", "A")
> as.numeric(factor(Letter))
[1] 1 1 2 3 2 1

To store the result as a column, use the standard mydf$outcome <- ... assignment or whichever approach you prefer.
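As a minimal sketch of that column assignment, the data frame from the question can be rebuilt by hand and given the new column in one line. The name mydf and the column name Outcome are assumptions chosen to match the question's table.

```r
# Hypothetical reconstruction of the data frame from the question
mydf <- data.frame(
  No = 1:7,
  Alphabet = c("A", "B", "A", "A", "C", "B", "C"),
  stringsAsFactors = FALSE
)

# factor() gives every distinct value one integer code;
# as.numeric() extracts those codes as plain numbers
mydf$Outcome <- as.numeric(factor(mydf$Alphabet))

print(mydf$Outcome)
# [1] 1 2 1 1 3 2 3
```

If whole-number storage matters, as.integer() can be used in place of as.numeric() here.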

A5C1D2H2I1M1N2O1R2T1

Mine's faster =P (+1) – BrodieG May 01 '15 at 19:30
@BrodieG, don't make me whip out `fmatch` now.... – A5C1D2H2I1M1N2O1R2T1 May 01 '15 at 19:33

akrun · Answer 2 · 2015-05-01T20:03:14.593

You could also do

library(data.table)
setDT(df1)[, Outcome:= .GRP, Alphabet][]
#    No. Alphabet Outcome
#1:   1        A       1
#2:   2        B       2
#3:   3        A       1
#4:   4        A       1
#5:   5        C       3
#6:   6        B       2
#7:   7        C       3

Benchmarks

library(fastmatch)
set.seed(24)
df2 <- data.frame(No = 1:1e7, Alphabet= sample(LETTERS, 1e7, 
            replace=TRUE), stringsAsFactors=FALSE)
df3 <- copy(df2)
Ananda <- function() {transform(df2, 
             outcome = as.numeric(factor(df2$Alphabet)))}
Brodie <- function() {transform(df2, outcome=match(Alphabet, Alphabet))}
Brodie2 <- function(){transform(df2, outcome=fmatch(Alphabet, Alphabet))}

akrun <- function() {setDT(df3)[, Outcome:= .GRP, Alphabet][]}

library(microbenchmark)
microbenchmark(Ananda(), Brodie(), Brodie2(), akrun(), 
                    unit='relative', times=20L)
# Unit: relative
#    expr      min       lq     mean   median       uq      max neval cld
# Ananda() 4.957064 5.150724 4.427514 4.971581 3.336064 4.622502    20   c
# Brodie() 4.473689 5.074105 4.838985 5.383722 4.641304 4.383919    20   c
#Brodie2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20 a  
#  akrun() 1.609863 2.047646 1.665557 1.949590 1.331554 1.290921    20  b 


 system.time(akrun())
 #  user  system elapsed 
 # 0.197   0.005   0.202 

 system.time(Brodie2())
 #  user  system elapsed 
 # 0.081   0.014   0.095

Thanks for running through all the benchmarks, very helpful (+1, though you got that a while ago). — BrodieG, May 01 '15 at 20:55

score 3 · Answer 3 · edited May 01 '15 at 20:04

Another option (for fun) using match:

match(Alphabet, Alphabet)

match returns the position of the first occurrence of each value, so every distinct element gets its own number, though the numbers will not be the consecutive run 1:26. If they must be exactly 1:26, and not just unique:

match(Alphabet, unique(Alphabet))
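The difference between the two calls is easiest to see on the seven values from the question. This is a small sketch; the variable names first_pos and dense_ids are only illustrative.

```r
Alphabet <- c("A", "B", "A", "A", "C", "B", "C")

# Position of each value's first occurrence: unique per value, but with gaps
first_pos <- match(Alphabet, Alphabet)
print(first_pos)
# [1] 1 2 1 1 5 2 5

# Position within the distinct values: consecutive ids in order of first appearance
dense_ids <- match(Alphabet, unique(Alphabet))
print(dense_ids)
# [1] 1 2 1 1 3 2 3
```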

To actually do what you want (adding a column in data frame, etc.):

transform(DF, outcome=match(Alphabet, Alphabet))

Or

transform(DF, outcome=match(Alphabet, unique(Alphabet)))

Or you can use fmatch from the fastmatch package, a faster version of match:

library(fastmatch)
transform(DF, outcome=fmatch(Alphabet, unique(Alphabet)))
#  No. Alphabet outcome
#1   1        A       1
#2   2        B       2
#3   3        A       1
#4   4        A       1
#5   5        C       3
#6   6        B       2
#7   7        C       3

Plain match is actually a little faster than the factor version:

> x <- sample(letters, 1e5, replace=TRUE)
> library(microbenchmark)
> microbenchmark(as.numeric(factor(x)), match(x, x))
Unit: milliseconds
                  expr     min       lq     mean   median       uq      max neval
 as.numeric(factor(x)) 4.68927 4.792212 9.042732 4.915268 5.175275 64.65473   100
           match(x, x) 3.55855 3.617609 6.981944 3.731522 3.922048 53.07911   100

This is most likely because factor internally does something like match(x, unique(x)) anyway, plus extra work such as sorting the levels.
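The relationship between factor and match can be checked on a small vector whose values do not first appear in sorted order. The sketch below rests on the documented behaviour that factor sorts its levels by default; it does not show factor's actual internals.

```r
x <- c("C", "A", "B", "A", "C")

# factor() sorts its levels by default, so codes follow alphabetical order
factor_codes <- as.integer(factor(x))
print(factor_codes)
# [1] 3 1 2 1 3

# Matching against the sorted distinct values reproduces the factor codes
sorted_match <- match(x, sort(unique(x)))
print(sorted_match)
# [1] 3 1 2 1 3

# Matching against unique(x) numbers values by order of first appearance instead
appearance_match <- match(x, unique(x))
print(appearance_match)
# [1] 1 2 3 2 1
```

Both numberings give one id per distinct value. They differ only in which value gets which number, which rarely matters when the ids are just labels.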

score 2 · Answer 4 · answered May 01 '15 at 19:19

Let's say your data frame is called dat. Then you can do

dat$Outcome <- as.numeric(as.factor(dat$Alphabet))

blakeoft

