2

I have a data frame like this:

No.     Alphabet
 1.       A
 2.       B
 3.       A
 4.       A
 5.       C                 
 6.       B
 7.       C

Now I want to add a new column, Outcome, that assigns a distinct number to each unique element, so the final table would be:

No.     Alphabet   Outcome
 1.       A           1
 2.       B           2
 3.       A           1
 4.       A           1    
 5.       C           3                     
 6.       B           2 
 7.       C           3

How can I achieve that with R?

Ronak Shah
  • May I ask why you want to do this? It's possible that you're doing this as an intermediate step in something and you might be able to skip it entirely if we know what you're really trying to do. – Dason May 01 '15 at 19:40
  • @Dason I want to apply KNN algo to this. This is not exactly my data frame, but I have recreated it to make it simple. As KNN does not accept character inputs, I need to convert these variables into numbers and now I can apply KNN on it. – Ronak Shah May 02 '15 at 04:49
  • Which function from which package are you using for knn? – Dason May 02 '15 at 14:43
  • function knn from class package. – Ronak Shah May 02 '15 at 16:36
  • I believe if you store it as factor instead of character (and in this case it really probably should be a factor anyways) that you don't need to do an explicit conversion to numeric. This would be the better approach to take. – Dason May 02 '15 at 17:35

4 Answers

6

You can use as.numeric(factor(.)), like this:

> Letter <- c("A", "A", "B", "C", "B", "A")
> as.numeric(factor(Letter))
[1] 1 1 2 3 2 1

To add the result as a column, use the standard mydf$outcome <- ... assignment or whichever approach you prefer.
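As a minimal sketch of the full workflow, the data frame from the question can be recreated and the column assigned directly. The name `dat` is an assumption made for illustration. Note that factor() numbers the sorted unique values, so the codes follow alphabetical order rather than order of first appearance; the two happen to coincide for the question's data.

```r
# Data frame recreated from the question (the name `dat` is an assumption).
dat <- data.frame(No = 1:7,
                  Alphabet = c("A", "B", "A", "A", "C", "B", "C"),
                  stringsAsFactors = FALSE)

# factor() assigns integer codes to the sorted unique values;
# as.numeric() extracts those codes as plain numbers.
dat$Outcome <- as.numeric(factor(dat$Alphabet))
dat$Outcome
# [1] 1 2 1 1 3 2 3

# Codes follow the sorted levels, not the order of first appearance.
reordered <- as.numeric(factor(c("B", "A", "B")))
reordered
# [1] 2 1 2
```

If the numbering has to follow order of first appearance instead, factor() alone is not enough, because "A" always sorts before "B".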

A5C1D2H2I1M1N2O1R2T1
4

You could also do

library(data.table)
setDT(df1)[, Outcome:= .GRP, Alphabet][]
#    No. Alphabet Outcome
#1:   1        A       1
#2:   2        B       2
#3:   3        A       1
#4:   4        A       1
#5:   5        C       3
#6:   6        B       2
#7:   7        C       3

Benchmarks

library(fastmatch)
set.seed(24)
df2 <- data.frame(No = 1:1e7, Alphabet= sample(LETTERS, 1e7, 
            replace=TRUE), stringsAsFactors=FALSE)
df3 <- copy(df2)
Ananda <- function() {transform(df2, 
             outcome = as.numeric(factor(df2$Alphabet)))}
Brodie <- function() {transform(df2, outcome=match(Alphabet, Alphabet))}
Brodie2 <- function(){transform(df2, outcome=fmatch(Alphabet, Alphabet))}

akrun <- function() {setDT(df3)[, Outcome:= .GRP, Alphabet][]}

library(microbenchmark)
microbenchmark(Ananda(), Brodie(), Brodie2(), akrun(), 
                    unit='relative', times=20L)
# Unit: relative
#    expr      min       lq     mean   median       uq      max neval cld
# Ananda() 4.957064 5.150724 4.427514 4.971581 3.336064 4.622502    20   c
# Brodie() 4.473689 5.074105 4.838985 5.383722 4.641304 4.383919    20   c
#Brodie2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20 a  
#  akrun() 1.609863 2.047646 1.665557 1.949590 1.331554 1.290921    20  b 


 system.time(akrun())
 #  user  system elapsed 
 # 0.197   0.005   0.202 

 system.time(Brodie2())
 #  user  system elapsed 
 # 0.081   0.014   0.095 
akrun
  • Thanks for running through all the benchmarks, very helpful (+1, though you got that a while ago). – BrodieG May 01 '15 at 20:55
3

Another option (for fun) using match:

match(Alphabet, Alphabet)

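match returns the position of the first match only, so every distinct value gets its own number. Those numbers are first-occurrence positions, though, not consecutive integers (with 26 letters they will not be 1:26). If they must be consecutive, and not just unique: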

match(Alphabet, unique(Alphabet))
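To make the difference concrete, here is a small sketch using the seven values from the question. The names `x`, `first_pos` and `consecutive` are assumptions made for illustration.

```r
# The Alphabet column from the question as a plain character vector.
x <- c("A", "B", "A", "A", "C", "B", "C")

# Matching a vector against itself returns, for each element, the
# position of its first occurrence: unique per value, but with gaps.
first_pos <- match(x, x)
first_pos
# [1] 1 2 1 1 5 2 5

# Matching against the unique values returns consecutive group numbers,
# in order of first appearance.
consecutive <- match(x, unique(x))
consecutive
# [1] 1 2 1 1 3 2 3
```

"C" first appears at position 5, which is why the first version returns 5 for it, while the second returns 3 because "C" is the third distinct value.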

To actually do what you want (adding a column in data frame, etc.):

transform(DF, outcome=match(Alphabet, Alphabet))

Or

transform(DF, outcome=match(Alphabet, unique(Alphabet)))

Or you can use fmatch from library(fastmatch), a faster version of match:

library(fastmatch)
transform(DF, outcome=fmatch(Alphabet, unique(Alphabet)))
#  No. Alphabet outcome
#1   1        A       1
#2   2        B       2
#3   3        A       1
#4   4        A       1
#5   5        C       3
#6   6        B       2
#7   7        C       3

This is actually a little faster than the factor version:

> x <- sample(letters, 1e5, replace=TRUE)
> library(microbenchmark)
> microbenchmark(as.numeric(factor(x)), match(x, x))
Unit: milliseconds
                  expr     min       lq     mean   median       uq      max neval
 as.numeric(factor(x)) 4.68927 4.792212 9.042732 4.915268 5.175275 64.65473   100
           match(x, x) 3.55855 3.617609 6.981944 3.731522 3.922048 53.07911   100

most likely because factor internally uses something like match(x, unique(x)) anyway.

akrun
BrodieG
2

Let's say your data frame is called dat. Then you can do

dat$Outcome <- as.numeric(as.factor(dat$Alphabet))
blakeoft