
I am very new to R and to programming in general. I've just begun to learn how to use for loops, but I can't figure out how to get the variable I want to print as part of my dataframe.

I have data that look like this:

Place  Sex  Length
A      M    32
A      M    33
A      F    35
A      F    35
A      F    35
A      F    39
B      M    30
B      F    25
B      F    28
B      F    28

I would like to create a fourth variable in my dataframe that numbers the rows within each Place/Sex/Length combination, so that each individual has a Place/Sex/Length/ID combination that is unique to that line of data. My data would then look like this:

Place  Sex  Length Ind
A      M    32     1
A      M    33     1
A      F    35     1
A      F    35     2
A      F    35     3
A      F    39     1
B      M    30     1
B      F    25     1
B      F    28     1
B      F    28     2

Thank you in advance for any suggestions. I've been searching for a while for some help on how to do this with no luck.

Arun
user2145867

3 Answers


One (of many) ways is to use ave in base R, as follows (assuming a data.frame named "temp"):

within(temp, {
  ID <- ave(as.character(interaction(temp)), 
            interaction(temp), FUN = seq_along)
})
#    Place Sex Length ID
# 1      A   M     32  1
# 2      A   M     33  1
# 3      A   F     35  1
# 4      A   F     35  2
# 5      A   F     35  3
# 6      A   F     39  1
# 7      B   M     30  1
# 8      B   F     25  1
# 9      B   F     28  1
# 10     B   F     28  2

Try running interaction(temp) to get an idea of what it is doing.
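As a rough illustration of that suggestion, the sketch below rebuilds the question's sample data (the construction with `rep()` and `c()` is my own, not from the answer) and shows what `interaction()` returns: one label per row, built from that row's Place, Sex and Length values.

```r
# Sample data from the question, rebuilt by hand for this sketch
temp <- data.frame(
  Place  = rep(c("A", "B"), c(6, 4)),
  Sex    = c("M", "M", "F", "F", "F", "F", "M", "F", "F", "F"),
  Length = c(32, 33, 35, 35, 35, 39, 30, 25, 28, 28)
)

# One factor value per row, with the three columns joined by "."
grp <- interaction(temp)
as.character(grp)
# The first row becomes "A.M.32"; rows 3 to 5 all become "A.F.35"

# Number of rows that share each combination present in the data
counts <- table(as.character(grp))
counts[["A.F.35"]]  # 3
```

ave() then applies seq_along separately within each of these groups, which is what produces the 1, 2, 3 numbering for the three "A.F.35" rows.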

A5C1D2H2I1M1N2O1R2T1

Another way:

# assuming the data.frame is already sorted by 
# all three columns (unfortunately, this is a requirement)
sequence(rle(do.call(paste, df))$lengths)
# [1] 1 1 1 2 3 1 1 1 1 2

Break down:

do.call(paste, df) # pastes each row of df together with default separator "space"
#  [1] "A M 32" "A M 33" "A F 35" "A F 35" "A F 35" "A F 39" "B M 30" "B F 25" "B F 28"
# [10] "B F 28"

rle(.) # gets the run length vector 
# Run Length Encoding
#   lengths: int [1:7] 1 1 3 1 1 1 2
#   values : chr [1:7] "A M 32" "A M 33" "A F 35" "A F 39" "B M 30" "B F 25" "B F 28"

$lengths # get the run-lengths (as opposed to values)
# [1] 1 1 3 1 1 1 2

sequence(.) # get 1:n for each n 
# [1] 1 1 1 2 3 1 1 1 1 2
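To see why the sorting requirement matters, here is a small sketch on a deliberately shuffled data.frame (the six rows are made up for illustration). It compares the rle approach with a variant that groups by value using ave, which is a different technique from the one broken down above and does not depend on row order.

```r
# Hypothetical unsorted data: equal rows are not all adjacent
df <- data.frame(
  Place  = c("B", "A", "A", "B", "A", "A"),
  Sex    = c("F", "F", "M", "F", "F", "F"),
  Length = c(28, 35, 32, 28, 35, 35)
)

key <- do.call(paste, df)

# rle only sees runs of adjacent equal rows, so separated duplicates restart at 1
id_rle <- sequence(rle(key)$lengths)
id_rle
# [1] 1 1 1 1 1 2

# ave groups by value, so row order does not matter
id_ave <- ave(seq_along(key), key, FUN = seq_along)
id_ave
# [1] 1 1 1 2 2 3
```

On sorted input the two give the same numbering; on unsorted input only the ave variant numbers every duplicate of a combination.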

Benchmarking:

Since there are quite a few solutions, I thought I'd benchmark them on a relatively large data.frame. Here are the results (I also added a data.table solution).

Here's the data:

require(data.table)
require(plyr)
set.seed(45)

length <- 1e3 # number of rows in `df`
df <- data.frame(Place = sample(letters[1:20], length, replace=T), 
                 Sex = sample(c("M", "F"), length, replace=T), 
                 Length = sample(1:75, length, replace=T))
df <- df[with(df, order(Place, Sex, Length)), ]

Ananda's ave solution:

AVE_FUN <- function(x) {
    i <- interaction(x)
    within(x, {
        ID <- ave(as.character(i), i, FUN = seq_along)
    })
}

Arun's rle solution:

RLE_FUN <- function(x) {
    # use the argument `x` here, not the global `df`
    transform(x, ID = sequence(rle(do.call(paste, x))$lengths))
}

Ben's plyr solution:

PLYR_FUN <- function(x) {
    ddply(x, c("Place", "Sex", "Length"), transform, ID = seq_along(Length))
}

Finally, the data.table solution:

DT_FUN <- function(x) {
    dt <- data.table(x)
    dt[, ID := seq_along(.I), by=names(dt)]
}

Benchmarking code:

require(rbenchmark)
benchmark(d1 <- AVE_FUN(df), 
          d2 <- RLE_FUN(df), 
          d3 <- PLYR_FUN(df), 
          d4 <- DT_FUN(df), 
          replications = 5, order = "elapsed")

Results:

With length = 1e3 (number of rows in data.frame df)

#                 test replications elapsed relative user.self 
# 2  d2 <- RLE_FUN(df)            5   0.013    1.000     0.013 
# 4   d4 <- DT_FUN(df)            5   0.017    1.308     0.016 
# 1  d1 <- AVE_FUN(df)            5   0.052    4.000     0.052 
# 3 d3 <- PLYR_FUN(df)            5   4.629  356.077     4.452 

With length = 1e4:

#                test replications elapsed relative user.self
# 4   d4 <- DT_FUN(df)            5   0.033    1.000     0.031
# 2  d2 <- RLE_FUN(df)            5   0.089    2.697     0.088
# 1  d1 <- AVE_FUN(df)            5   0.102    3.091     0.100
# 3 d3 <- PLYR_FUN(df)            5  23.103  700.091    20.659

With length = 1e5:

#                test replications elapsed relative user.self
# 4   d4 <- DT_FUN(df)            5   0.179    1.000     0.130
# 1  d1 <- AVE_FUN(df)            5   1.001    5.592     0.940
# 2  d2 <- RLE_FUN(df)            5   1.098    6.134     1.011
# 3 d3 <- PLYR_FUN(df)            5 219.861 1228.274   147.545

Observation: As the data grow, data.table (not surprisingly) does best and scales very well, while ave and rle are close competitors for second place (ave scales better than rle). plyr performs quite badly on all datasets, unfortunately.

Note: Ananda's solution gives character output and I kept it as such in the benchmarking.
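A minimal sketch of what that note means in practice, using the question's sample data (rebuilt by hand here): because ave() is given a character vector as its first argument, the IDs come back as strings. Converting afterwards, or passing an integer vector instead, gives integer IDs. The variable names below are my own.

```r
temp <- data.frame(
  Place  = rep(c("A", "B"), c(6, 4)),
  Sex    = c("M", "M", "F", "F", "F", "F", "M", "F", "F", "F"),
  Length = c(32, 33, 35, 35, 35, 39, 30, 25, 28, 28)
)
i <- interaction(temp)

# As in the benchmarked function: character input gives character output
id_chr <- ave(as.character(i), i, FUN = seq_along)
is.character(id_chr)  # TRUE

# Option 1: convert the result afterwards
id_int1 <- as.integer(id_chr)

# Option 2: give ave() an integer vector to fill in
id_int2 <- ave(seq_along(i), i, FUN = seq_along)

identical(id_int1, id_int2)  # TRUE
```

Either option would add a small cost or saving to the ave timings above; that was not measured here.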

Arun
  • Don't include the creation of the data.table in the function. `dt[,ID := seq_len(.N), by = names(DT)]` may also be faster – mnel Mar 07 '13 at 22:48
  • @mnel, I remember MatthewDowle mentioning something similar to include `, key(.)` to be fair [here](http://stackoverflow.com/questions/15182888/complicated-reshaping). I assumed it should include creating the data.table under the comments to Ricardo's benchmarking. – Arun Mar 07 '13 at 22:55
  • if you want to take advantage of the sorting and speed-up from setting the key, then include the setkey(), but you aren't in this case, so I think it is unfair overhead (as opposed to fair overhead if you were setting the key) – mnel Mar 07 '13 at 22:59
  • The speed seems to be almost the same (0.149 with .N, 0.151 seconds with .I for 1e5 rows). The results are identical (sorry for the earlier wrong report). – Arun Mar 07 '13 at 23:12
  • @Arun, thanks for the benchmarks. Pretty interesting. `ave` *might* scale even faster if your `do.call(paste...` approach is used instead of `interaction`, or, specifying `drop = TRUE` in `interaction` to drop unused factor levels. – A5C1D2H2I1M1N2O1R2T1 Mar 08 '13 at 05:15
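The `drop = TRUE` suggestion in the last comment can be illustrated with a sketch like the one below, which regenerates data in the same way as the benchmark (using `n` instead of `length` as the size variable, my own choice). It only shows that dropping unused levels shrinks the grouping factor without changing the IDs; whether it is actually faster is not measured here.

```r
set.seed(45)
n <- 1000
df <- data.frame(Place  = sample(letters[1:20], n, replace = TRUE),
                 Sex    = sample(c("M", "F"), n, replace = TRUE),
                 Length = sample(1:75, n, replace = TRUE))

# Default: every possible combination is a level (up to 20 * 2 * 75 = 3000)
g_all <- interaction(df)
# drop = TRUE: only combinations that occur in the data (at most n)
g_used <- interaction(df, drop = TRUE)

nlevels(g_all)
nlevels(g_used)

# The resulting IDs are the same either way
id_all  <- ave(seq_len(n), g_all,  FUN = seq_along)
id_used <- ave(seq_len(n), g_used, FUN = seq_along)
identical(id_all, id_used)  # TRUE
```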

The inevitable plyr solution.

Get data:

temp <- read.table(text="
Place  Sex  Length
A      M    32
A      M    33
A      F    35
A      F    35
A      F    35
A      F    39
B      M    30
B      F    25
B      F    28
B      F    28",
header=TRUE)

Load package and Do It:

library("plyr")
ddply(temp,c("Place","Sex","Length"),transform,ID=seq_along(Length))

The order has changed (you can use arrange() to re-order it if you want), but the variables should be right:

##    Place Sex Length ID
## 1      A   F     35  1
## 2      A   F     35  2
## 3      A   F     35  3
## 4      A   F     39  1
## 5      A   M     32  1
## 6      A   M     33  1
## 7      B   F     25  1
## 8      B   F     28  1
## 9      B   F     28  2
## 10     B   M     30  1
Ben Bolker
  • I just benchmarked on relatively bigger data.frames.. and `plyr` seems to perform quite badly. Please check my post for the results. – Arun Mar 07 '13 at 21:40
  • It's nice to have the benchmarks -- but it's also not really surprising. `plyr` was designed (AIUI) for conceptual simplicity and convenience, not necessarily raw speed. Unlike many of the SO user base, I don't live in a Big Data world -- I can live with 5 seconds for a 1000-row data frame ... but `data.table` and base-R solutions are certainly worth considering if one does have a need for speed ... – Ben Bolker Mar 07 '13 at 22:22